COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, ivv bugs, epics ...)

[BUG] SR1 1.14.0 - Some jobs failed with the error "CFI Orbit interpolation failed." #1029

Open Woljtek opened 1 year ago

Woljtek commented 1 year ago

Environment:

Traceability:

Current Behavior: During the NON-REGRESSION test with PREINT, we observed that several executions ended with the following error:

2023-07-07T13:37:43.565844 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [I] NAVATT Reader: The NAVATT cover the processing window.
2023-07-07T13:37:43.570228 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [W] Using alternative orbit file: [1]: /data/localWD/13245/Orbit_Scratch.EEF
2023-07-07T13:37:43.570278 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [I] Get mission name (may be different from S3 for test purposes)
2023-07-07T13:37:43.570295 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [I] Mission name key is S3
2023-07-07T13:37:43.570787 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [I] Satellite ID (3A: 129 - 3B: 130 - 3C: 131 - CRYOSAT: 41 - ENVISAT: 21): 129
2023-07-07T13:37:43.573012 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [E] CFI Orbit interpolation failed.
2023-07-07T13:37:43.573053 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [E] Message error 1: Fatal Error in: OrbitId::init
2023-07-07T13:37:43.573065 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [E] Message error 2: EXPLORER_ORBIT >>> WARNING in xo_orbit_init_file: Warnings while computing ANX data
2023-07-07T13:37:43.573125 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [E] Pre Processor task FAILED
2023-07-07T13:37:43.573147 s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k SR1 07.04 [0000001729]: [E] Exiting with EXIT CODE 136

Expected Behavior: The addon shall be able to compute all products (it worked with 1.13.x)

Steps To Reproduce: Run the PREINT procedure with the dataset s3://ops-rs-preint/s3/NRT/S3-SR1/input-data/

Test execution artefacts (i.e. logs, screenshots…): image.png

Each failed job is restarted 3 times before being discarded (8 jobs in error).

An example NOK Job: (from s3://ops-rs-failed-workdir/s3-sr1-nrt-preint-part2-execution-worker-v18-74b88978fc-g249k_S3B_SR_0_SRA__20230409T214430_20230409T215430_20230410T001919_0599_078_129____LN3_D_NR_002.SEN3_bf0677b7-e91e-46d2-9076-c7dea83f05ee_0/)

Full logs of EW: https://app.zenhub.com/files/398313496/deccd093-7653-4a8b-96b0-d818d54122ba/download

Bug Generic Definition of Ready (DoR)

Bug Generic Definition of Done (DoD)

suberti-ads commented 1 year ago

Here are 3 sample jobs for the failed processing:

Job 13297 job13297.log

Job created by: S3B_SR_0_SRA20230409T195322_20230409T200322_20230409T210750_0599_078_128__LN3_D_NR_002.SEN3
Input used:
S3B_SR_0_SRA__20230409T194322_20230409T195322_20230409T205329_0599_078_127LN3_D_NR_002.SEN3
S3B_SR_0_SRA20230409T195322_20230409T200322_20230409T210750_0599_078_128__LN3_D_NR_002.SEN3
S3B_SR_0_SRA__20230409T200322_20230409T200343_20230409T224829_0020_078_128LN3_D_NR_002.SEN3

Job 13341 job13341.log

Job created by: S3B_SR_0_SRA20230409T200322_20230409T200343_20230409T224829_0020_078_128__LN3_D_NR_002.SEN3
Input used:
S3B_SR_0_SRA__20230409T195322_20230409T200322_20230409T210750_0599_078_128LN3_D_NR_002.SEN3
S3B_SR_0_SRA20230409T200322_20230409T200343_20230409T224829_0020_078_128__LN3_D_NR_002.SEN3
S3B_SR_0_SRA__20230409T200343_20230409T201343_20230409T223709_0599_078_128LN3_D_NR_002.SEN3

Job 13342 job13342.log

Job created by: S3B_SR_0_SRA20230409T200343_20230409T201343_20230409T223709_0599_078_128__LN3_D_NR_002.SEN3
Input used:
S3B_SR_0_SRA__20230409T200322_20230409T200343_20230409T224829_0020_078_128LN3_D_NR_002.SEN3
S3B_SR_0_SRA20230409T200343_20230409T201343_20230409T223709_0599_078_128__LN3_D_NR_002.SEN3
S3B_SR_0_SRA__20230409T201343_20230409T202343_20230409T224428_0599_078_128LN3_D_NR_002.SEN3

suberti-ads commented 1 year ago

Republished from a previous message by @w-jka (deleted by mistake):

From the provided logs and AppDataJob extracts, I could not find any problem on our side. As Florian is on vacation this week, I do not have access to the processor documentation, so I cannot check whether the ICD of the processor contains any additional information regarding exit code 136.

From the logs the provided orbit files are fine and, while not first in priority, are listed by the new tasktable. The processor itself states that the files are good to go before running into an error. Based on this analysis the root cause of this issue seems to be in the IPF itself.

Woljtek commented 1 year ago

A PSC issue has been opened => https://esa-csc-gs.atlassian.net/browse/PSC-63 We are waiting for ESA's answer. I propose to move this issue to 'On Hold'.

LAQU156 commented 1 year ago

Werum_CCB_2023_w28 : Moved into "Refused Werum" to place it into "On hold" pipeline in CCB Board, waiting for ESA answer

w-fsi commented 1 year ago

@Woljtek: I agree with this approach. As @w-jka pointed out, this is unlikely to be an issue within our software: there was no change on our side, and the kind of error looks more like an issue within the IPF itself. By the usual shell convention, exit code 136 is 128 + 8, and signal 8 on Linux is SIGFPE. In C/C++ programs this signal might be caused by a floating-point exception, an integer division by zero, or an integer overflow. This is very likely an issue within the IPF, as its code is executed as a black box on our side.
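
To make the 128 + N reading of the exit code explicit, here is a minimal Python sketch. The helper name `describe_exit_code` is made up for illustration and is not part of COPRS or the IPF. It only decodes the convention: it cannot tell whether the IPF was really terminated by the signal or simply chose to exit with that value itself.

```python
import signal


def describe_exit_code(code):
    """Return the signal name for a shell-style exit code (128 + N), else None.

    Hypothetical helper for illustration only; the signal numbering is that
    of the platform running this script (on Linux, signal 8 is SIGFPE).
    """
    if code <= 128:
        return None  # ordinary exit status, not a 128 + N code
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


if __name__ == "__main__":
    print(describe_exit_code(136))
```

On a Linux host the call with 136 yields the SIGFPE name, which matches the interpretation given above.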

pcuq-ads commented 1 year ago

IVV_CCB_2023_w29: moved to accepted OPS. @SYTHIER-ADS, could you have a look at this issue?

SYTHIER-ADS commented 1 year ago

My understanding of the issue is that in degraded cases (missing ROE_AX and DO_0_NAV) the CFI uses TM_0_NAT, and in this case the initialisation of the orbit fails. This point is linked to the change of the version of EO CFI inside the CFI. I would suggest downgrading this anomaly to Major, as it only impacts a degraded case; note that the benchmark was performed using this version of the IPF without error (DO_0_NAV files were available). In parallel, an anomaly is to be created on the IPF.
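
As a rough way to check this hypothesis against the failed jobs, the sketch below classifies a job's input list as nominal or degraded. It is only an illustration: the function name `orbit_input_mode` is made up, and matching the product types as substrings of the input file names is an assumption, not something taken from the IPF or its tasktable.

```python
# Product-type tokens taken from the comment above. Matching them as
# substrings of the input file names is an assumption for illustration.
PREFERRED_ORBIT_TYPES = ("ROE_AX", "DO_0_NAV")
FALLBACK_ORBIT_TYPE = "TM_0_NAT"


def orbit_input_mode(input_names):
    """Return 'nominal', 'degraded' or 'unknown' for a job's input names.

    nominal:  at least one ROE_AX or DO_0_NAV input is present
    degraded: neither is present, but a TM_0_NAT input is
    unknown:  no orbit-related input was recognised
    """
    names = list(input_names)
    if any(t in name for name in names for t in PREFERRED_ORBIT_TYPES):
        return "nominal"
    if any(FALLBACK_ORBIT_TYPE in name for name in names):
        return "degraded"
    return "unknown"
```

If the hypothesis holds, every job failing with "CFI Orbit interpolation failed." should come out as degraded, and jobs with a DO_0_NAV input should not fail this way.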

pcuq-ads commented 1 year ago

System_CCB_2023-w30 : The issue is on CFI side for a degraded case. Priority reduced to major.

suberti-ads commented 6 months ago

4 new occurrences on SR1-NRT

2024-04-11T17:16:23+00:00   {"header":{"type":"LOG","timestamp":"2024-04-11T17:16:23.065928Z","level":"INFO","line":129,"file":"TaskCallable.java","thread":"pool-77-thread-1"},"message":{"content":"Ending task /usr/local/components/S3IPF_SR1_07.04/bin/SR_1_PRE.bin with exit code 136"},"custom":{"logger_string":"esa.s1pdgs.cpoc.ipf.execution.worker.job.process.TaskCallable"}}
2024-04-11T17:16:23+00:00   2024-04-11T17:16:23.063472 s3-sr1-nrt-part1-execution-worker-v16-7d4d859bc4-vnnkd SR1 07.04 [0000000334]: [E] Exiting with EXIT CODE 136
2024-04-11T17:16:23+00:00   2024-04-11T17:16:23.063449 s3-sr1-nrt-part1-execution-worker-v16-7d4d859bc4-vnnkd SR1 07.04 [0000000334]: [E] Pre Processor task FAILED
2024-04-11T17:16:23+00:00   2024-04-11T17:16:23.063392 s3-sr1-nrt-part1-execution-worker-v16-7d4d859bc4-vnnkd SR1 07.04 [0000000334]: [E] Message error 2: EXPLORER_ORBIT >>> WARNING in xo_orbit_init_file: Warnings while computing ANX data
2024-04-11T17:16:23+00:00   2024-04-11T17:16:23.063380 s3-sr1-nrt-part1-execution-worker-v16-7d4d859bc4-vnnkd SR1 07.04 [0000000334]: [E] Message error 1: Fatal Error in: OrbitId::init
2024-04-11T17:16:23+00:00   2024-04-11T17:16:23.063335 s3-sr1-nrt-part1-execution-worker-v16-7d4d859bc4-vnnkd SR1 07.04 [0000000334]: [E] CFI Orbit interpolation failed.

Note: CAMS ticket on this issue: 4118

suberti-ads commented 3 months ago

4 new occurrences on SR1-NRT

[code 290] [exitCode 136] [msg Task /usr/local/components/S3IPF_SR1_07.04/bin/SR_1_PRE.bin failed]

with following logs:


2024-07-03T19:40:31+00:00   {"header":{"type":"LOG","timestamp":"2024-07-03T19:40:31.011928Z","level":"INFO","line":129,"file":"TaskCallable.java","thread":"pool-17-thread-1"},"message":{"content":"Ending task /usr/local/components/S3IPF_SR1_07.04/bin/SR_1_PRE.bin with exit code 136"},"custom":{"logger_string":"esa.s1pdgs.cpoc.ipf.execution.worker.job.process.TaskCallable"}}
2024-07-03T19:40:31+00:00   2024-07-03T19:40:31.009483 s3-sr1-nrt-part1-execution-worker-v17-67c7779d6c-wpc6t SR1 07.04 [0000000122]: [E] Exiting with EXIT CODE 136
2024-07-03T19:40:31+00:00   2024-07-03T19:40:31.009458 s3-sr1-nrt-part1-execution-worker-v17-67c7779d6c-wpc6t SR1 07.04 [0000000122]: [E] Pre Processor task FAILED
2024-07-03T19:40:31+00:00   2024-07-03T19:40:31.009377 s3-sr1-nrt-part1-execution-worker-v17-67c7779d6c-wpc6t SR1 07.04 [0000000122]: [E] Message error 2: EXPLORER_ORBIT >>> WARNING in xo_orbit_init_file: Warnings while computing ANX data
2024-07-03T19:40:31+00:00   2024-07-03T19:40:31.009365 s3-sr1-nrt-part1-execution-worker-v17-67c7779d6c-wpc6t SR1 07.04 [0000000122]: [E] Message error 1: Fatal Error in: OrbitId::init
2024-07-03T19:40:31+00:00   2024-07-03T19:40:31.009317 s3-sr1-nrt-part1-execution-worker-v17-67c7779d6c-wpc6t SR1 07.04 [0000000122]: [E] CFI Orbit interpolation failed.

suberti-ads commented 1 week ago

3 new occurrences on SR1-NRT

[code 290] [exitCode 136] [msg Task /usr/local/components/S3IPF_SR1_07.04/bin/SR_1_PRE.bin failed]

with following ipf logs:

2024-10-10T15:30:11+00:00   2024-10-10T15:30:11.106494 s3-sr1-nrt-part1-execution-worker-v18-5db68d99d4-d9zg9 SR1 07.04 [0000000351]: [E] Exiting with EXIT CODE 136
2024-10-10T15:30:11+00:00   2024-10-10T15:30:11.106469 s3-sr1-nrt-part1-execution-worker-v18-5db68d99d4-d9zg9 SR1 07.04 [0000000351]: [E] Pre Processor task FAILED
2024-10-10T15:30:11+00:00   2024-10-10T15:30:11.106414 s3-sr1-nrt-part1-execution-worker-v18-5db68d99d4-d9zg9 SR1 07.04 [0000000351]: [E] Message error 2: EXPLORER_ORBIT >>> WARNING in xo_orbit_init_file: Warnings while computing ANX data
2024-10-10T15:30:11+00:00   2024-10-10T15:30:11.106402 s3-sr1-nrt-part1-execution-worker-v18-5db68d99d4-d9zg9 SR1 07.04 [0000000351]: [E] Message error 1: Fatal Error in: OrbitId::init
2024-10-10T15:30:11+00:00   2024-10-10T15:30:11.106363 s3-sr1-nrt-part1-execution-worker-v18-5db68d99d4-d9zg9 SR1 07.04 [0000000351]: [E] CFI Orbit interpolation failed.
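
To count such occurrences without reading the logs by hand, a small script along these lines can be used. This is a sketch only: the regular expression is inferred from the IPF log lines quoted in this issue, and the function name `count_orbit_failures` is made up for illustration.

```python
import re
from collections import Counter

# Layout inferred from the IPF log lines quoted in this issue:
# "<timestamp> <pod> <processor> <version> [<pid>]: [<level>] <message>"
# re.search is used so that an extra leading timestamp column is ignored.
IPF_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pod>\S+) (?P<proc>\S+) (?P<version>\S+) "
    r"\[(?P<pid>\d+)\]: \[(?P<level>[A-Z])\] (?P<msg>.*)"
)

SIGNATURE = "CFI Orbit interpolation failed."


def count_orbit_failures(lines):
    """Count error lines carrying the failure signature, per worker pod."""
    counts = Counter()
    for line in lines:
        match = IPF_LINE.search(line)
        if match and match["level"] == "E" and match["msg"].strip() == SIGNATURE:
            counts[match["pod"]] += 1
    return counts
```

For example, `count_orbit_failures(open("ew.log", encoding="utf-8"))` returns one counter entry per execution-worker pod; the file name `ew.log` is a placeholder.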