COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, ivv bugs, epics ...)

[BUG] S3-OL1-NTC execution worker failed. #1047

Closed pcuq-ads closed 1 year ago

pcuq-ads commented 1 year ago

Environment:

Traceability:

Current Behavior: 100% of S3-OL1-NTC processing jobs failed on the 3% production from last week (2023-07-24).

Expected Behavior: S3-OL1-NTC shall generate production without errors.

Steps To Reproduce: Check production from S3-OL1-NTC.

Test execution artefacts (e.g. logs, screenshots…): Logs are here: https://app.zenhub.com/files/398313496/0ec45e07-ad0d-4ee3-a0ec-c11291c86429/download

No error found on resources. No error found on bucket read/write.

Whenever possible, first analysis of the root cause: Hypothesis: regression in Metadata Extraction or Metadata Search Controller version 1.14-rc1?

w-jka commented 1 year ago
2023-07-24T05:40:25Z    {"header":{"type":"LOG","timestamp":"2023-07-24T05:40:25.875714Z","level":"INFO","line":129,"file":"TaskCallable.java","thread":"pool-298-thread-1"},"message":{"content":"Ending task /usr/local/components/S3IPF_OL1_06.13/bin/OL1.bin with exit code 255"},"custom":{"logger_string":"esa.s1pdgs.cpoc.ipf.execution.worker.job.process.TaskCallable"}}
2023-07-24T05:40:25Z    2023-07-24T05:40:25.856503 s3-ol1-ntc-part1-execution-worker-v6-759c85b556-ddgxf IPF-OL-1-EO 06.13 [0000003450]: [E] Got error after call at IPF-OL-1/src/ol1eo_processor.c:main:106
2023-07-24T05:40:25Z    2023-07-24T05:40:25.856480 s3-ol1-ntc-part1-execution-worker-v6-759c85b556-ddgxf IPF-OL-1-EO 06.13 [0000003450]: [E] Got error after call at IPF-OL-1/src/ol1eo_processing.c:ol1eo_processing:442
2023-07-24T05:40:25Z    2023-07-24T05:40:25.856470 s3-ol1-ntc-part1-execution-worker-v6-759c85b556-ddgxf IPF-OL-1-EO 06.13 [0000003450]: [E] Processing step 'ol1co_adf_cal_load' ended in failure
2023-07-24T05:40:25Z    2023-07-24T05:40:25.856459 s3-ol1-ntc-part1-execution-worker-v6-759c85b556-ddgxf IPF-OL-1-EO 06.13 [0000003450]: [E] Got error after call at IPF-OL-1/src/ol1co_adf_cal.c:ol1co_adf_cal_load:192
2023-07-24T05:40:25Z    2023-07-24T05:40:25.856445 s3-ol1-ntc-part1-execution-worker-v6-759c85b556-ddgxf IPF-OL-1-EO 06.13 [0000003450]: [E] Got error after call at common/libs3ipf_packing/src/netcdf_util.c:ncutil_inq_varinfo:73
2023-07-24T05:40:25Z    2023-07-24T05:40:25.856389 s3-ol1-ntc-part1-execution-worker-v6-759c85b556-ddgxf IPF-OL-1-EO 06.13 [0000003450]: [E] NetCDF: Variable not found

Nothing in the logs provided indicates that the root cause is in our application. The root cause seems to be in the libraries of the IPF.
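
The log excerpt above is listed newest first, so the originating error is on the last line. As a rough aid for reading such excerpts, here is a minimal sketch that extracts the IPF error lines and prints them in chronological order. The regular expression is an assumption derived only from the lines quoted in this comment, not from any documented log format, and `error_chain` is a hypothetical helper name.

```python
import re

# Assumed layout, inferred from the excerpt in this issue only:
# <ingest time> <IPF time> <pod> <processor> <version> [<pid>]: [<level>] <message>
LINE_RE = re.compile(
    r"^\S+\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pod>\S+)\s+(?P<processor>\S+)\s+(?P<version>\S+)\s+"
    r"\[(?P<pid>\d+)\]:\s+\[(?P<level>[A-Z])\]\s+(?P<message>.*)$"
)

# Three lines copied from the excerpt above (newest first, as logged).
SAMPLE = """\
2023-07-24T05:40:25Z    2023-07-24T05:40:25.856459 s3-ol1-ntc-part1-execution-worker-v6-759c85b556-ddgxf IPF-OL-1-EO 06.13 [0000003450]: [E] Got error after call at IPF-OL-1/src/ol1co_adf_cal.c:ol1co_adf_cal_load:192
2023-07-24T05:40:25Z    2023-07-24T05:40:25.856445 s3-ol1-ntc-part1-execution-worker-v6-759c85b556-ddgxf IPF-OL-1-EO 06.13 [0000003450]: [E] Got error after call at common/libs3ipf_packing/src/netcdf_util.c:ncutil_inq_varinfo:73
2023-07-24T05:40:25Z    2023-07-24T05:40:25.856389 s3-ol1-ntc-part1-execution-worker-v6-759c85b556-ddgxf IPF-OL-1-EO 06.13 [0000003450]: [E] NetCDF: Variable not found
"""


def error_chain(text):
    """Return (timestamp, message) pairs for [E] lines, oldest first."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        # Lines in another layout (e.g. the JSON wrapper line) are skipped.
        if match and match.group("level") == "E":
            entries.append((match.group("ts"), match.group("message")))
    # Timestamps share one fixed-width format, so string order is time order.
    entries.sort(key=lambda entry: entry[0])
    return entries


chain = error_chain(SAMPLE)
for ts, message in chain:
    print(ts, message)
```

On these three lines, the first entry printed is the "NetCDF: Variable not found" error, which is the earliest one in the chain and is raised inside the IPF's NetCDF helper code.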

pcuq-ads commented 1 year ago

According to Sylvain, the issue is linked to AUX data baseline.

The error in the log, [E] Got error after call at IPF-OL-1/src/ol1co_adf_cal.c:ol1co_adf_cal_load:192, shows that AUX data intended for the CentOS 7 version was used. It is not compliant with the CentOS 6 chain.

Here is the AUX data file in the job order that is the root cause of the issue: S3A_OL_1_EO__AX_20160425T103700_20991231T235959_20230613T120000___________________MPC_O_AL_015.SEN3

We will handle this issue with OPS.

suberti-ads commented 1 year ago

The following AUX data were ingested for the preint test of v1.14 (CentOS 7):

-rw-r--r--    1 root     root       4842217 Jul 17 15:38 S3A_OL_1_CAL_AX_20230620T000000_20991231T235959_20230616T120000___________________MPC_O_AL_028.SEN3.tgz
-rw-r--r--    1 root     root          9142 Jul 17 15:38 S3A_OL_1_EO__AX_20160425T103700_20991231T235959_20230613T120000___________________MPC_O_AL_015.SEN3.tgz
-rw-r--r--    1 root     root       4938869 Jul 17 15:39 S3B_OL_1_CAL_AX_20230620T000000_20991231T235959_20230616T120000___________________MPC_O_AL_018.SEN3.tgz
-rw-r--r--    1 root     root          9124 Jul 17 15:39 S3B_OL_1_EO__AX_20180618T000000_20991231T235959_20230613T120000___________________MPC_O_AL_009.SEN3.tgz

The products have been deleted from the catalog and production has been restarted; 5 executions are done. We can close this issue as an incident caused by wrong auxiliary data.
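
To make it easier to spot which AUX files belong to a given delivery, the file names listed above can be split into their fields. The sketch below is illustrative only: the field layout (mission, data source, level, 6-character type, validity start, validity stop, creation date) is my reading of the Sentinel-3 naming convention as it appears in these four names, and `parse_aux_name` is a hypothetical helper, not part of COPRS.

```python
import re
from datetime import datetime

# Assumed field layout, matched against the leading part of the name only.
NAME_RE = re.compile(
    r"^(?P<mission>S3[AB_])_(?P<source>[A-Z0-9_]{2})_(?P<level>[0-9_])_"
    r"(?P<type>[A-Z0-9_]{6})_"
    r"(?P<start>\d{8}T\d{6})_(?P<stop>\d{8}T\d{6})_(?P<created>\d{8}T\d{6})_"
)

# File names copied from the listing above.
NAMES = [
    "S3A_OL_1_CAL_AX_20230620T000000_20991231T235959_20230616T120000___________________MPC_O_AL_028.SEN3.tgz",
    "S3A_OL_1_EO__AX_20160425T103700_20991231T235959_20230613T120000___________________MPC_O_AL_015.SEN3.tgz",
    "S3B_OL_1_CAL_AX_20230620T000000_20991231T235959_20230616T120000___________________MPC_O_AL_018.SEN3.tgz",
    "S3B_OL_1_EO__AX_20180618T000000_20991231T235959_20230613T120000___________________MPC_O_AL_009.SEN3.tgz",
]


def parse_aux_name(name):
    """Split an AUX file name into its leading fields; dates become datetimes."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected AUX file name: {name}")
    fields = match.groupdict()
    for key in ("start", "stop", "created"):
        fields[key] = datetime.strptime(fields[key], "%Y%m%dT%H%M%S")
    return fields


for name in NAMES:
    info = parse_aux_name(name)
    print(info["mission"], info["type"], "created", info["created"].date())
```

All four files carry creation dates in June 2023, which is one simple criterion for locating the files from this delivery in the catalog.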

pcuq-ads commented 1 year ago

SYS_CCB_W30: this is an incident. The issue is closed.