COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, ivv bugs, epics ...)

[BUG] [OPS] Compression task interrupted by the pod's end of life leaves the compression stuck for the product. #1037

Open suberti-ads opened 1 year ago

suberti-ads commented 1 year ago

Environment:

- Delivery tag: release/1.13.2
- Platform: OPS Orange Cloud
- Configuration: processing common 1.13-2-rc1, processing compression 1.13.1-rc1

Traceability:

Current Behavior: A compression task interrupted by its pod's end of life (scaler or another root cause) leaves the compression stuck for the product, which is never sent to the PRIP.

Expected Behavior: A restart should work if the compression task did not finish successfully.

Steps To Reproduce: This issue occurred when the task was interrupted after the zip file had been written to the compression bucket but before the Kafka message was generated.

Test execution artefacts (i.e. logs, screenshots…): sample for S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002 (see below).
First execution, with pod compression-part2-compression-worker-medium-v44-c89d56b94-kngdq: Explore-logs-2023-07-13 11_46_37.txt
Second execution, with pod compression-part2-compression-worker-medium-v44-c89d56b94-k746q: Explore-logs-2023-07-13 11_43_54.txt

Whenever possible, first analysis of the root cause: On about 3% of the production we have some "already ingested" errors. For all of them the product is successfully written to the bucket, but we observe that the product is missing in the PRIP. For all products missing in the PRIP, the md5sum file is missing in the bucket:

suberti@refsys-client:~/Documents/investigation$ for i in $(cat ListProduct.txt); do echo $i ; s3cmd ls s3://ops-rs-s3-l1-nrt-zip/$i; s3cmd ls s3://ops-rs-s3-l1-ntc-zip/$i ;s3cmd ls s3://ops-rs-s3-l1-stc-zip/$i;s3cmd ls s3://ops-rs-pug-zip/$i; done;
S3A_OL_1_EFR____20230409T204847_20230409T213258_20230615T164903_2651_097_271______LN3_D_NT_002.SEN3
2023-06-16 10:46 24524417420   s3://ops-rs-s3-l1-ntc-zip/S3A_OL_1_EFR____20230409T204847_20230409T213258_20230615T164903_2651_097_271______LN3_D_NT_002.SEN3.zip
2023-06-16 10:49       174   s3://ops-rs-s3-l1-ntc-zip/S3A_OL_1_EFR____20230409T204847_20230409T213258_20230615T164903_2651_097_271______LN3_D_NT_002.SEN3.zip.md5sum
S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3
2023-07-08 21:59 990647030   s3://ops-rs-pug-zip/S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3.zip
S3A_SL_1_RBT____20230708T204409_20230708T204909_20230709T034243_0299_101_014______LN3_D_NR_002.SEN3
2023-07-09 04:00 2395914978   s3://ops-rs-s3-l1-nrt-zip/S3A_SL_1_RBT____20230708T204409_20230708T204909_20230709T034243_0299_101_014______LN3_D_NR_002.SEN3.zip
S3A_SY_1_MISR___20230708T183926_20230708T184126_20230711T181203_0119_101_013______LN3_D_ST_002.SEN3
2023-07-11 19:28 204775989   s3://ops-rs-s3-l1-stc-zip/S3A_SY_1_MISR___20230708T183926_20230708T184126_20230711T181203_0119_101_013______LN3_D_ST_002.SEN3.zip
S3B_OL_1_EFR____20230409T182835_20230409T191245_20230615T230003_2650_078_127______LN3_D_NT_002.SEN3
2023-06-16 09:34 24516277392   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T182835_20230409T191245_20230615T230003_2650_078_127______LN3_D_NT_002.SEN3.zip
2023-06-16 09:37       174   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T182835_20230409T191245_20230615T230003_2650_078_127______LN3_D_NT_002.SEN3.zip.md5sum
S3B_OL_1_EFR____20230409T200933_20230409T205344_20230615T135639_2651_078_128______LN3_D_NT_002.SEN3
2023-06-16 09:48 24524416278   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T200933_20230409T205344_20230615T135639_2651_078_128______LN3_D_NT_002.SEN3.zip
2023-06-16 09:51       174   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T200933_20230409T205344_20230615T135639_2651_078_128______LN3_D_NT_002.SEN3.zip.md5sum
S3B_OL_1_EFR____20230409T215032_20230409T223442_20230615T195158_2650_078_129______LN3_D_NT_002.SEN3
2023-06-16 09:20 24516277419   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T215032_20230409T223442_20230615T195158_2650_078_129______LN3_D_NT_002.SEN3.zip
2023-06-16 09:23       174   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T215032_20230409T223442_20230615T195158_2650_078_129______LN3_D_NT_002.SEN3.zip.md5sum
S3B_OL_1_EFR____20230708T215428_20230708T215628_20230709T012526_0119_081_257______LN3_D_NR_002.SEN3
2023-07-09 01:32 1111962491   s3://ops-rs-s3-l1-nrt-zip/S3B_OL_1_EFR____20230708T215428_20230708T215628_20230709T012526_0119_081_257______LN3_D_NR_002.SEN3.zip
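
For a bulk check, a minimal bash sketch along the same lines (assuming s3cmd is configured as in the loop above; the bucket names are the ones used in this ticket) that flags every zip without its companion md5sum file:

```bash
# For each zip bucket of this ticket, print the zips that have no .md5sum
# companion object, i.e. the candidates for this "stuck compression" issue.
for bucket in ops-rs-s3-l1-nrt-zip ops-rs-s3-l1-ntc-zip ops-rs-s3-l1-stc-zip ops-rs-pug-zip; do
  s3cmd ls "s3://$bucket/" | awk '{print $4}' | grep '\.zip$' | while read -r zip; do
    # s3cmd ls prints nothing when the md5sum object does not exist.
    if [ -z "$(s3cmd ls "${zip}.md5sum")" ]; then
      echo "missing md5sum: $zip"
    fi
  done
done
```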

Products found in the PRIP:


S3A_OL_1_EFR____20230409T204847_20230409T213258_20230615T164903_2651_097_271______LN3_D_NT_002.SEN3 ==> OK
S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3 ==> NOK
S3A_SL_1_RBT____20230708T204409_20230708T204909_20230709T034243_0299_101_014______LN3_D_NR_002.SEN3 ==> NOK
S3A_SY_1_MISR___20230708T183926_20230708T184126_20230711T181203_0119_101_013______LN3_D_ST_002.SEN3 ==> NOK
S3B_OL_1_EFR____20230409T182835_20230409T191245_20230615T230003_2650_078_127______LN3_D_NT_002.SEN3 ==> OK
S3B_OL_1_EFR____20230409T200933_20230409T205344_20230615T135639_2651_078_128______LN3_D_NT_002.SEN3 ==> OK
S3B_OL_1_EFR____20230409T215032_20230409T223442_20230615T195158_2650_078_129______LN3_D_NT_002.SEN3 ==> OK
S3B_OL_1_EFR____20230708T215428_20230708T215628_20230709T012526_0119_081_257______LN3_D_NR_002.SEN3 ==> NOK

For the investigation I take the input product S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3 as a sample.

First execution, on pod compression-part2-compression-worker-medium-v44-c89d56b94-kngdq: we observe that this was the last input executed on this pod. Last log seen for it:

2023-07-08T21:59:54+00:00   {"header":{"type":"LOG","timestamp":"2023-07-08T21:59:54.796577Z","level":"WARN","line":36,"file":"Retries.java","thread":"KafkaConsumerDestination{consumerDestinationName='compression-part2.priority-filter-medium', partitions=30, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Error on performing upload to ops-rs-pug-zip/S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3.zip.md5sum (2/11), retrying in 6000ms"},"custom":{"logger_string":"esa.s1pdgs.cpoc.common.utils.Retries"}}

So the pod was uploading the product to the bucket at this date. It matches the product date found in the bucket:

suberti@refsys-client:~/Documents/investigation$ s3cmd  ls   s3://ops-rs-pug-zip/S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3.zip
2023-07-08 21:59 990647030   s3://ops-rs-pug-zip/S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3.zip

So it appears the upload of the product itself completed successfully, but the pod did not finish uploading the md5sum file, sending the Kafka message to the PRIP, or acknowledging the current Kafka message. When another pod (compression-part2-compression-worker-medium-v44-c89d56b94-k746q) started the compression again for this product, the compression failed with an "already ingested" error.

Hereafter is the scheme; if the pod is stopped or killed during the red operation, we run into this issue:

[Figure: CompressionTask sequence diagram]
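
To make the critical (red) part of the scheme explicit, here is a minimal sketch of the worker's sequence, written with s3cmd for illustration only (the real worker is Java and uses the OBS SDK); the bucket and product names are the ones of this sample:

```bash
BUCKET=ops-rs-pug-zip
PRODUCT=S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3

s3cmd put "$PRODUCT.zip"        "s3://$BUCKET/$PRODUCT.zip"         # 1. upload the zip     -> done in this case
s3cmd put "$PRODUCT.zip.md5sum" "s3://$BUCKET/$PRODUCT.zip.md5sum"  # 2. upload the md5sum  -> pod killed while retrying this step
# 3. publish the Kafka message that makes the product visible to the PRIP      -> never happened
# 4. acknowledge the consumed Kafka message                                    -> never happened,
#    so the message is redelivered to another pod, which then fails with
#    "already ingested" because step 1 already left the zip in the bucket.
```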

To force the compression, the operator should delete the product and restart the error, so this issue is not blocking. Moreover, the scaler cooldown duration will be increased (from 10 minutes to 1 hour) to limit these cases.
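
A possible operator sketch for that recovery (an assumption on my side, not an agreed procedure: the orphaned zip is deleted only when its md5sum companion is missing; restarting the error message itself goes through the usual error management and is not shown here):

```bash
BUCKET=ops-rs-pug-zip
PRODUCT=S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3
ZIP="s3://$BUCKET/$PRODUCT.zip"

# Only treat the product as corrupted if the zip exists but the md5sum does not.
if [ -n "$(s3cmd ls "$ZIP")" ] && [ -z "$(s3cmd ls "$ZIP.md5sum")" ]; then
  s3cmd del "$ZIP"   # remove the orphaned zip so the retry is no longer rejected as "already ingested"
  echo "Orphaned zip deleted; the error can now be restarted."
fi
```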

To solve this issue, I see two possible ways:

Bug Generic Definition of Done (DoD)

pcuq-ads commented 1 year ago

IVV_CCB_2023_w29 : moved to WERUM dashboard for analysis.

Woljtek commented 1 year ago

Werum_CCB_2023_w29: The fix is not needed for phase 1. Move to Icebox. The code responsible for the upload check is shared with many other components.

w-fsi commented 1 year ago

The problem here seems to be that the scaler is killing the pods very aggressively. Normally I would expect an error rate well below 3% for this kind of scenario, as this processing happens very rarely. It should be checked whether this aggressive behaviour is really wanted, as it might also cause issues within other services.

The behaviour of denying a re-upload of data and of checking the md5sum file is intentional behaviour of the S1PRO system, to avoid that already produced products are accidentally overwritten. In particular, another recovery mechanism would be required, as it is not sufficient to just update something in the OBS: the PRIP must be updated as well, since it also exposes the MD5SUM. This mechanism was added to make sure that such scenarios are avoided.

The second proposal of @suberti-ads might be a way to approach this, as a missing md5sum file is a corrupted scenario on its own, and the product might therefore be safe to remove. However, when deleting the existing product or re-uploading it, the scaler might kill the pod again and cause other strange scenarios. Even if this is fixed, there is still the chance that the Kafka message cannot be published and the process is interrupted in between there as well.

Handling the upload is implemented in the OBS SDK, so changing the behaviour there would not just affect the compression worker, but all services that use the OBS. At the current state of the project this would be quite risky and is therefore not recommended. This behaviour might be modified after V2.

As agreed in the CCB yesterday, this issue is not supposed to result in a change in the software, but in an estimation of the impact. This has been provided, and thus this issue is for me a candidate for refusal. Please let me know if another assessment shall be done.

pcuq-ads commented 1 year ago

SYS_CCB_w29 : Tag checked is present. No more action awaited on this ticket. Moved to ON HOLD.