COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, ivv bugs, epics ...)

[BUG] [OPS] Compression task interrupted by the pod's end of life leaves the compression stuck for the product. #1037

Open suberti-ads opened 1 year ago

suberti-ads commented 1 year ago

Environment:

- Delivery tag: release/1.13.2
- Platform: OPS Orange Cloud
- Configuration: processing common 1.13-2-rc1, processing compression 1.13.1-rc1

Traceability:

Current Behavior: A compression task interrupted by its pod's end of life (scaler or another root cause) leaves the compression stuck for the product, which is never sent to the PRIP.

Expected Behavior: A restart should work if the compression task did not finish successfully.

Steps To Reproduce: This issue occurred when the task was interrupted after the zip file had been written to the compression bucket but before the Kafka message was generated.

Test execution artefacts (i.e. logs, screenshots…): sample for S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002 (see below).
First execution, with pod compression-part2-compression-worker-medium-v44-c89d56b94-kngdq: Explore-logs-2023-07-13 11_46_37.txt
Second execution, with pod compression-part2-compression-worker-medium-v44-c89d56b94-k746q: Explore-logs-2023-07-13 11_43_54.txt

Whenever possible, first analysis of the root cause: On about 3% of the production we have some "already ingested" errors. For all of them the product is successfully written to the bucket, but we observe that the product is missing in the PRIP. For all products missing in the PRIP, the md5sum file is missing in the bucket:

suberti@refsys-client:~/Documents/investigation$ for i in $(cat ListProduct.txt); do echo $i ; s3cmd ls s3://ops-rs-s3-l1-nrt-zip/$i; s3cmd ls s3://ops-rs-s3-l1-ntc-zip/$i ;s3cmd ls s3://ops-rs-s3-l1-stc-zip/$i;s3cmd ls s3://ops-rs-pug-zip/$i; done;
S3A_OL_1_EFR____20230409T204847_20230409T213258_20230615T164903_2651_097_271______LN3_D_NT_002.SEN3
2023-06-16 10:46 24524417420   s3://ops-rs-s3-l1-ntc-zip/S3A_OL_1_EFR____20230409T204847_20230409T213258_20230615T164903_2651_097_271______LN3_D_NT_002.SEN3.zip
2023-06-16 10:49       174   s3://ops-rs-s3-l1-ntc-zip/S3A_OL_1_EFR____20230409T204847_20230409T213258_20230615T164903_2651_097_271______LN3_D_NT_002.SEN3.zip.md5sum
S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3
2023-07-08 21:59 990647030   s3://ops-rs-pug-zip/S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3.zip
S3A_SL_1_RBT____20230708T204409_20230708T204909_20230709T034243_0299_101_014______LN3_D_NR_002.SEN3
2023-07-09 04:00 2395914978   s3://ops-rs-s3-l1-nrt-zip/S3A_SL_1_RBT____20230708T204409_20230708T204909_20230709T034243_0299_101_014______LN3_D_NR_002.SEN3.zip
S3A_SY_1_MISR___20230708T183926_20230708T184126_20230711T181203_0119_101_013______LN3_D_ST_002.SEN3
2023-07-11 19:28 204775989   s3://ops-rs-s3-l1-stc-zip/S3A_SY_1_MISR___20230708T183926_20230708T184126_20230711T181203_0119_101_013______LN3_D_ST_002.SEN3.zip
S3B_OL_1_EFR____20230409T182835_20230409T191245_20230615T230003_2650_078_127______LN3_D_NT_002.SEN3
2023-06-16 09:34 24516277392   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T182835_20230409T191245_20230615T230003_2650_078_127______LN3_D_NT_002.SEN3.zip
2023-06-16 09:37       174   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T182835_20230409T191245_20230615T230003_2650_078_127______LN3_D_NT_002.SEN3.zip.md5sum
S3B_OL_1_EFR____20230409T200933_20230409T205344_20230615T135639_2651_078_128______LN3_D_NT_002.SEN3
2023-06-16 09:48 24524416278   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T200933_20230409T205344_20230615T135639_2651_078_128______LN3_D_NT_002.SEN3.zip
2023-06-16 09:51       174   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T200933_20230409T205344_20230615T135639_2651_078_128______LN3_D_NT_002.SEN3.zip.md5sum
S3B_OL_1_EFR____20230409T215032_20230409T223442_20230615T195158_2650_078_129______LN3_D_NT_002.SEN3
2023-06-16 09:20 24516277419   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T215032_20230409T223442_20230615T195158_2650_078_129______LN3_D_NT_002.SEN3.zip
2023-06-16 09:23       174   s3://ops-rs-s3-l1-ntc-zip/S3B_OL_1_EFR____20230409T215032_20230409T223442_20230615T195158_2650_078_129______LN3_D_NT_002.SEN3.zip.md5sum
S3B_OL_1_EFR____20230708T215428_20230708T215628_20230709T012526_0119_081_257______LN3_D_NR_002.SEN3
2023-07-09 01:32 1111962491   s3://ops-rs-s3-l1-nrt-zip/S3B_OL_1_EFR____20230708T215428_20230708T215628_20230709T012526_0119_081_257______LN3_D_NR_002.SEN3.zip
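
For a bulk check, a minimal bash sketch along the same lines (assuming s3cmd is configured as in the loop above; the bucket names are the ones used in this ticket) that flags every zip without its companion md5sum file:

```bash
# For each zip bucket of this ticket, print the zips that have no .md5sum
# companion object, i.e. the candidates for this "stuck compression" issue.
for bucket in ops-rs-s3-l1-nrt-zip ops-rs-s3-l1-ntc-zip ops-rs-s3-l1-stc-zip ops-rs-pug-zip; do
  s3cmd ls "s3://$bucket/" | awk '{print $4}' | grep '\.zip$' | while read -r zip; do
    # s3cmd ls prints nothing when the md5sum object does not exist.
    if [ -z "$(s3cmd ls "${zip}.md5sum")" ]; then
      echo "missing md5sum: $zip"
    fi
  done
done
```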

Products found in the PRIP:


S3A_OL_1_EFR____20230409T204847_20230409T213258_20230615T164903_2651_097_271______LN3_D_NT_002.SEN3 ==> OK
S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3 ==> NOK
S3A_SL_1_RBT____20230708T204409_20230708T204909_20230709T034243_0299_101_014______LN3_D_NR_002.SEN3 ==> NOK
S3A_SY_1_MISR___20230708T183926_20230708T184126_20230711T181203_0119_101_013______LN3_D_ST_002.SEN3 ==> NOK
S3B_OL_1_EFR____20230409T182835_20230409T191245_20230615T230003_2650_078_127______LN3_D_NT_002.SEN3 ==> OK
S3B_OL_1_EFR____20230409T200933_20230409T205344_20230615T135639_2651_078_128______LN3_D_NT_002.SEN3 ==> OK
S3B_OL_1_EFR____20230409T215032_20230409T223442_20230615T195158_2650_078_129______LN3_D_NT_002.SEN3 ==> OK
S3B_OL_1_EFR____20230708T215428_20230708T215628_20230709T012526_0119_081_257______LN3_D_NR_002.SEN3 ==> NOK

For the investigation I take the input product S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3 as a sample.

First execution, on pod compression-part2-compression-worker-medium-v44-c89d56b94-kngdq: we observe that this was the last input executed on this pod. Last log seen for it:

2023-07-08T21:59:54+00:00   {"header":{"type":"LOG","timestamp":"2023-07-08T21:59:54.796577Z","level":"WARN","line":36,"file":"Retries.java","thread":"KafkaConsumerDestination{consumerDestinationName='compression-part2.priority-filter-medium', partitions=30, dlqName='error-warning'}.container-0-C-1"},"message":{"content":"Error on performing upload to ops-rs-pug-zip/S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3.zip.md5sum (2/11), retrying in 6000ms"},"custom":{"logger_string":"esa.s1pdgs.cpoc.common.utils.Retries"}}

So the pod was uploading the product to the bucket at this date. It matches the product date found in the bucket:

suberti@refsys-client:~/Documents/investigation$ s3cmd  ls   s3://ops-rs-pug-zip/S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3.zip
2023-07-08 21:59 990647030   s3://ops-rs-pug-zip/S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3.zip

So it appears the upload of the product itself completed successfully, but the pod did not finish uploading the md5sum file, sending the Kafka message to the PRIP, or acknowledging the current Kafka message. When another pod (compression-part2-compression-worker-medium-v44-c89d56b94-k746q) started the compression again for this product, the compression failed with an "already ingested" error.

Hereafter is the scheme; if the pod is stopped or killed during the red operation, we run into this issue:

[Figure: CompressionTask sequence diagram]
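
To make the critical (red) part of the scheme explicit, here is a minimal sketch of the worker's sequence, written with s3cmd for illustration only (the real worker is Java and uses the OBS SDK); the bucket and product names are the ones of this sample:

```bash
BUCKET=ops-rs-pug-zip
PRODUCT=S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3

s3cmd put "$PRODUCT.zip"        "s3://$BUCKET/$PRODUCT.zip"         # 1. upload the zip     -> done in this case
s3cmd put "$PRODUCT.zip.md5sum" "s3://$BUCKET/$PRODUCT.zip.md5sum"  # 2. upload the md5sum  -> pod killed while retrying this step
# 3. publish the Kafka message that makes the product visible to the PRIP      -> never happened
# 4. acknowledge the consumed Kafka message                                    -> never happened,
#    so the message is redelivered to another pod, which then fails with
#    "already ingested" because step 1 already left the zip in the bucket.
```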

To force the compression, the operator should delete the product and restart the error, so this issue is not blocking. Moreover, the scaler cooldown duration will be increased (from 10 minutes to 1 hour) to limit these cases.
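
A possible operator sketch for that recovery (an assumption on my side, not an agreed procedure: the orphaned zip is deleted only when its md5sum companion is missing; restarting the error message itself goes through the usual error management and is not shown here):

```bash
BUCKET=ops-rs-pug-zip
PRODUCT=S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3
ZIP="s3://$BUCKET/$PRODUCT.zip"

# Only treat the product as corrupted if the zip exists but the md5sum does not.
if [ -n "$(s3cmd ls "$ZIP")" ] && [ -z "$(s3cmd ls "$ZIP.md5sum")" ]; then
  s3cmd del "$ZIP"   # remove the orphaned zip so the retry is no longer rejected as "already ingested"
  echo "Orphaned zip deleted; the error can now be restarted."
fi
```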

To solve this issue, I see two possible ways:

Bug Generic Definition of Done (DoD)

pcuq-ads commented 1 year ago

IVV_CCB_2023_w29 : moved to WERUM dashboard for analysis.

Woljtek commented 1 year ago

Werum_CCB_2023_w29: The fix is not needed for phase 1. Move to Icebox. The code responsible for the upload check is shared with many other components.

w-fsi commented 1 year ago

The problem here seems to be that the scaler is killing the pods very aggressively. Normally I would expect an error rate well below 3% for this kind of scenario, as this processing happens very rarely. It should be checked whether this aggressive behaviour is really wanted, as it might also cause issues within other services.

The behaviour of denying a re-upload of data and of checking the md5sum file is intentional behaviour of the S1PRO system, to avoid that already produced products are accidentally overwritten. In particular, another recovery mechanism would be required, as it is not sufficient to just update something in the OBS: the PRIP must be updated as well, since it also exposes the MD5SUM. This mechanism was added to make sure that such scenarios are avoided.

The second proposal of @suberti-ads might be a way to approach this, as a missing md5sum file is a corrupted scenario on its own, and the product might therefore be safe to remove. However, when deleting the existing product or re-uploading it, the scaler might kill the pod again and cause other strange scenarios. Even if this is fixed, there is still the chance that the Kafka message cannot be published and the process is interrupted in between there as well.

Handling the upload is implemented in the OBS SDK, so changing the behaviour there would not just affect the compression worker, but all services that use the OBS. At the current state of the project this would be quite risky and is therefore not recommended. This behaviour might be modified after V2.

As agreed in the CCB yesterday, this issue is not supposed to result in a change in the software, but in an estimation of the impact. This has been provided, and thus this issue is for me a candidate for refusal. Please let me know if another assessment shall be done.

pcuq-ads commented 1 year ago

SYS_CCB_w29 : Tag checked is present. No more action awaited on this ticket. Moved to ON HOLD.