suberti-ads opened this issue 1 year ago
IVV_CCB_2023_w29: moved to WERUM dashboard for analysis.
Werum_CCB_2023_w29: The fix is not needed for phase 1. Move to Icebox. The code responsible for the upload check is shared with many other components.
The problem here seems to be that the scaler is killing the pods very aggressively. Normally I would expect an error rate well below 3% for this kind of scenario, as the processing happens very rarely. It should be checked whether that kind of aggressive behaviour is really wanted, as it might cause issues within other services as well.
The behaviour of denying a re-upload of data and also checking for the md5sum file is intentional in the S1PRO system, to avoid that already produced products are accidentally overwritten. An additional recovery mechanism would otherwise be required, as it is not sufficient to just update something in the OBS; the PRIP would need to be updated as well, since it exposes the MD5SUM. This mechanism was added to make sure such scenarios are avoided.
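To illustrate the intent of that guard, here is a minimal sketch of an upload check that refuses to overwrite an existing object. All names (`guarded_upload`, `AlreadyIngestedError`, the dict standing in for the bucket) are hypothetical; the real check lives in the OBS SDK.

```python
class AlreadyIngestedError(Exception):
    """Raised when a product with the same key already exists in the OBS."""

def guarded_upload(bucket: dict, key: str, data: bytes) -> None:
    # Deny any re-upload: an existing object must never be overwritten,
    # because the PRIP already exposes its MD5SUM and would become stale.
    if key in bucket:
        raise AlreadyIngestedError(key)
    bucket[key] = data

# A second attempt on the same key is rejected with "already ingested".
bucket = {}
guarded_upload(bucket, "product.zip", b"...")
try:
    guarded_upload(bucket, "product.zip", b"...")
except AlreadyIngestedError:
    print("already ingested")
```

The design choice here is to fail loudly rather than silently overwrite, which is exactly why an interrupted first attempt blocks every retry.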
The second proposal of @suberti-ads might be a viable approach, as a missing md5sum file is a corruption scenario on its own and thus the product might be safe to remove. However, when deleting the existing product or re-uploading it, the scaler might kill the pod again, causing other weird scenarios. Even if this is fixed, there is still the chance that the Kafka message cannot be published and the process is interrupted in between there as well.
Handling the upload is realized in the OBS SDK, so changing the behaviour there would affect not just the compression worker but all services that use the OBS. At the current state of the project this is quite risky and therefore not recommended. This behaviour might be modified after V2.
As agreed in the CCB yesterday, this issue is not supposed to result in a change to the software, but in an estimation of the impact. That estimation has been provided, and thus this issue is, for me, a candidate for refusal. Please let me know if another assessment shall be done.
SYS_CCB_w29: The checked tag is present. No further action is awaited on this ticket. Moved to ON HOLD.
Environment:
- Delivery tag: release/1.13.2
- Platform: OPS Orange Cloud
- Configuration: processing common 1.13-2-rc1, processing compression 1.13.1-rc1
Traceability:
Current Behavior: Some compression tasks interrupted by a pod's end of life (scaler or another root cause) leave the compression stuck for the affected product, which is never sent to the PRIP.
Expected Behavior: A restart should work if the compression task was not successfully finished.
Steps To Reproduce: This issue occurred when the task was interrupted after writing the zip file to the compression bucket and before the Kafka message was generated.
Test execution artefacts (i.e. logs, screenshots…): sample for S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002 (see below):
- first execution, with pod compression-part2-compression-worker-medium-v44-c89d56b94-kngdq: Explore-logs-2023-07-13 11_46_37.txt
- second execution, with pod compression-part2-compression-worker-medium-v44-c89d56b94-k746q: Explore-logs-2023-07-13 11_43_54.txt
Whenever possible, first analysis of the root cause:
- On 3% of production we see "already ingested" errors.
- For all of them, the product was successfully written to the bucket, but we observe the product missing in the PRIP.
- For all products missing in the PRIP, the md5sum file is missing in the bucket.
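The correlation above (zip present in the bucket, md5sum sidecar missing) can be checked with a small helper like the following sketch. The `.md5sum` suffix convention and the flat key listing are assumptions; the real bucket layout may differ.

```python
def products_missing_md5sum(keys):
    """Return product keys that have no matching md5sum sidecar file.

    Assumes the sidecar is named '<product>.md5sum'; adapt to the
    actual naming convention of the compression bucket.
    """
    key_set = set(keys)
    return [k for k in keys
            if not k.endswith(".md5sum") and k + ".md5sum" not in key_set]

# 'B.zip' simulates an interrupted upload: zip written, md5sum never uploaded.
keys = ["A.zip", "A.zip.md5sum", "B.zip"]
print(products_missing_md5sum(keys))  # -> ['B.zip']
```

Running such a scan over the bucket listing would enumerate exactly the products stuck in the state described in this issue.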
product found in PRIP:
For investigation I took the input product S3A_OL_1_EFR____20230708T184832_20230708T185132_20230708T215748_0179_101_013_2340_LN3_O_NR_002.SEN3 as a sample. First execution on pod compression-part2-compression-worker-medium-v44-c89d56b94-kngdq. We observe that this was the last input executed on this pod. Last log seen for it:
So the pod was uploading the product to the bucket at this date. It matches the product date found in the bucket:
So it appears the upload of the product itself was successfully done, but the pod did not finish uploading the md5sum file, sending the Kafka message to the PRIP, or acknowledging the current Kafka message. So when another pod (compression-part2-compression-worker-medium-v44-c89d56b94-k746q) started the compression again for this product, the compression failed with an "already ingested" error.
Hereafter the scheme; if the pod is stopped or killed during the red operation, we run into this issue:
To force the compression, the operator should delete the product and restart the error, so this issue is not blocking. Moreover, the scaler cooldown duration will be increased to limit these cases (from 10 minutes to 1 hour).
To solve this issue, I see two ways:
Check also the presence of the md5sum file. Pro: detection of duplicates still works. Against: when the pod is killed after uploading the md5sum and before the Kafka message is sent, we will have the same issue (a low chance, but not completely null).
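That proposal could look like the following sketch: on a retry, a zip without its md5sum sidecar is treated as a corrupted partial upload and redone. All names are hypothetical, and as noted above, the residual race between the md5sum upload and the Kafka publish remains.

```python
def retry_upload(bucket: dict, key: str, data: bytes, md5: bytes) -> None:
    """Sketch of the proposal: only reject a re-upload as a duplicate
    when BOTH the product and its md5sum sidecar already exist."""
    if key in bucket:
        if key + ".md5sum" in bucket:
            # Genuine duplicate: the first attempt fully completed.
            raise RuntimeError("already ingested")
        # Missing md5sum => the previous attempt was interrupted; the
        # product is incomplete, so delete it and redo the upload.
        del bucket[key]
    bucket[key] = data
    bucket[key + ".md5sum"] = md5

bucket = {"P.zip": b"partial"}   # leftover from an interrupted attempt
retry_upload(bucket, "P.zip", b"zip", b"md5")
print(sorted(bucket))  # ['P.zip', 'P.zip.md5sum']
```

With this logic, the retry on the second pod would succeed instead of failing with "already ingested", at the cost of the narrower race window described above.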