Closed suberti-ads closed 1 year ago
Is this configuration line correct? We see spaces where one would expect ";":
app.preparation-worker.pdu.config.SY_2_VG1.dynProcParams.TileIdentifiers=AFRICA____ NORTH_AMERICA__ SOUTH_AMERICA____ CENTRAL_AMERICA NORTH_ASIA_ WEST_ASIA__ SOUTH_EAST_ASIA ASIAN_ISLANDS AUSTRALASIA__ EUROPE_____
app.preparation-worker.pdu.config.SY_2_VG1___.dynProcParams.TileCoordinates=[-35.0 -26.0 -35.0 60.0 38.0 60.0 38.0 -26.0 -35.0 -26.0];[40.0 -180.0 40.0 -13.0 75.0 -13.0 75.0 -180.0 40.0 -180.0];[0.0 -125.0 0.0 -50.0 50.0 -50.0 50.0 -125.0 0.0 -125.0];[-56.0 -93.0 -56.0 -33.0 25.0 -33.0 25.0 -93.0 -56.0 -93.0];[40.0 45.0 40.0 180.0 75.0 180.0 75.0 45.0 40.0 45.0];[5.0 25.0 5.0 98.0 50.0 98.0 50.0 25.0 5.0 25.0];[5.0 68.0 5.0 147.0 55.0 147.0 55.0 68.0 5.0 68.0];[-12.0 92.0 -12.0 170.0 29.0 170.0 29.0 92.0 -12.0 92.0];[-48.0 95.0 -48.0 180.0 10.0 180.0 10.0 95.0 -48.0 95.0];[25.0 -11.0 25.0 62.0 75.0 62.0 75.0 -11.0 25.0 -11.0];
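For illustration, here is a minimal sketch of how we understand the intended format (our assumption, not the actual PUG parser): TileIdentifiers is a space-separated list of underscore-padded names, and TileCoordinates is a ';'-separated list of bracketed "lat lon" polygons in the same order.

```python
# Sketch only (assumed format, not the real PUG parser): pair up the
# space-separated identifiers with the ';'-separated polygon blocks.
ids_raw = "AFRICA____ NORTH_AMERICA__"  # first two entries only, for brevity
coords_raw = ("[-35.0 -26.0 -35.0 60.0 38.0 60.0 38.0 -26.0 -35.0 -26.0];"
              "[40.0 -180.0 40.0 -13.0 75.0 -13.0 75.0 -180.0 40.0 -180.0];")

# Identifiers are padded with '_' to a fixed width; strip the padding.
names = [n.rstrip("_") for n in ids_raw.split()]

polys = []
for block in filter(None, coords_raw.split(";")):
    vals = [float(v) for v in block.strip("[]").split()]
    polys.append(list(zip(vals[0::2], vals[1::2])))  # (lat, lon) vertices

tiles = dict(zip(names, polys))
```

Each tile then maps to a closed five-vertex polygon, which is consistent with the "PDU requested for coordinates" lines in the processor log further below.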
@SYTHIER-ADS: May I ask why you think this is the issue? The answer to your question is very difficult. We are quite sure that this approach is the right one, as it is used exactly this way in the legacy project, and the simulator also indicates it should be used this way. We basically do not have up-to-date documentation for the PUG, only a few pages on these parameters in the old PUG ICD from 2016, which indeed asks for a GML and also describes the use of the identifiers differently.
We double-checked, however, whether there was some routine in the old system that somehow translates this information into a GML, and could not find one. The configuration was taken exactly as-is from the legacy system. It might be that the ICD was updated afterwards and a simpler configuration was chosen, as markup can attract errors very easily.
So if you have any input on these parameters, please share it with us. I would also be really interested in why you assume this to be the problem, so that we are not hunting phantoms here while the actual error keeps being triggered. I agree, however, that something like this might cause a SIGSEGV.
The parameter should be added directly to the JobOrder, so theoretically, when the other format is added as configuration, it should be added the same way as configured. I would recommend, however, manually executing the PUG on a failed working directory with modified parameters to see if it actually fixes the issue.
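As a hedged sketch of that manual test: strip the empty dynamic processing parameters from a copy of the JobOrder before re-running the PUG by hand. The element names below follow the generic IPF JobOrder layout and are an assumption for illustration, not taken from the ICD.

```python
# Sketch (assumed JobOrder element names): drop parameters with an empty
# <Value> and fix the count attribute before re-running the processor.
import xml.etree.ElementTree as ET

joborder = """<List_of_Dynamic_Processing_Parameters count="3">
  <Dynamic_Processing_Parameter><Name>pduType</Name><Value>tile</Value></Dynamic_Processing_Parameter>
  <Dynamic_Processing_Parameter><Name>pduLength</Name><Value></Value></Dynamic_Processing_Parameter>
  <Dynamic_Processing_Parameter><Name>MtdPDUFrameNumbers</Name><Value></Value></Dynamic_Processing_Parameter>
</List_of_Dynamic_Processing_Parameters>"""

root = ET.fromstring(joborder)
for param in list(root):
    # findtext returns "" when <Value> exists but is empty
    if not (param.findtext("Value") or "").strip():
        root.remove(param)
root.set("count", str(len(root)))
```

Running the PUG against such a modified JobOrder in the failed working directory would tell us whether the empty parameters are really the trigger.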
@w-fsi, thanks for the feedback. So, what is your analysis?
@SYTHIER-ADS
2023-07-26T16:01:42+00:00 2023-07-26T16:01:42.093815 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (25.000000, -11.000000); (25.000000, 62.000000); (75.000000, 62.000000); (75.000000, -11.000000); (25.000000, -11.000000) ]
2023-07-26T16:01:42+00:00 2023-07-26T16:01:42.093445 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (-48.000000, 95.000000); (-48.000000, 180.000000); (10.000000, 180.000000); (10.000000, 95.000000); (-48.000000, 95.000000) ]
2023-07-26T16:01:42+00:00 2023-07-26T16:01:42.093099 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (-12.000000, 92.000000); (-12.000000, 170.000000); (29.000000, 170.000000); (29.000000, 92.000000); (-12.000000, 92.000000) ]
2023-07-26T16:01:42+00:00 2023-07-26T16:01:42.092843 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (5.000000, 68.000000); (5.000000, 147.000000); (55.000000, 147.000000); (55.000000, 68.000000); (5.000000, 68.000000) ]
2023-07-26T16:01:42+00:00 2023-07-26T16:01:42.092572 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (5.000000, 25.000000); (5.000000, 98.000000); (50.000000, 98.000000); (50.000000, 25.000000); (5.000000, 25.000000) ]
2023-07-26T16:01:42+00:00 2023-07-26T16:01:42.092290 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (40.000000, 45.000000); (40.000000, 180.000000); (75.000000, 180.000000); (75.000000, 45.000000); (40.000000, 45.000000) ]
2023-07-26T16:01:42+00:00 2023-07-26T16:01:42.092016 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (-56.000000, -93.000000); (-56.000000, -33.000000); (25.000000, -33.000000); (25.000000, -93.000000); (-56.000000, -93.000000) ]
2023-07-26T16:01:42+00:00 2023-07-26T16:01:42.091724 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (0.000000, -125.000000); (0.000000, -50.000000); (50.000000, -50.000000); (50.000000, -125.000000); (0.000000, -125.000000) ]
2023-07-26T16:01:42+00:00 2023-07-26T16:01:42.091416 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (40.000000, -180.000000); (40.000000, -13.000000); (75.000000, -13.000000); (75.000000, -180.000000); (40.000000, -180.000000) ]
2023-07-26T16:01:42+00:00 2023-07-26T16:01:42.091108 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (-35.000000, -26.000000); (-35.000000, 60.000000); (38.000000, 60.000000); (38.000000, -26.000000); (-35.000000, -26.000000) ]
These logs indicate that the dynamic processing parameter for the geographic definition of the tiles is working as intended. The processor correctly identifies the list and prints it in a different format, which means it was successfully parsed.
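To make that check concrete, one of the log lines above can be parsed back into vertices and compared against the configured EUROPE polygon (a quick sketch; the regex is ours, only the values come from the log and the configuration):

```python
# Extract the (lat, lon) pairs the processor printed and compare them to
# the configured EUROPE tile [25.0 -11.0 25.0 62.0 75.0 62.0 75.0 -11.0 25.0 -11.0].
import re

log = ("PDU requested for coordinates: [ (25.000000, -11.000000); "
       "(25.000000, 62.000000); (75.000000, 62.000000); "
       "(75.000000, -11.000000); (25.000000, -11.000000) ]")
pairs = [(float(a), float(b))
         for a, b in re.findall(r"\(([-\d.]+), ([-\d.]+)\)", log)]
expected = [(25.0, -11.0), (25.0, 62.0), (75.0, 62.0),
            (75.0, -11.0), (25.0, -11.0)]
assert pairs == expected
```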
My first assumption, based on the point in time when the error occurs, is that the IPF cannot handle the empty dynamic processing parameters pduLength and MtdPDUFrameNumbers well: while they are not needed for TILE PDUs, they are still listed for VG1 and V10 products. In order to validate this assumption, I would need you to update the tasktable_configmap (TaskTable.PUG_SY_2_VG1.03.xml) with the following part:
<List_of_Dyn_ProcParams count="10">
    <Dyn_ProcParam>
        <Param_Name>hardwareName</Param_Name>
        <Param_Type>string</Param_Type>
        <Param_Default></Param_Default>
    </Dyn_ProcParam>
    <Dyn_ProcParam>
        <Param_Name>orderType</Param_Name> <!-- Timeliness of the product -->
        <Param_Type>String</Param_Type>
        <Param_Default>NRT</Param_Default>
    </Dyn_ProcParam>
    <Dyn_ProcParam>
        <Param_Name>facilityName</Param_Name>
        <Param_Type>string</Param_Type>
        <Param_Default>MAR</Param_Default>
    </Dyn_ProcParam>
    <Dyn_ProcParam>
        <Param_Name>TileCoordinates</Param_Name>
        <Param_Type>string</Param_Type>
        <Param_Default></Param_Default>
    </Dyn_ProcParam>
    <Dyn_ProcParam>
        <Param_Name>TileIdentifiers</Param_Name>
        <Param_Type>string</Param_Type>
        <Param_Default></Param_Default>
    </Dyn_ProcParam>
    <Dyn_ProcParam>
        <Param_Name>QcApply</Param_Name>
        <Param_Type>string</Param_Type>
        <Param_Default>false</Param_Default>
    </Dyn_ProcParam>
    <Dyn_ProcParam>
        <Param_Name>browseStubMode</Param_Name> <!-- Provided by PDGS-->
        <Param_Type>String</Param_Type>
        <Param_Default>false</Param_Default>
    </Dyn_ProcParam>
    <Dyn_ProcParam>
        <!-- Dyn parameter for OLQC-->
        <Param_Name>OLQCReportTemplate</Param_Name>
        <Param_Type>String</Param_Type>
        <Param_Default>OLQC_Main.jasper</Param_Default>
    </Dyn_ProcParam>
    <Dyn_ProcParam>
        <Param_Name>baselineCollection</Param_Name> <!-- Provided by PDGS-->
        <Param_Type>String</Param_Type>
        <Param_Default>002</Param_Default>
    </Dyn_ProcParam>
    <Dyn_ProcParam>
        <Param_Name>pduType</Param_Name>
        <Param_Type>String</Param_Type>
        <Param_Default>tile</Param_Default>
    </Dyn_ProcParam>
</List_of_Dyn_ProcParams>
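One easy mistake when editing this block is forgetting to update the count attribute after removing entries. A small sanity check (a sketch for local use, not part of the delivery):

```python
# Verify that count on List_of_Dyn_ProcParams matches the number of
# remaining Dyn_ProcParam entries after the edit.
import xml.etree.ElementTree as ET

def dyn_param_counts_ok(tasktable_xml: str) -> bool:
    root = ET.fromstring(tasktable_xml)
    lists = [root] if root.tag == "List_of_Dyn_ProcParams" else []
    lists += root.findall(".//List_of_Dyn_ProcParams")
    return all(int(lst.get("count", "0")) == len(lst.findall("Dyn_ProcParam"))
               for lst in lists)

good = ('<List_of_Dyn_ProcParams count="1">'
        '<Dyn_ProcParam><Param_Name>pduType</Param_Name></Dyn_ProcParam>'
        '</List_of_Dyn_ProcParams>')
bad = good.replace('count="1"', 'count="12"')
```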
In order for this workaround to take effect, the JobOrder has to be regenerated by the preparation-worker, as the fix is applied on the preparation-worker side.
In order to test the workaround proposed by @w-jka, I applied the following procedure:
1. Identify the failing JobOrders with Grafana: 4 errors: JobOrder.126304.xml, JobOrder.129238.xml, JobOrder.129309.xml, JobOrder.129239.xml
2. On MongoDB, get the ProductName that triggered each job:
rs0:PRIMARY> db.appDataJob.find({"pod" : { $regex: "pug-ntc"}, "_id":126304}, {"productName":1})
{ "_id" : NumberLong(126304), "productName" : "S3A_SY_2_VG1____20230428T000000_20230428T235900_20230725T143631_GLOBAL____________MAR_D_NT_002.SEN3" }
rs0:PRIMARY> db.appDataJob.find({"pod" : { $regex: "pug-ntc"}, "_id":129238}, {"productName":1})
{ "_id" : NumberLong(129238), "productName" : "S3A_SY_2_VG1____20230409T000000_20230409T235900_20230727T030335_GLOBAL____________MAR_D_NT_002.SEN3" }
rs0:PRIMARY> db.appDataJob.find({"pod" : { $regex: "pug-ntc"}, "_id":129309}, {"productName":1})
{ "_id" : NumberLong(129309), "productName" : "S3B_SY_2_VG1____20230409T000000_20230409T235900_20230728T030335_GLOBAL____________MAR_D_NT_002.SEN3" }
rs0:PRIMARY> db.appDataJob.find({"pod" : { $regex: "pug-ntc"}, "_id":129239}, {"productName":1})
{ "_id" : NumberLong(129239), "productName" : "S3B_SY_2_VG1____20230428T000000_20230428T235900_20230726T151656_GLOBAL____________MAR_D_NT_002.SEN3" }
3. Stop the RS addon
#Gateway
kp get po | grep pug-ntc
The RS addon is stopped
4. Apply the WA. Here is the diff on tasktable_configmap.yaml (branch rs-1052):
git diff tasktable_configmap.yaml
diff --git a/apps/rs-addon/rs-addon-s3-pug-ntc_Executables/additional_resources/tasktable_configmap.yaml b/apps/rs-addon/rs-addon-s3-pug-ntc_Executables/additional_resources/tasktable_configmap.yaml
index 9faa574..de537eb 100644
--- a/apps/rs-addon/rs-addon-s3-pug-ntc_Executables/additional_resources/tasktable_configmap.yaml
+++ b/apps/rs-addon/rs-addon-s3-pug-ntc_Executables/additional_resources/tasktable_configmap.yaml
@@ -4377,7 +4377,7 @@ data:
<!-- config for OLQC Quality Checks QC-Check Configuration-->
</Cfg_File>
</List_of_Cfg_Files>
- <List_of_Dyn_ProcParams count="12">
+ <List_of_Dyn_ProcParams count="10">
<Dyn_ProcParam>
<Param_Name>hardwareName</Param_Name>
<Param_Type>string</Param_Type>
@@ -4424,16 +4424,6 @@ data:
<Param_Type>String</Param_Type>
<Param_Default>002</Param_Default>
</Dyn_ProcParam>
- <Dyn_ProcParam>
- <Param_Name>pduLength</Param_Name>
- <Param_Type>number</Param_Type>
- <!--Param_Default>180</Param_Default-->
- </Dyn_ProcParam>
- <Dyn_ProcParam>
- <Param_Name>MtdPDUFrameNumbers</Param_Name>
- <Param_Type>number</Param_Type>
- <!--Param_Default>0</Param_Default-->
- </Dyn_ProcParam>
<Dyn_ProcParam>
<Param_Name>pduType</Param_Name>
<Param_Type>String</Param_Type>
5. Delete the jobs in MongoDB
rs0:PRIMARY> db.appDataJob.deleteMany({"_id":126304})
{ "acknowledged" : true, "deletedCount" : 1 }
rs0:PRIMARY> db.appDataJob.deleteMany({"_id":129238})
{ "acknowledged" : true, "deletedCount" : 1 }
rs0:PRIMARY> db.appDataJob.deleteMany({"_id":129309})
{ "acknowledged" : true, "deletedCount" : 1 }
rs0:PRIMARY> db.appDataJob.deleteMany({"_id":129239})
{ "acknowledged" : true, "deletedCount" : 1 }
rs0:PRIMARY>
6. Start the RS addon
#Bastion
ansible-playbook deploy-rs-addon.yaml -i inventory/sample/hosts.yaml -e rs_addon_location=../apps/rs-addon/rs-addon-s3-pug-ntc.zip -e stream_name=s3-pug-ntc
#Gateway
kp get po | grep pug-ntc
s3-pug-ntc-part1-execution-worker-v3-c6ddc5cf6-nz644 0/2 Pending 0 6m
s3-pug-ntc-part1-message-filter-v3-56bd94ffc-jjhlv 2/2 Running 0 5m58s
s3-pug-ntc-part1-preparation-worker-v3-d64fb7bb7-l6k8j 2/2 Running 0 5m52s
s3-pug-ntc-part2-housekeep-v3-7669cd666-2dgbh 2/2 Running 0 5m52s
s3-pug-ntc-part2-time-v3-5dbbb5f8f5-qcxrb 2/2 Running 0 5m54s
kp describe cm s3-pug-ntc-tasktables | grep -A55 "PUG_SY_2_VG1.03" | grep "pduLength" | wc -l
0
7. Republish the messages upstream of the Preparation-Worker
PN_LIST="S3A_SY_2_VG1____20230428T000000_20230428T235900_20230725T143631_GLOBAL____________MAR_D_NT_002 S3B_SY_2_VG1____20230428T000000_20230428T235900_20230726T151656_GLOBAL____________MAR_D_NT_002 S3B_SY_2_VG1____20230409T000000_20230409T235900_20230728T030335_GLOBAL____________MAR_D_NT_002 S3A_SY_2_VG1____20230409T000000_20230409T235900_20230727T030335_GLOBAL____________MAR_D_NT_002"
TOPIC=s3-pug-ntc-part1.message-filter
for PN in $PN_LIST; do
    echo "--------------- $PN -----------------"
    MESSAGE_FILE=$PN.mf_msg
    # Dump the original message for this product from the topic
    kubectl -n ${NAMESPACE_KAFKA} exec -ti ${POD} -c ${CONTAINER} -- bash /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server ${BOOTSTRAP_URL} --topic ${TOPIC} --from-beginning --timeout-ms 20000 | grep $PN > $MESSAGE_FILE
    # Copy it into the Kafka pod and republish it on the same topic
    kubectl cp -n ${NAMESPACE_KAFKA} -c ${CONTAINER} ${MESSAGE_FILE} ${POD}:/tmp/message.txt
    kubectl -n ${NAMESPACE_KAFKA} exec -ti ${POD} -c ${CONTAINER} -- sh -c "/opt/kafka/bin/kafka-console-producer.sh --bootstrap-server ${BOOTSTRAP_URL} --topic ${TOPIC} < /tmp/message.txt"
    rm -v $MESSAGE_FILE
done
Jobs have been successfully recreated:
rs0:PRIMARY> db.appDataJob.find({"pod" : { $regex: "pug-ntc"},"creationDate":{$gt: ISODate("2023-07-31T13:00:00.000Z")}}, {"productName":1, "generation.state":1});
{ "_id" : NumberLong(132735), "productName" : "S3A_SY_2_VG1____20230428T000000_20230428T235900_20230725T143631_GLOBAL____________MAR_D_NT_002.SEN3", "generation" : { "state" : "SENT" } }
{ "_id" : NumberLong(132736), "productName" : "S3B_SY_2_VG1____20230428T000000_20230428T235900_20230726T151656_GLOBAL____________MAR_D_NT_002.SEN3", "generation" : { "state" : "SENT" } }
{ "_id" : NumberLong(132737), "productName" : "S3B_SY_2_VG1____20230409T000000_20230409T235900_20230728T030335_GLOBAL____________MAR_D_NT_002.SEN3", "generation" : { "state" : "SENT" } }
{ "_id" : NumberLong(132738), "productName" : "S3A_SY_2_VG1____20230409T000000_20230409T235900_20230727T030335_GLOBAL____________MAR_D_NT_002.SEN3", "generation" : { "state" : "SENT" } }
Job executions are waiting for an available node and for lag consumption:
Good news! Production completed successfully after applying the WA:
System_CCB_2023_w31: The workaround proposed by Werum works. Moved into "Accepted Werum".
We can decrease the priority, as we have a working workaround.
System_CCB_2023_w34: Successfully tested (fix=wa)
Environment:
Traceability:
Current Behavior: Execution failed with error code 139 for SY_2_VG1 products
Expected Behavior: Production successfully done with nominal production input.
Steps To Reproduce: Start a 3% production or a 24h test
Test execution artefacts (i.e. logs, screenshots…): Execution logs: NewErrorPUG-NTCjobOrder126304.txt
Job generated: Job126304.log
Whenever possible, first analysis of the root cause: All issues were for SY_2_VG1 products. Sample: JobOrder JobOrder.126304.xml, product S3A_SY_2_VG1__20230428T000000_20230428T235900_20230725T143631_GLOBAL__MAR_D_NT_002, interval production:
In the logs, the following issue occurred during production:
No OOM kill or related event was seen for this production; the error occurred at about 2023-07-26T16:00:00 on node-141. Hereafter all logs at this hour:
No resource issue:
No Kubernetes event was seen. The issue is reproduced each time I restart the execution.
No core file has been found on the node or in the pod.