Closed: @Woljtek closed this issue 1 year ago.
The provided logs seem to be from an old configuration version, as the error is still the one from before the updated joborder.xslt. @Woljtek could you provide a current log?
I deleted the topic TOPIC=s3-pug-part1.preparation-worker before restarting the chain, so I don't think there is any old job.
The logs are already in the Test execution artefacts section.
@Woljtek The logs of the Test execution artefacts still state the following PUG error:
2023-02-15T16:33:55.712163 s3-pug-nrt-part1-execution-worker-v3-554d94c545-tbh62 [0000000096]: [E] [PUGCoreProcessor.C: main:(173)] Unable to load JobOrder from file "/data/localWD/52655/JobOrder.52655.xml --- acs::S3PUGJobOrder::exS3PUGJobOrderException in S3PUGJobOrder.C(659) from virtual void acs::S3PUGJobOrder::read(acs::XMLIstream&) thread "" [140183974770656]
Error while reading job order
caused by:
acs::rsResourceSet::NotFoundException in rsResourceSet.C(860) from const acs::rsResourceSet::rsValue* acs::rsResourceSet::getValue(const std::string&) const thread "" [140183974770656]
Resource not found: List_of_Config_Files.Config_File in namespace "Ipf_Conf"
The dmesg of the pod:
This indicates that the IPF itself runs into an issue that seems to be unrelated to our software.
This log is outdated:
2023-02-15T16:33:55.712163 s3-pug-nrt-part1-execution-worker-v3-554d94c545-tbh62 [0000000096]: [E] [PUGCoreProcessor.C: main:(173)] Unable to load JobOrder from file "/data/localWD/52655/JobOrder.52655.xml --- acs::S3PUGJobOrder::exS3PUGJobOrderException in S3PUGJobOrder.C(659) from virtual void acs::S3PUGJobOrder::read(acs::XMLIstream&) thread "" [140183974770656]
Error while reading job order
caused by:
acs::rsResourceSet::NotFoundException in rsResourceSet.C(860) from const acs::rsResourceSet::rsValue* acs::rsResourceSet::getValue(const std::string&) const thread "" [140183974770656]
Resource not found: List_of_Config_Files.Config_File in namespace "Ipf_Conf"
It is related to bug #828, for which a workaround is now in place.
@w-jka Do you think this behavior is an IPF issue or a deployment issue?
FYI, I am going to increase the EW memory limits to 50Gi according to the prerequisites.
I reproduced the same behavior with the limit at 50Gi.
This is actually not easy to answer. We are observing a SIGSEGV that is invoked from somewhere and kills the process. This is usually a memory violation and thus very unlikely to be caused by our software. From a deployment perspective, we are not finding any issue at the moment that could explain it, and the expectation is that increasing the memory limits will not change anything, as this is not an out-of-memory issue.
The issue occurs in the system's libc and is very likely an issue with the processor or the operating system. We might try using an older version and see if it occurs there as well. However, without at least a document giving an idea of which operating system is required, most of this is pure guessing.
We opened a PDGSANOM ticket -> https://cams.esa.int/browse/PDGSANOM-12241. I propose to put this issue on hold while waiting for SDP feedback.
IVV_CCB_2023_w09 : Moved into "On hold" waiting for ESA feedback
From S-3 L1 IPF Maintainers to Reference System: Dears, the WD has been analyzed by the PUG maintenance team. It seems the error comes from a missing value for the HardwareName in the JobOrder. Regards, S3IPF L1 Maintenance team
In order to include this dynamic process parameter, one has to add the following lines to the configuration:
app.preparation-worker.pdu.dyn-proc-params.hardwareName=O
app.housekeep.pdu.dyn-proc-params.hardwareName=O
The allowed values for this parameter are:
O -> (OPE)
F -> (REF)
D -> (DEV)
R -> (REP)
We will add the value O to the default configuration for now.
IVV_CCB_2023_w13 : Moved into "Accepted OPS", Tests now with @Woljtek and @w-jka
The configuration has been added:
app.preparation-worker.pdu.dyn-proc-params.hardwareName=O
app.housekeep.pdu.dyn-proc-params.hardwareName=O
However, the error is still present (tested on OL_0 products):
2023-04-14T14:40:50.174363 s3-pug-preint-part1-execution-worker-v3-56cfb6d578-mlt4c PUG_OL_0_EFR 03.45 [0000001175]: [E] PUGCoreProcessor: Fri Apr 14 14:40:50 2023
PID: 1175 SIGNAL 11 THREAD: 140402505103104
core in: /tmp/core.1175 - stack follow
/usr/local/components/PUG-3.45/bin/../lib/libSignal.so.5.4 ( acs::Signal::catchBadSignal(int) )
/lib64/libpthread.so.0 ( )
/usr/lib64/libstdc++.so.6 ( std::string::assign(std::string const&) )
/usr/local/components/PUG-3.45/bin/../lib/libS3PDUGenerator.so.2.1 ( acs::PDUGeneratorThread::setInfoForStatistics() )
/usr/local/components/PUG-3.45/bin/../lib/libS3PDUGenerator.so.2.1 ( acs::StripeGeneratorThread::createPDU() )
/usr/local/components/PUG-3.45/bin/../lib/libS3PDUGenerator.so.2.1 ( acs::StripeGeneratorThread::run() )
/usr/local/components/PUG-3.45/bin/../lib/../lib/libThread.so.5.16 ( acs::Thread::svc(void*) )
/lib64/libpthread.so.0 ( )
/lib64/libc.so.6 ( clone )
@w-jka How can I check if the hardwareName is taken into account on the EW?
@Woljtek There are two ways. The hardwareName is included in the JobOrder. If you download the working directory from the failed-workdir bucket, you can check whether the JobOrder.xml file there contains the dynamic process parameter. The JobOrder is also printed in the logs, but it is not as nicely formatted. I would advise using the failed-workdir approach. If you could provide the file in this issue, we can have a look at it as well.
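As a quick illustration of that check, a minimal sketch follows. The XML element names (`Dynamic_Processing_Parameter`, `Name`, `Value`) mirror the usual IPF JobOrder layout but are assumptions here, not verified against the S3 PUG schema; the sample file is fabricated for the demo:

```shell
# Hypothetical JobOrder fragment, written to a temp file for the demo.
# Element names are assumptions based on the common IPF JobOrder layout.
cat > /tmp/JobOrder.sample.xml <<'EOF'
<Dynamic_Processing_Parameters>
  <Dynamic_Processing_Parameter>
    <Name>hardwareName</Name>
    <Value>O</Value>
  </Dynamic_Processing_Parameter>
</Dynamic_Processing_Parameters>
EOF

# Look for the parameter and the line after it (its value).
# An empty result would mean the parameter is missing from the JobOrder.
grep -A1 '<Name>hardwareName</Name>' /tmp/JobOrder.sample.xml
```

Run against a real JobOrder.xml downloaded from the failed-workdir bucket, an empty grep output would confirm the missing hardwareName.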
@w-jka Thanks for the quick answer.
On a failed JO, I observed that the hardwareName is not filled. Source JobOrder.3211.xml: s3://ops-rs-failed-workdir/s3-pug-preint-part1-execution-worker-v3-56cfb6d578-mlt4c_S3B_OL_0_EFR__20230409T182835_20230409T183033_20230409T210042_0118_078_127____LN3_D_NR_002.SEN3_b582e550-fccb-4530-a9ab-15300d897ea6_0/JobOrder.3211.xml
This JO triggers the bug of this issue. Extract from the logs for job order /data/localWD/3211/JobOrder.3211.xml:
2023-04-14T14:31:55.358330 s3-pug-preint-part1-execution-worker-v3-56cfb6d578-mlt4c PUG_OL_0_EFR 03.45 [0000001103]: [I] PUGCoreProcessor: Loaded configured parameter "ProductTypeConf.OL_0_EFR___.DeltaTime" = <-0.044>
2023-04-14T14:31:55.358423 s3-pug-preint-part1-execution-worker-v3-56cfb6d578-mlt4c PUG_OL_0_EFR 03.45 [0000001103]: [I] PUGCoreProcessor: Loaded configured parameter "ProductTypeConf.OL_0_EFR___.CheckJOInterval" = <3> [unit: lines]
2023-04-14T14:31:55.358471 s3-pug-preint-part1-execution-worker-v3-56cfb6d578-mlt4c PUG_OL_0_EFR 03.45 [0000001103]: [I] PUGCoreProcessor: Converted to seconds: "ProductTypeConf.OL_0_EFR___.CheckJOInterval" = <0.132> [unit: s]
2023-04-14T14:31:55.361569 s3-pug-preint-part1-execution-worker-v3-56cfb6d578-mlt4c PUG_OL_0_EFR 03.45 [0000001103]: [I] PUGCoreProcessor: Processing orbit file [/data/localWD/3211/S3B_AX___FRO_AX_20230409T000000_20230419T000000_20230412T065540___________________EUM_O_AL_001.SEN3]
2023-04-14T14:31:55.369273 s3-pug-preint-part1-execution-worker-v3-56cfb6d578-mlt4c PUG_OL_0_EFR 03.45 [0000001103]: [I] PUGCoreProcessor: Processing orbit file [/data/localWD/3211/S3B_AX___OSF_AX_20180425T191855_99991231T235959_20221110T110324___________________EUM_O_AL_001.SEN3]
2023-04-14T14:31:55.371772 s3-pug-preint-part1-execution-worker-v3-56cfb6d578-mlt4c PUG_OL_0_EFR 03.45 [0000001103]: [I] PUGCoreProcessor: Going to uncompress the file [S3B_OPER_MPL_ORBSCT_20180425T191855_99999999T999999_0010.TGZ] if needed
2023-04-14T14:31:55.379445 s3-pug-preint-part1-execution-worker-v3-56cfb6d578-mlt4c PUG_OL_0_EFR 03.45 [0000001103]: [I] PUGCoreProcessor: Orbit scenario file used for propagator init is [/data/localWD/3211/S3B_AX___OSF_AX_20180425T191855_99991231T235959_20221110T110324___________________EUM_O_AL_001.SEN3/S3B_OPER_MPL_ORBSCT_20180425T191855_99999999T999999_0010.EOF]
2023-04-14T14:31:55.436730 s3-pug-preint-part1-execution-worker-v3-56cfb6d578-mlt4c PUG_OL_0_EFR 03.45 [0000001103]: [I] PUGCoreProcessor: Adding input file /data/localWD/3211/S3B_OL_0_EFR____20230409T182835_20230409T183033_20230409T210042_0118_078_127______LN3_D_NR_002.SEN3 in time interval [2023-04-09T18:28:34.816075, 2023-04-09T18:30:33.049222]
2023-04-14T14:31:55.437123 s3-pug-preint-part1-execution-worker-v3-56cfb6d578-mlt4c PUG_OL_0_EFR 03.45 [0000001103]: [E] PUGCoreProcessor: Fri Apr 14 14:31:55 2023
PID: 1103 SIGNAL 11 THREAD: 139697621907200
core in: /tmp/core.1103 - stack follow
/usr/local/components/PUG-3.45/bin/../lib/libSignal.so.5.4 ( acs::Signal::catchBadSignal(int) )
/lib64/libpthread.so.0 ( )
/usr/lib64/libstdc++.so.6 ( std::string::assign(std::string const&) )
/usr/local/components/PUG-3.45/bin/../lib/libS3PDUGenerator.so.2.1 ( acs::PDUGeneratorThread::setInfoForStatistics() )
/usr/local/components/PUG-3.45/bin/../lib/libS3PDUGenerator.so.2.1 ( acs::StripeGeneratorThread::createPDU() )
/usr/local/components/PUG-3.45/bin/../lib/libS3PDUGenerator.so.2.1 ( acs::StripeGeneratorThread::run() )
/usr/local/components/PUG-3.45/bin/../lib/../lib/libThread.so.5.16 ( acs::Thread::svc(void*) )
/lib64/libpthread.so.0 ( )
/lib64/libc.so.6 ( clone )
{"header":{"type":"LOG","timestamp":"2023-04-14T14:31:55.646145Z","level":"INFO","line":129,"file":"TaskCallable.java","thread":"pool-382-thread-1"},"message":{"content":"Ending task /usr/local/components/PUG-3.45/bin/PUGCoreProcessor with exit code 139"},"custom":{"logger_string":"esa.s1pdgs.cpoc.ipf.execution.worker.job.process.TaskCallable"}}
{"header":{"type":"REPORT","timestamp":"2023-04-14T14:31:55.646000Z","level":"INFO","mission":"S3","workflow":"NOMINAL","rs_chain_name":"S3-PUG-NRT-PREINT","rs_chain_version":"1.12.0-rc1"},"message":{"content":"End Task /usr/local/components/PUG-3.45/bin/PUGCoreProcessor with exit code 139"},"task":{"uid":"b7e8805d-df77-40cd-a459-9f9a308966e8","name":"ProcessingTask","event":"END","status":"OK","output":{},"input":{},"quality":{},"error_code":0,"duration_in_seconds":0.34,"missing_output":[]}}
{"header":{"type":"REPORT","timestamp":"2023-04-14T14:31:55.647000Z","level":"ERROR","mission":"S3","workflow":"NOMINAL","rs_chain_name":"S3-PUG-NRT-PREINT","rs_chain_version":"1.12.0-rc1"},"message":{"content":"[code 290] [exitCode 139] [msg Task /usr/local/components/PUG-3.45/bin/PUGCoreProcessor failed]"},"task":{"uid":"8135cc03-1bb3-4dea-819c-e3c7faae515c","name":"Processing","event":"END","status":"NOK","output":{},"input":{},"quality":{},"error_code":1,"duration_in_seconds":0.342,"missing_output":[]}}
Could you have a look at why the hardwareName value is empty, whereas the stream.parameter looks fine?
@Woljtek Could you provide the preparation-worker log for this job? It is not available on the cluster anymore, so I could not take a look at it.
The PW is still running (but we changed the default name): s3-pug-preint-part1-preparation-worker-v3-647cfd89c4-6h6qk. Logs file: https://app.zenhub.com/files/398313496/4872e4aa-81e0-4f19-81b5-9d83b85b2114/download
@Woljtek Yes, I saw that as well; however, the earliest logs available via kubectl are from this morning. The job in question, however, ran last week.
@w-jka I extracted from Loki all of Friday's logs with this query: {pod="s3-pug-preint-part1-preparation-worker-v3-647cfd89c4-6h6qk"} |= 'AppDataJob 3211'
72 hits: https://app.zenhub.com/files/398313496/2d5ac115-6ee5-4bc9-a4ab-9bb5aff85fcc/download
All logs between the 2023-04-14T13:55:26.636725 and 2023-04-14T13:57:26.138833 (first/last hits) https://app.zenhub.com/files/398313496/75d6b145-49bd-4dfe-9ea5-18b84b96d3b2/download
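For reference, the same extraction can be scripted with Grafana Loki's logcli. A minimal sketch, assuming logcli is installed and LOKI_ADDR points at the Loki instance; the time window and limit are illustrative, and the script only prints the command (dry run) rather than executing it:

```shell
# Assumption: Grafana Loki's logcli is installed and LOKI_ADDR is exported.
POD="s3-pug-preint-part1-preparation-worker-v3-647cfd89c4-6h6qk"
QUERY="{pod=\"$POD\"} |= \"AppDataJob 3211\""

# Dry run: print the logcli invocation instead of executing it.
# --since and --limit values are illustrative, not from the original thread.
echo logcli query --since 72h --limit 5000 "$QUERY"
```

Dropping the leading `echo` would execute the query for real against the configured Loki endpoint.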
If it is not enough, we will plan to restart the PUG test.
@Woljtek Please make another test with the following configuration (replace existing parts of the config as needed):
app.preparation-worker.pdu.config.OL_0_EFR___.type=STRIPE
app.preparation-worker.pdu.config.OL_0_EFR___.reference=DUMP
app.preparation-worker.pdu.config.OL_0_EFR___.length-in-s=6060
app.preparation-worker.pdu.config.OL_0_EFR___.dyn-proc-params.facilityName=LN3
app.preparation-worker.pdu.config.OL_0_EFR___.dyn-proc-params.hardwareName=O
app.preparation-worker.pdu.config.OL_1_EFR___.type=FRAME
app.preparation-worker.pdu.config.OL_1_EFR___.length-in-s=180
app.preparation-worker.pdu.config.OL_1_ERR___.type=STRIPE
app.preparation-worker.pdu.config.OL_1_ERR___.reference=DUMP
app.preparation-worker.pdu.config.OL_1_ERR___.length-in-s=6060
app.preparation-worker.pdu.config.OL_1_ERR___.dyn-proc-params.facilityName=LN3
app.preparation-worker.pdu.config.OL_1_ERR___.dyn-proc-params.hardwareName=O
app.preparation-worker.pdu.config.OL_2_LFR___.type=FRAME
app.preparation-worker.pdu.config.OL_2_LFR___.length-in-s=180
app.preparation-worker.pdu.config.OL_2_LRR___.type=STRIPE
app.preparation-worker.pdu.config.OL_2_LRR___.reference=DUMP
app.preparation-worker.pdu.config.OL_2_LRR___.length-in-s=6060
app.preparation-worker.pdu.config.OL_2_LRR___.dyn-proc-params.facilityName=LN3
app.preparation-worker.pdu.config.OL_2_LRR___.dyn-proc-params.hardwareName=O
app.preparation-worker.pdu.config.SL_0_SLT___.type=STRIPE
app.preparation-worker.pdu.config.SL_0_SLT___.reference=DUMP
app.preparation-worker.pdu.config.SL_0_SLT___.length-in-s=6187
app.preparation-worker.pdu.config.SL_0_SLT___.dyn-proc-params.facilityName=LN3
app.preparation-worker.pdu.config.SL_0_SLT___.dyn-proc-params.hardwareName=O
app.preparation-worker.pdu.config.SL_1_RBT___.type=FRAME
app.preparation-worker.pdu.config.SL_1_RBT___.length-in-s=180
app.preparation-worker.pdu.config.SL_1_RBT___.dyn-proc-params.facilityName=LN3
app.preparation-worker.pdu.config.SL_1_RBT___.dyn-proc-params.hardwareName=O
app.preparation-worker.pdu.config.SL_2_LST___.type=FRAME
app.preparation-worker.pdu.config.SL_2_LST___.length-in-s=180
app.preparation-worker.pdu.config.SL_2_LST___.dyn-proc-params.facilityName=LN3
app.preparation-worker.pdu.config.SL_2_LST___.dyn-proc-params.hardwareName=O
app.preparation-worker.pdu.config.SR_0_SRA___.type=STRIPE
app.preparation-worker.pdu.config.SR_0_SRA___.reference=ORBIT
app.preparation-worker.pdu.config.SR_0_SRA___.length-in-s=3029.6
app.preparation-worker.pdu.config.SR_0_SRA___.offset-in-s=1512.59
app.preparation-worker.pdu.config.SR_0_SRA___.dyn-proc-params.facilityName=LN3
app.preparation-worker.pdu.config.SR_0_SRA___.dyn-proc-params.hardwareName=O
app.preparation-worker.pdu.config.SR_1_SRA___.type=STRIPE
app.preparation-worker.pdu.config.SR_1_SRA___.reference=DUMP
app.preparation-worker.pdu.config.SR_1_SRA___.length-in-s=6187
app.preparation-worker.pdu.config.SR_1_SRA___.dyn-proc-params.facilityName=LN3
app.preparation-worker.pdu.config.SR_1_SRA___.dyn-proc-params.hardwareName=O
app.preparation-worker.pdu.config.SR_2_LAN___.type=STRIPE
app.preparation-worker.pdu.config.SR_2_LAN___.reference=DUMP
app.preparation-worker.pdu.config.SR_2_LAN___.length-in-s=6187
app.preparation-worker.pdu.config.SR_2_LAN___.dyn-proc-params.facilityName=LN3
app.preparation-worker.pdu.config.SR_2_LAN___.dyn-proc-params.hardwareName=O
...
app.housekeep.pdu.config.OL_0_EFR___.type=STRIPE
app.housekeep.pdu.config.OL_0_EFR___.reference=DUMP
app.housekeep.pdu.config.OL_0_EFR___.length-in-s=6060
app.housekeep.pdu.config.OL_0_EFR___.dyn-proc-params.facilityName=LN3
app.housekeep.pdu.config.OL_0_EFR___.dyn-proc-params.hardwareName=O
app.housekeep.pdu.config.OL_1_EFR___.type=FRAME
app.housekeep.pdu.config.OL_1_EFR___.length-in-s=180
app.housekeep.pdu.config.OL_1_ERR___.type=STRIPE
app.housekeep.pdu.config.OL_1_ERR___.reference=DUMP
app.housekeep.pdu.config.OL_1_ERR___.length-in-s=6060
app.housekeep.pdu.config.OL_1_ERR___.dyn-proc-params.facilityName=LN3
app.housekeep.pdu.config.OL_1_ERR___.dyn-proc-params.hardwareName=O
app.housekeep.pdu.config.OL_2_LFR___.type=FRAME
app.housekeep.pdu.config.OL_2_LFR___.length-in-s=180
app.housekeep.pdu.config.OL_2_LRR___.type=STRIPE
app.housekeep.pdu.config.OL_2_LRR___.reference=DUMP
app.housekeep.pdu.config.OL_2_LRR___.length-in-s=6060
app.housekeep.pdu.config.OL_2_LRR___.dyn-proc-params.facilityName=LN3
app.housekeep.pdu.config.OL_2_LRR___.dyn-proc-params.hardwareName=O
app.housekeep.pdu.config.SL_0_SLT___.type=STRIPE
app.housekeep.pdu.config.SL_0_SLT___.reference=DUMP
app.housekeep.pdu.config.SL_0_SLT___.length-in-s=6187
app.housekeep.pdu.config.SL_0_SLT___.dyn-proc-params.facilityName=LN3
app.housekeep.pdu.config.SL_0_SLT___.dyn-proc-params.hardwareName=O
app.housekeep.pdu.config.SL_1_RBT___.type=FRAME
app.housekeep.pdu.config.SL_1_RBT___.length-in-s=180
app.housekeep.pdu.config.SL_1_RBT___.dyn-proc-params.facilityName=LN3
app.housekeep.pdu.config.SL_1_RBT___.dyn-proc-params.hardwareName=O
app.housekeep.pdu.config.SL_2_LST___.type=FRAME
app.housekeep.pdu.config.SL_2_LST___.length-in-s=180
app.housekeep.pdu.config.SL_2_LST___.dyn-proc-params.facilityName=LN3
app.housekeep.pdu.config.SL_2_LST___.dyn-proc-params.hardwareName=O
app.housekeep.pdu.config.SR_0_SRA___.type=STRIPE
app.housekeep.pdu.config.SR_0_SRA___.reference=ORBIT
app.housekeep.pdu.config.SR_0_SRA___.length-in-s=3029.6
app.housekeep.pdu.config.SR_0_SRA___.offset-in-s=1512.59
app.housekeep.pdu.config.SR_0_SRA___.dyn-proc-params.facilityName=LN3
app.housekeep.pdu.config.SR_0_SRA___.dyn-proc-params.hardwareName=O
app.housekeep.pdu.config.SR_1_SRA___.type=STRIPE
app.housekeep.pdu.config.SR_1_SRA___.reference=DUMP
app.housekeep.pdu.config.SR_1_SRA___.length-in-s=6187
app.housekeep.pdu.config.SR_1_SRA___.dyn-proc-params.facilityName=LN3
app.housekeep.pdu.config.SR_1_SRA___.dyn-proc-params.hardwareName=O
app.housekeep.pdu.config.SR_2_LAN___.type=STRIPE
app.housekeep.pdu.config.SR_2_LAN___.reference=DUMP
app.housekeep.pdu.config.SR_2_LAN___.length-in-s=6187
app.housekeep.pdu.config.SR_2_LAN___.dyn-proc-params.facilityName=LN3
app.housekeep.pdu.config.SR_2_LAN___.dyn-proc-params.hardwareName=O
The WA has been successfully applied. Source: s3://ops-rs-failed-workdir/s3-pug-preint-part1-execution-worker-v5-5fc6cb6788-mr9zg_S3B_SR_0_SRA__20230409T231430_20230409T232430_20230410T001628_0599_078_130____LN3_D_NR_002.SEN3_0664b6d5-c6c8-4086-a6c3-4377502e49ab_0/JobOrder.4737.xml
On the PUG EW logs, the error disappeared:
kp logs s3-pug-preint-part1-execution-worker-v5-5fc6cb6788-mr9zg -c s3-pug-preint-part1-execution-worker-v5 | grep stack | wc -l
0
I add the label workaround and decrease the priority.
SYS_CCB_w17 : Release 1.13 solves the issue (refer to https://github.com/COPRS/processing-sentinel-3/releases/tag/1.13.1-rc1)
Environment:
Traceability: S3 L1 PUG Deployment
Current Behavior: All PUG jobs end with the following error:
Expected Behavior: IPF S3 L1 PUG successfully consolidates S3 L0 products
Steps To Reproduce: Produce L0 ISP (from S3-L0p), then deploy the PUG NRT add-ons.
Test execution artefacts (i.e. logs, screenshots…) https://app.zenhub.com/files/398313496/2cb548e3-e6e8-43ab-92d2-a69afdf414ce/download
Whenever possible, first analysis of the root cause: On the container, there is the following node:
No clue from resource consumption:
Bug Generic Definition of Ready (DoR)
Bug Generic Definition of Done (DoD)