COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, ivv bugs, epics ...)

[BUG] [OPS] S3 PUG-NTC Execution failed with error code 139 for products SY_2_VG1 #1052

Closed suberti-ads closed 1 year ago

suberti-ads commented 1 year ago

Environment:

Traceability:

Current Behavior: Execution failed with error code 139 for products SY_2_VG1

[code 290] [exitCode 139] [msg Task /usr/local/components/PUG-3.48/bin/PUGCoreProcessor failed]
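For context, an exit code above 128 conventionally means the process was terminated by signal (code - 128), so 139 maps to signal 11, which matches the "SignalDispatcher 11" line in the execution log. A minimal sketch of that arithmetic; the helper name `signal_from_exit_code` is hypothetical and is not part of COPRS or the PUG:

```python
import signal


def signal_from_exit_code(exit_code: int):
    """Return the signal name for a shell-style exit code (128 + N), else None."""
    if exit_code <= 128:
        return None  # ordinary exit status, not a signal
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


if __name__ == "__main__":
    print(signal_from_exit_code(139))
```

On Linux this reports SIGSEGV for 139, i.e. a segmentation fault inside PUGCoreProcessor rather than a kill by the platform.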

Expected Behavior: Production successfully done with nominal production input.

Steps To Reproduce: Start 3% production or 24h test

Test execution artefacts (i.e. logs, screenshots…) Execution logs: NewErrorPUG-NTCjobOrder126304.txt

Job generated: Job126304.log

Whenever possible, first analysis of the root cause: All failures concern SY_2_VG1 products. Sample for job order JobOrder.126304.xml:

Product: S3A_SY_2_VG1____20230428T000000_20230428T235900_20230725T143631_GLOBAL____________MAR_D_NT_002

Production interval:

    "startTime" : "2023-04-28T00:00:00.000000Z",
    "stopTime" : "2023-04-28T23:59:00.000000Z",
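The production interval can be cross-checked against the product name. The sketch below assumes that the three consecutive timestamps in the name are, in order, start, stop and creation time, which is consistent with the startTime and stopTime shown above; `product_interval` is a hypothetical helper, not a COPRS function:

```python
import re
from datetime import datetime

# Assumption: three consecutive YYYYMMDDTHHMMSS fields = start, stop, creation.
_TIMESTAMPS = re.compile(r"_(\d{8}T\d{6})_(\d{8}T\d{6})_(\d{8}T\d{6})_")


def product_interval(product_name: str):
    """Return (start, stop) parsed from the timestamp triplet in a product name."""
    match = _TIMESTAMPS.search(product_name)
    if match is None:
        raise ValueError(f"no timestamp triplet in {product_name!r}")
    start, stop, _creation = (
        datetime.strptime(field, "%Y%m%dT%H%M%S") for field in match.groups()
    )
    return start, stop


if __name__ == "__main__":
    name = ("S3A_SY_2_VG1____20230428T000000_20230428T235900_"
            "20230725T143631_GLOBAL____________MAR_D_NT_002.SEN3")
    print(product_interval(name))
```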

The log shows the following error during production (entries are listed newest first):

2023-07-26T16:01:42+00:00   {"header":{"type":"LOG","timestamp":"2023-07-26T16:01:42.262125Z","level":"INFO","line":129,"file":"TaskCallable.java","thread":"pool-250-thread-1"},"message":{"content":"Ending task /usr/local/components/PUG-3.48/bin/PUGCoreProcessor with exit code 139"},"custom":{"logger_string":"esa.s1pdgs.cpoc.ipf.execution.worker.job.process.TaskCallable"}}
2023-07-26T16:01:42+00:00    
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libApp.so.13.0 ( main )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libApp.so.13.0 ( acs::Application::run(int, char const* const*, char const* const*) )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libStandaloneApp.so.5.16 ( acs::IPFStandaloneApp::start(int, char const* const*, char const* const*) )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libStandaloneApp.so.5.16 ( acs::StandaloneApp::start(int, char const* const*, char const* const*) )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libApp.so.13.0 ( acs::Application::start(int, char const* const*, char const* const*) )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libApp.so.13.0 ( acs::Application::startMain(int, char const* const*, char const* const*) )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/PUGCoreProcessor ( acs::PUGCoreProcessor::main(int, char const* const*, char const* const*) )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/PUGCoreProcessor ( acs::PUGCoreProcessor::execute(std::string const&, std::string const&, acs::DateTime const&) )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libS3PDUGenerator.so.2.2 ( acs::PDUGenerator::createPDUs() )
2023-07-26T16:01:42+00:00       /lib64/libc.so.6 ( usleep )
2023-07-26T16:01:42+00:00       /lib64/libc.so.6 ( nanosleep )
2023-07-26T16:01:42+00:00       /lib64/libpthread.so.0 (  )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libSignal.so.6.2 ( acs::SignalDispatcher::catchPrintStackSignal(int) )
2023-07-26T16:01:42+00:00   Stack trace for thread 140116091156672 - "Main"
2023-07-26T16:01:42+00:00   
2023-07-26T16:01:42+00:00       /lib64/libc.so.6 ( clone )
2023-07-26T16:01:42+00:00       /lib64/libpthread.so.0 (  )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/../lib/libThread.so.7.0 ( acs::Thread::svc(void*) )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libApp.so.13.0 ( acs::Application::ApplicationStopManager::run() )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libException.so.15.1 ( acs::StopController::timeoutOrCancel(unsigned long) const )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libException.so.15.1 ( acs::Condition::timedwait(unsigned long) const )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libException.so.15.1 ( acs::Condition::timedwait(timespec const&) const )
2023-07-26T16:01:42+00:00       /lib64/libpthread.so.0 ( pthread_cond_timedwait )
2023-07-26T16:01:42+00:00       /lib64/libpthread.so.0 (  )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libSignal.so.6.2 ( acs::SignalDispatcher::catchPrintStackSignal(int) )
2023-07-26T16:01:42+00:00   Stack trace for thread 140115653666560 - "ApplicationStopManagerThread"
2023-07-26T16:01:42+00:00   
2023-07-26T16:01:42+00:00       /lib64/libc.so.6 ( clone )
2023-07-26T16:01:42+00:00       /lib64/libpthread.so.0 (  )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/../lib/libThread.so.7.0 ( acs::Thread::svc(void*) )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libS3PDUGenerator.so.2.2 ( acs::TileGeneratorThread::run() )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libS3PDUGenerator.so.2.2 ( acs::TileGeneratorThread::createPDU() )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libS3PDUGenerator.so.2.2 ( acs::PDUGeneratorThread::setInfoForStatistics() )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/PUGCoreProcessor ( std::string::assign(std::string const&) )
2023-07-26T16:01:42+00:00       /lib64/libpthread.so.0 (  )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libSignal.so.6.2 ( acs::SignalDispatcher::catchBadSignal(int) )
2023-07-26T16:01:42+00:00       /usr/local/components/PUG-3.48/bin/../lib/libSignal.so.6.2 ( acs::SignalDispatcher::catchPrintStackSignal(int) )
2023-07-26T16:01:42+00:00   Stack trace for thread 140115641050880 - "unnamedThread"
2023-07-26T16:01:42+00:00    core in: /tmp/core.560 - stack follow
2023-07-26T16:01:42+00:00    PID: 560 SignalDispatcher 11 THREAD: 140115641050880
2023-07-26T16:01:42+00:00   2023-07-26T16:01:42.097740 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [E] PUGCoreProcessor: Wed Jul 26 16:01:42 2023
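To find the faulting thread in a dump like this one, the frames can be grouped per thread and searched for the bad-signal handler. The sketch below is only an illustration: it assumes the newest-first layout shown above (each "Stack trace for thread" header comes after its frames) and treats a `catchBadSignal` frame as the marker of the crashing thread. The function names are hypothetical.

```python
import re

# "<path> ( <symbol> )" at the end of a log line; the symbol may be empty.
FRAME = re.compile(r"(/\S+)\s+\(\s(.*)\s\)\s*$")
HEADER = re.compile(r'Stack trace for thread (\d+) - "([^"]*)"')


def threads_from_log(lines):
    """Map (thread id, thread name) to its frames, in the order they appear."""
    threads, pending = {}, []
    for line in lines:
        header = HEADER.search(line)
        if header:
            # In a newest-first log the header closes the frames read so far.
            threads[(header.group(1), header.group(2))] = pending
            pending = []
            continue
        frame = FRAME.search(line)
        if frame:
            pending.append(frame.group(2))
    return threads


def crashed_threads(threads):
    """Return the threads whose stack contains the bad-signal handler."""
    return [key for key, frames in threads.items()
            if any("catchBadSignal" in symbol for symbol in frames)]
```

Applied to the trace above, this singles out thread 140115641050880 ("unnamedThread"), the same thread named in the "SignalDispatcher 11 THREAD" line, with `std::string::assign` called from `acs::PDUGeneratorThread::setInfoForStatistics()` as the last application frames before the signal.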

No OOM kill or related event was seen for this production. The error occurred at about 2023-07-26T16:00:00 on node-141; all node logs around that hour follow:

Jul 26 15:59:32 cluster-ops-node-141 systemd[1]: Starting Daily apt download activities...
Jul 26 15:59:33 cluster-ops-node-141 systemd[1]: apt-daily.service: Succeeded.
Jul 26 15:59:33 cluster-ops-node-141 systemd[1]: Finished Daily apt download activities.
Jul 26 16:02:03 cluster-ops-node-141 systemd[1]: run-containerd-runc-k8s.io-1410387f7897451a102f0cb5a1abc48c3ff486441a7e892372c910fa5880d03e-runc.JYY1oH.mount: Succeeded.
Jul 26 16:17:01 cluster-ops-node-141 CRON[1846135]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 26 16:26:33 cluster-ops-node-141 systemd[1]: run-containerd-runc-k8s.io-1410387f7897451a102f0cb5a1abc48c3ff486441a7e892372c910fa5880d03e-runc.GS3KzB.mount: Succeeded.
Jul 26 16:28:53 cluster-ops-node-141 systemd[1]: run-containerd-runc-k8s.io-1410387f7897451a102f0cb5a1abc48c3ff486441a7e892372c910fa5880d03e-runc.ovmXdf.mount: Succeeded.
Jul 26 17:00:26 cluster-ops-node-141 kubelet[890]: I0726 17:00:26.005787     890 prober.go:116] "Probe failed" probeType="Liveness" pod="security/falco-lr8kc" podUID=91d10

No resource issue: PodRessources

No Kubernetes event seen. The issue is reproduced each time I restart the execution.

No core file has been found on the node or in the pod.

Bug Generic Definition of Ready (DoR)

Bug Generic Definition of Done (DoD)

SYTHIER-ADS commented 1 year ago

Is this configuration line correct? We see spaces as separators, but one would expect ";".

app.preparation-worker.pdu.config.SY_2_VG1.dynProcParams.TileIdentifiers=AFRICA____ NORTH_AMERICA__ SOUTH_AMERICA____ CENTRAL_AMERICA NORTH_ASIA_ WEST_ASIA__ SOUTH_EAST_ASIA ASIAN_ISLANDS AUSTRALASIA__ EUROPE_____

app.preparation-worker.pdu.config.SY_2_VG1___.dynProcParams.TileCoordinates=[-35.0 -26.0 -35.0 60.0 38.0 60.0 38.0 -26.0 -35.0 -26.0];[40.0 -180.0 40.0 -13.0 75.0 -13.0 75.0 -180.0 40.0 -180.0];[0.0 -125.0 0.0 -50.0 50.0 -50.0 50.0 -125.0 0.0 -125.0];[-56.0 -93.0 -56.0 -33.0 25.0 -33.0 25.0 -93.0 -56.0 -93.0];[40.0 45.0 40.0 180.0 75.0 180.0 75.0 45.0 40.0 45.0];[5.0 25.0 5.0 98.0 50.0 98.0 50.0 25.0 5.0 25.0];[5.0 68.0 5.0 147.0 55.0 147.0 55.0 68.0 5.0 68.0];[-12.0 92.0 -12.0 170.0 29.0 170.0 29.0 92.0 -12.0 92.0];[-48.0 95.0 -48.0 180.0 10.0 180.0 10.0 95.0 -48.0 95.0];[25.0 -11.0 25.0 62.0 75.0 62.0 75.0 -11.0 25.0 -11.0];
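A quick consistency check on the two values can rule out simple mismatches before suspecting the separator. The sketch below assumes the format visible in the configuration itself (identifiers separated by whitespace, polygons separated by ";", each polygon a bracketed list of number pairs); it is not the PUG's own parser, and `parse_tile_coordinates` and `check_tiles` are hypothetical names:

```python
def parse_tile_coordinates(value: str):
    """Split '[a b c d ...];[...]' into one list of (float, float) pairs per tile."""
    tiles = []
    for chunk in value.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate the trailing ';'
        numbers = [float(n) for n in chunk.strip("[]").split()]
        if len(numbers) % 2:
            raise ValueError(f"odd number of values in {chunk!r}")
        tiles.append(list(zip(numbers[0::2], numbers[1::2])))
    return tiles


def check_tiles(identifiers: str, coordinates: str):
    """Return a list of human-readable problems; an empty list means consistent."""
    names = identifiers.split()
    rings = parse_tile_coordinates(coordinates)
    problems = []
    if len(names) != len(rings):
        problems.append(f"{len(names)} identifiers but {len(rings)} polygons")
    for index, ring in enumerate(rings):
        if not ring or ring[0] != ring[-1]:
            problems.append(f"polygon {index} is not closed")
    return problems
```

With the values quoted above, both lists have ten entries and every polygon repeats its first point at the end, so this check alone does not point to a configuration error.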

w-fsi commented 1 year ago

@SYTHIER-ADS: May I ask why you think this is the issue? Your question is difficult to answer. We are quite sure that this approach is the right one, as it is used exactly that way in the legacy project and the simulator also indicates that it should be used that way. We basically have no up-to-date documentation for the PUG, only a few pages on these parameters in the old PUG ICD from 2016, which indeed asks for GML and also describes the use of the identifiers differently.

We double-checked, however, whether some routine in the old system translates this information into GML, and couldn't find one. The configuration was taken exactly as it was from the legacy system. It might be that the ICD was updated afterwards and that a simpler configuration was chosen, as markup is very error-prone.

So if you have any input on these parameters, please share it with us. I would also be really interested in why you assume this to be the problem, so that we are not hunting phantoms here while the actual error is triggered by something else. I agree, however, that something like this might cause a SIGSEGV.

The parameters should be added directly to the JO, so in theory, when the other format is used in the configuration, it should end up in the JO exactly as configured. I would recommend, however, manually executing the PUG on a failed working directory with modified parameters to see whether that actually fixes the issue.

SYTHIER-ADS commented 1 year ago

@w-fsi, thanks for the feedback, so what is your analysis?

w-jka commented 1 year ago

@SYTHIER-ADS

2023-07-26T16:01:42+00:00   2023-07-26T16:01:42.093815 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (25.000000, -11.000000); (25.000000, 62.000000); (75.000000, 62.000000); (75.000000, -11.000000); (25.000000, -11.000000) ]
2023-07-26T16:01:42+00:00   2023-07-26T16:01:42.093445 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (-48.000000, 95.000000); (-48.000000, 180.000000); (10.000000, 180.000000); (10.000000, 95.000000); (-48.000000, 95.000000) ]
2023-07-26T16:01:42+00:00   2023-07-26T16:01:42.093099 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (-12.000000, 92.000000); (-12.000000, 170.000000); (29.000000, 170.000000); (29.000000, 92.000000); (-12.000000, 92.000000) ]
2023-07-26T16:01:42+00:00   2023-07-26T16:01:42.092843 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (5.000000, 68.000000); (5.000000, 147.000000); (55.000000, 147.000000); (55.000000, 68.000000); (5.000000, 68.000000) ]
2023-07-26T16:01:42+00:00   2023-07-26T16:01:42.092572 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (5.000000, 25.000000); (5.000000, 98.000000); (50.000000, 98.000000); (50.000000, 25.000000); (5.000000, 25.000000) ]
2023-07-26T16:01:42+00:00   2023-07-26T16:01:42.092290 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (40.000000, 45.000000); (40.000000, 180.000000); (75.000000, 180.000000); (75.000000, 45.000000); (40.000000, 45.000000) ]
2023-07-26T16:01:42+00:00   2023-07-26T16:01:42.092016 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (-56.000000, -93.000000); (-56.000000, -33.000000); (25.000000, -33.000000); (25.000000, -93.000000); (-56.000000, -93.000000) ]
2023-07-26T16:01:42+00:00   2023-07-26T16:01:42.091724 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (0.000000, -125.000000); (0.000000, -50.000000); (50.000000, -50.000000); (50.000000, -125.000000); (0.000000, -125.000000) ]
2023-07-26T16:01:42+00:00   2023-07-26T16:01:42.091416 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (40.000000, -180.000000); (40.000000, -13.000000); (75.000000, -13.000000); (75.000000, -180.000000); (40.000000, -180.000000) ]
2023-07-26T16:01:42+00:00   2023-07-26T16:01:42.091108 s3-pug-ntc-part1-execution-worker-v2-698877f8f7-x8vpx PUG_SY_2_VG1 03.48 [0000000560]: [I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] PDU requested for coordinates: [ (-35.000000, -26.000000); (-35.000000, 60.000000); (38.000000, 60.000000); (38.000000, -26.000000); (-35.000000, -26.000000) ]

These logs indicate that the dynamic processing parameter for the geographic definition of the tiles is working as intended. The processor correctly identifies the list and prints it in a different format, meaning it was successfully parsed.
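The same comparison can be automated by reading the coordinate pairs back out of the "PDU requested for coordinates" lines and comparing them with the configured values. A minimal sketch, assuming the log format shown above; `ring_from_log_line` is a hypothetical helper:

```python
import re

# One "(x, y)" pair as printed by the processor, e.g. "(25.000000, -11.000000)".
PAIR = re.compile(r"\((-?\d+\.\d+), (-?\d+\.\d+)\)")
MARKER = "PDU requested for coordinates:"


def ring_from_log_line(line: str):
    """Return the list of (float, float) pairs logged after the marker, if any."""
    _, _, tail = line.partition(MARKER)
    return [(float(a), float(b)) for a, b in PAIR.findall(tail)]


if __name__ == "__main__":
    line = ("[I] PUGCoreProcessor: [TileGenerator.C: getPDUsToProcess:(186)] "
            "PDU requested for coordinates: [ (25.000000, -11.000000); "
            "(25.000000, 62.000000); (75.000000, 62.000000); "
            "(75.000000, -11.000000); (25.000000, -11.000000) ]")
    print(ring_from_log_line(line))
```

The ring recovered from the first log line matches the last polygon of the configured TileCoordinates value (25.0 -11.0 25.0 62.0 75.0 62.0 75.0 -11.0 25.0 -11.0).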

My first assumption, based on the point in time at which the error occurs, is that the IPF cannot handle the empty dynamic processing parameters pduLength and MtdPDUFrameNumbers well. Although they are not needed for TILE PDUs, they are still listed for VG1 and V10 products. To validate this assumption, I would need you to update the tasktable_configmap (task table TaskTable.PUG_SY_2_VG1.03.xml) with the following part:

<List_of_Dyn_ProcParams count="10">
        <Dyn_ProcParam>
          <Param_Name>hardwareName</Param_Name>
          <Param_Type>string</Param_Type>
          <Param_Default></Param_Default>
        </Dyn_ProcParam>
        <Dyn_ProcParam>
          <Param_Name>orderType</Param_Name> <!-- Timeliness of the product -->
          <Param_Type>String</Param_Type>
          <Param_Default>NRT</Param_Default>
        </Dyn_ProcParam>
        <Dyn_ProcParam>
          <Param_Name>facilityName</Param_Name>
          <Param_Type>string</Param_Type>
          <Param_Default>MAR</Param_Default>
        </Dyn_ProcParam>
        <Dyn_ProcParam>
          <Param_Name>TileCoordinates</Param_Name>
          <Param_Type>string</Param_Type>
          <Param_Default></Param_Default>
        </Dyn_ProcParam>
        <Dyn_ProcParam>
          <Param_Name>TileIdentifiers</Param_Name>
          <Param_Type>string</Param_Type>
          <Param_Default></Param_Default>
        </Dyn_ProcParam>
        <Dyn_ProcParam>
          <Param_Name>QcApply</Param_Name>
          <Param_Type>string</Param_Type>
          <Param_Default>false</Param_Default>
        </Dyn_ProcParam>
        <Dyn_ProcParam>
          <Param_Name>browseStubMode</Param_Name>          <!-- Provided by PDGS-->
          <Param_Type>String</Param_Type>
          <Param_Default>false</Param_Default>
        </Dyn_ProcParam>
        <Dyn_ProcParam>
          <!-- Dyn parameter for OLQC-->
          <Param_Name>OLQCReportTemplate</Param_Name>
          <Param_Type>String</Param_Type>
          <Param_Default>OLQC_Main.jasper</Param_Default>
        </Dyn_ProcParam>    
              <Dyn_ProcParam>
                  <Param_Name>baselineCollection</Param_Name>          <!-- Provided by PDGS-->
                  <Param_Type>String</Param_Type>
                  <Param_Default>002</Param_Default>
              </Dyn_ProcParam>
        <Dyn_ProcParam>
          <Param_Name>pduType</Param_Name>
          <Param_Type>String</Param_Type>
          <Param_Default>tile</Param_Default>
        </Dyn_ProcParam>
      </List_of_Dyn_ProcParams>
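When editing this list by hand, it is easy to forget the count attribute or to leave a parameter without a default. The sketch below checks both on a List_of_Dyn_ProcParams fragment; it relies only on the element names visible above, and `inspect_dyn_proc_params` is a hypothetical helper. Whether a missing Param_Default is really what makes the IPF crash remains the assumption under test.

```python
import xml.etree.ElementTree as ET


def inspect_dyn_proc_params(xml_fragment: str):
    """Return (count attribute matches, names of parameters without Param_Default)."""
    root = ET.fromstring(xml_fragment)  # expects <List_of_Dyn_ProcParams> as root
    params = root.findall("Dyn_ProcParam")
    declared = int(root.get("count"))
    # A default that is only present inside an XML comment counts as missing,
    # because the default parser drops comments.
    without_default = [
        param.findtext("Param_Name")
        for param in params
        if param.find("Param_Default") is None
    ]
    return declared == len(params), without_default
```

For the list above this should return a matching count and no parameter without a Param_Default element (empty defaults such as hardwareName still count as present).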

For this workaround to take effect, the job order has to be regenerated by the preparation-worker, because the change is applied on the preparation-worker side.

Woljtek commented 1 year ago

To test the workaround proposed by @w-jka, I applied the following procedure:

1. Identify the failed JobOrders with Grafana (4 errors): JobOrder.126304.xml JobOrder.129238.xml JobOrder.129309.xml JobOrder.129239.xml

2. In MongoDB, get the productName that triggered each job:

rs0:PRIMARY> db.appDataJob.find({"pod" : { $regex: "pug-ntc"}, "_id":126304}, {"productName":1})
{ "_id" : NumberLong(126304), "productName" : "S3A_SY_2_VG1____20230428T000000_20230428T235900_20230725T143631_GLOBAL____________MAR_D_NT_002.SEN3" }
rs0:PRIMARY> db.appDataJob.find({"pod" : { $regex: "pug-ntc"}, "_id":129238}, {"productName":1})
{ "_id" : NumberLong(129238), "productName" : "S3A_SY_2_VG1____20230409T000000_20230409T235900_20230727T030335_GLOBAL____________MAR_D_NT_002.SEN3" }
rs0:PRIMARY> db.appDataJob.find({"pod" : { $regex: "pug-ntc"}, "_id":129309}, {"productName":1})
{ "_id" : NumberLong(129309), "productName" : "S3B_SY_2_VG1____20230409T000000_20230409T235900_20230728T030335_GLOBAL____________MAR_D_NT_002.SEN3" }
rs0:PRIMARY> db.appDataJob.find({"pod" : { $regex: "pug-ntc"}, "_id":129239}, {"productName":1})
{ "_id" : NumberLong(129239), "productName" : "S3B_SY_2_VG1____20230428T000000_20230428T235900_20230726T151656_GLOBAL____________MAR_D_NT_002.SEN3" }

3. Stop the RS addon

#Gateway
kp get po | grep pug-ntc

The RS addon is stopped.

4. Apply the WA. Here is the diff on tasktable_configmap.yaml (branch rs-1052):

git diff tasktable_configmap.yaml
diff --git a/apps/rs-addon/rs-addon-s3-pug-ntc_Executables/additional_resources/tasktable_configmap.yaml b/apps/rs-addon/rs-addon-s3-pug-ntc_Executables/additional_resources/tasktable_configmap.yaml
index 9faa574..de537eb 100644
--- a/apps/rs-addon/rs-addon-s3-pug-ntc_Executables/additional_resources/tasktable_configmap.yaml
+++ b/apps/rs-addon/rs-addon-s3-pug-ntc_Executables/additional_resources/tasktable_configmap.yaml
@@ -4377,7 +4377,7 @@ data:
           <!-- config for  OLQC Quality Checks QC-Check Configuration-->
         </Cfg_File>
       </List_of_Cfg_Files>
-      <List_of_Dyn_ProcParams count="12">
+      <List_of_Dyn_ProcParams count="10">
         <Dyn_ProcParam>
           <Param_Name>hardwareName</Param_Name>
           <Param_Type>string</Param_Type>
@@ -4424,16 +4424,6 @@ data:
                   <Param_Type>String</Param_Type>
                   <Param_Default>002</Param_Default>
               </Dyn_ProcParam>
-        <Dyn_ProcParam>
-          <Param_Name>pduLength</Param_Name>
-          <Param_Type>number</Param_Type>
-          <!--Param_Default>180</Param_Default-->
-        </Dyn_ProcParam>
-        <Dyn_ProcParam>
-          <Param_Name>MtdPDUFrameNumbers</Param_Name>
-          <Param_Type>number</Param_Type>
-          <!--Param_Default>0</Param_Default-->
-        </Dyn_ProcParam>
         <Dyn_ProcParam>
           <Param_Name>pduType</Param_Name>
           <Param_Type>String</Param_Type>

5. Delete the jobs in MongoDB

rs0:PRIMARY> db.appDataJob.deleteMany({"_id":126304})
{ "acknowledged" : true, "deletedCount" : 1 }
rs0:PRIMARY> db.appDataJob.deleteMany({"_id":129238})
{ "acknowledged" : true, "deletedCount" : 1 }
rs0:PRIMARY> db.appDataJob.deleteMany({"_id":129309})
{ "acknowledged" : true, "deletedCount" : 1 }
rs0:PRIMARY> db.appDataJob.deleteMany({"_id":129239})
{ "acknowledged" : true, "deletedCount" : 1 }
rs0:PRIMARY> 

6. Start the RS addon

#Bastion
ansible-playbook deploy-rs-addon.yaml -i inventory/sample/hosts.yaml -e rs_addon_location=../apps/rs-addon/rs-addon-s3-pug-ntc.zip -e stream_name=s3-pug-ntc

#Gateway
kp get po | grep pug-ntc
s3-pug-ntc-part1-execution-worker-v3-c6ddc5cf6-nz644              0/2     Pending   0                6m
s3-pug-ntc-part1-message-filter-v3-56bd94ffc-jjhlv                2/2     Running   0                5m58s
s3-pug-ntc-part1-preparation-worker-v3-d64fb7bb7-l6k8j            2/2     Running   0                5m52s
s3-pug-ntc-part2-housekeep-v3-7669cd666-2dgbh                     2/2     Running   0                5m52s
s3-pug-ntc-part2-time-v3-5dbbb5f8f5-qcxrb                         2/2     Running   0                5m54s

kp describe cm s3-pug-ntc-tasktables | grep -A55  "PUG_SY_2_VG1.03" | grep "pduLength" | wc -l
0

7. Republish the messages ahead of the preparation-worker

PN_LIST="S3A_SY_2_VG1____20230428T000000_20230428T235900_20230725T143631_GLOBAL____________MAR_D_NT_002 S3B_SY_2_VG1____20230428T000000_20230428T235900_20230726T151656_GLOBAL____________MAR_D_NT_002 S3B_SY_2_VG1____20230409T000000_20230409T235900_20230728T030335_GLOBAL____________MAR_D_NT_002 S3A_SY_2_VG1____20230409T000000_20230409T235900_20230727T030335_GLOBAL____________MAR_D_NT_002"

TOPIC=s3-pug-ntc-part1.message-filter
# NAMESPACE_KAFKA, POD, CONTAINER and BOOTSTRAP_URL must be set beforehand.
for PN in $PN_LIST; do
   echo "--------------- $PN -----------------"
   MESSAGE_FILE="$PN.mf_msg"
   # Read the topic from the beginning and keep only the messages for this product
   kubectl -n "${NAMESPACE_KAFKA}" exec -ti "${POD}" -c "${CONTAINER}" -- bash /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server "${BOOTSTRAP_URL}" --topic "${TOPIC}" --from-beginning --timeout-ms 20000 | grep "$PN" > "$MESSAGE_FILE"
   # Copy the messages into the Kafka pod and publish them again on the same topic
   kubectl cp -n "${NAMESPACE_KAFKA}" -c "${CONTAINER}" "${MESSAGE_FILE}" "${POD}:/tmp/message.txt"
   kubectl -n "${NAMESPACE_KAFKA}" exec -ti "${POD}" -c "${CONTAINER}" -- sh -c "/opt/kafka/bin/kafka-console-producer.sh --bootstrap-server ${BOOTSTRAP_URL} --topic ${TOPIC} < /tmp/message.txt"
   rm -v "$MESSAGE_FILE"
done

Woljtek commented 1 year ago

Jobs have been successfully recreated:

rs0:PRIMARY> db.appDataJob.find({"pod" : { $regex: "pug-ntc"},"creationDate":{$gt: ISODate("2023-07-31T13:00:00.000Z")}}, {"productName":1, "generation.state":1});
{ "_id" : NumberLong(132735), "productName" : "S3A_SY_2_VG1____20230428T000000_20230428T235900_20230725T143631_GLOBAL____________MAR_D_NT_002.SEN3", "generation" : { "state" : "SENT" } }
{ "_id" : NumberLong(132736), "productName" : "S3B_SY_2_VG1____20230428T000000_20230428T235900_20230726T151656_GLOBAL____________MAR_D_NT_002.SEN3", "generation" : { "state" : "SENT" } }
{ "_id" : NumberLong(132737), "productName" : "S3B_SY_2_VG1____20230409T000000_20230409T235900_20230728T030335_GLOBAL____________MAR_D_NT_002.SEN3", "generation" : { "state" : "SENT" } }
{ "_id" : NumberLong(132738), "productName" : "S3A_SY_2_VG1____20230409T000000_20230409T235900_20230727T030335_GLOBAL____________MAR_D_NT_002.SEN3", "generation" : { "state" : "SENT" } }

Job executions are waiting for an available node and for the lag to be consumed: image.png

suberti-ads commented 1 year ago

Good news! Production completed successfully after applying the WA:

image

LAQU156 commented 1 year ago

System_CCB_2023_w31 : The workaround proposed by Werum works. Moved into "Accepted Werum"

suberti-ads commented 1 year ago

We can decrease priority as we have a workaround which works

Woljtek commented 1 year ago

System_CCB_2023_w34: Handled by version 1.14.0

Woljtek commented 1 year ago

System_CCB_2023_w34: Successfully tested (fix=wa)