COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, ivv bugs, epics ...)

[BUG] [OPS] Missing S1 AIO End of activity trace #487

Closed suberti-ads closed 2 years ago

suberti-ads commented 2 years ago

This behavior seems slightly different from the one described in https://github.com/COPRS/rs-issues/issues/455, so I prefer to open a separate topic.

Environment:
Platform: RS V1.0.1 OPS Orange Cloud
Configuration: rs-config branch op based on 0.5.0-rc1 MIS / rs-processing-config 0.4.0-rc2

Current Behavior: An AIO end-of-activity trace has been lost even though the execution worker seems to have completed successfully.

Expected Behavior: There is no missing trace.

Steps To Reproduce: Issue seen on the operational environment. Currently only one case has been detected.

Test execution artefacts (i.e. logs, screenshots…): Due to log rotation (the activity completed at about 2022-07-25T06:03:53.476, while the first log line still retained is from 2022-07-25T06:03:55.025000Z), I cannot see the trace that should have been sent to the catalog.

 safescale  gw-ops-cluster  ~  kubectl logs -n processing s1pro-l0-aio-ipf-execution-worker-0 -c s1pro-l0-aio-ipf-execution-worker | head -1
{"header":{"type":"LOG","timestamp":"2022-07-25T06:03:55.025000Z","level":"INFO","line":204,"file":"JobProcessor.java","thread":"pool-2-thread-1"},"message":{"content":"Initializing job processing [...]

I used mongo and the internal pod logs to find the activity execution information.

Whenever possible, first analysis of the root cause

On the S1 AIO execution worker: there are 5 AIO executions OK, but only 4 traces have been ingested in the trace catalog:

rs0:PRIMARY> db.mqiMessage.find({"lastSendDate":{"$gte": new ISODate("2022-07-24T23:59:59Z")},"group":"l0-aio-ipf-execution-worker"},{"dto.keyObjectStorage":1,"state":1});
{ "_id" : NumberLong(2406590), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725010722044250" } }
{ "_id" : NumberLong(2407898), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725024554044251" } }
{ "_id" : NumberLong(2410955), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725042413044252" } }
{ "_id" : NumberLong(2411664), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725043407044252" } }
{ "_id" : NumberLong(2411899), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725044105044252" } }

Missing trace for session S1A_20220725043407044252 on NSG station:

[Screenshots attached: AIOTraceLost, AIOTraceLost2]
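A sketch of how the trace catalog could be cross-checked from the command line instead of the dashboard, assuming the traces are indexed in an Elasticsearch instance reachable as elasticsearch-processing:9200 (hostname is hypothetical; adapt to the actual service):

curl -s "http://elasticsearch-processing:9200/_search" -H 'Content-Type: application/json' \
  -d '{"query":{"query_string":{"query":"\"S1A_20220725043407044252\""}},"size":10}'
# Only the traces actually ingested for this session would be returned; the end-of-activity one is expected to be absent.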

Hereafter the internal pod log entries found for this execution:

bash-4.2$ cat  logfile.log  | grep S1A_20220725043407044252 | head -1
2022-07-25T05:52:55.268 | INFO  | e.s.c.i.e.w.s.JobProcessor [pool-2-thread-1]: Initializing job processing {id: 2411664, inputKey: t-pdgs-aio-execution-jobs, body: IpfExecutionJob [productFamily=L0_JOB, keyObjectStorage=S1A_20220725043407044252, creationDate=Mon Jul 25 05:52:54 UTC 2022, hostname=s1pro-l0-aio-ipf-preparation-worker-0, productProcessMode=NOMINAL, workDirectory=/data/localWD/10002/, jobOrder=/data/localWD/10002/JobOrder.10002.xml, timeliness=, inputs=[{family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml, contentRef: S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF, contentRef: S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF, contentRef: S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF}, {family: JOB_ORDER, localPath: /data/localWD/10002/JobOrder.10002.xml, contentRef: <?xml version="1.0" encoding="UTF-8"?>
bash-4.2$ cat  logfile.log  | grep 2411664                           
2022-07-25T05:52:55.268 | INFO  | e.s.c.i.e.w.s.JobProcessor [pool-2-thread-1]: Initializing job processing {id: 2411664, inputKey: t-pdgs-aio-execution-jobs, body: IpfExecutionJob [productFamily=L0_JOB, keyObjectStorage=S1A_20220725043407044252, creationDate=Mon Jul 25 05:52:54 UTC 2022, hostname=s1pro-l0-aio-ipf-preparation-worker-0, productProcessMode=NOMINAL, workDirectory=/data/localWD/10002/, jobOrder=/data/localWD/10002/JobOrder.10002.xml, timeliness=, inputs=[{family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml, contentRef: S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF, contentRef: S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF, contentRef: S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF}, {family: JOB_ORDER, localPath: /data/localWD/10002/JobOrder.10002.xml, contentRef: <?xml version="1.0" encoding="UTF-8"?>
2022-07-25T06:03:53.476 | INFO  | e.s.c.m.c.MqiConsumer [pool-2-thread-1]: MqiConsumer [category=LEVEL_JOBS] handled 2411664 successfully, done! 
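As a complementary check (a sketch, not part of the original report), the retained container stdout can be grepped for the session to confirm that the lines covering this execution, including any trace, are no longer available:

kubectl logs -n processing s1pro-l0-aio-ipf-execution-worker-0 \
  -c s1pro-l0-aio-ipf-execution-worker | grep -c "S1A_20220725043407044252"
# A count of 0 confirms that the stdout covering this execution was already rotated away.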

The pod was hosted on node ops-cluster-node-21 and no error log is present on the corresponding fluent-bit pod:

 safescale  gw-ops-cluster  ~  kubectl get po -n processing -o wide | grep aio
s1pro-l0-aio-ipf-execution-worker-0                       3/3     Running     0             10d     10.244.9.26      ops-cluster-node-21   <none>           <none>
s1pro-l0-aio-ipf-preparation-worker-0                     3/3     Running     0             10d     10.244.240.42    ops-cluster-node-17   <none>           <none>
s1pro-l0-aio-production-trigger-0                         3/3     Running     0             10d     10.244.78.87     ops-cluster-node-3    <none>           <none>
 safescale  gw-ops-cluster  ~  kubectl get po -n logging -o wide | grep ops-cluster-node-21
fluent-bit-mhjpv   2/2     Running   0             4d7h   10.244.9.30      ops-cluster-node-21    <none>           <none>
 safescale  gw-ops-cluster  ~  kubectl logs -n logging fluent-bit-mhjpv -c fluent-bit 
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/07/21 10:06:57] [error] [parser] parser named 'cri' already exists, skip.
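If the problem reappears, the fluent-bit built-in monitoring endpoint could help spot dropped or failed records (a sketch, assuming the HTTP server is enabled on its default port 2020 and that curl is present in the image, which may not be the case):

kubectl exec -n logging fluent-bit-mhjpv -c fluent-bit -- \
  curl -s http://127.0.0.1:2020/api/v1/metrics
# Check the "errors", "retries_failed" and "dropped_records" counters of the output plugins.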

Note:

Bug Generic Definition of Done (DoD)

LAQU156 commented 2 years ago

IVV_CCB_2022_w30 : Under Analysis, priority minor, waiting for reproduction of the problem

LAQU156 commented 2 years ago

IVV_CCB_2022_w31 : For now, no reproduction of the issue, still under analysis

LAQU156 commented 2 years ago

IVV_CCB_2022_w32 : Still no reproduction of the issue; will be closed in a few weeks if the status does not change.

LAQU156 commented 2 years ago

IVV_CCB_2022_w33 : Still no reproduction

LAQU156 commented 2 years ago

IVV_CCB_2022_w34 : Still no reproduction --> Closed