Closed: suberti-ads closed this issue 1 year ago.
From what I understand, your theory is that the lifecycle logs are too long and as a result, the processing traces were not taken into account by fluentbit. Am I correct?
Yes, and because of these overly long logs, other traces on the same nodes were lost.
With my current understanding of fluentbit, I would say that when a file contains lines that are too long, fluentbit skips that file but not the other files it is supposed to process. I would say it is more likely that the S3 traces are not picked up by fluentbit at all. Could you try excluding the data-lifecycle logs from fluentbit to check whether the other logs are still skipped by fluentbit?
I also notice there are logs on Thursday (the day the S3 data are processed) but not after. Can you confirm whether there are logs afterwards nonetheless?
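For reference, a minimal sketch of what the suggested change to the tail input could look like; the Exclude_Path value and the Skip_Long_Lines alternative are assumptions to adapt to your configuration:
[INPUT]
Name tail
parser cri
Tag processing.log.*
Path /var/log/containers/*.log
# Assumption: drop the data-lifecycle container logs entirely
Exclude_Path /var/log/containers/*lifecycle-eviction*.log
# Alternative assumption: keep the files but drop only the over-long lines
Skip_Long_Lines On
Buffer_Max_Size 1MB
Mem_Buf_Limit 10MB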
I don't understand: I tried both workarounds (WA), but I still see errors in the fluentbit log despite this.
Here is the current configuration excluding the data-lifecycle logs:
[INPUT]
Name tail
parser cri
Tag processing.log.*
Path /var/log/containers/*.log
Exclude_Path /var/log/containers/*fluent*.log,/var/log/containers/keycloak-?_iam_keycloak-*.log,/var/log/containers/apisix-*_networking_apisix-*.log,/var/log/containers/falco-?????_security_falco*.log,/var/log/containers/nmap-job-*.log,/var/log/containers/lifecycle-eviction-job-*processing_lifecycle-eviction-job*.log
DB /var/log/fluentbit_processing.db
DB.Sync Normal
Rotate_Wait 10
Buffer_Max_Size 1MB
Mem_Buf_Limit 10MB
But according to the following logs, it still tries to process the lifecycle data:
safescale gw-ops-cluster ~ kubectl logs -n logging fluent-bit-ww9tv -c fluent-bit | tail -5
[2022/07/15 13:09:46] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630480--1-w47z4_processing_lifecycle-eviction-job-b1dd4ef070eb28ac312efe9e5fad0494e019971e9804c3a340d8cafd03449fbf.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 13:09:46] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630720--1-kdk92_processing_lifecycle-eviction-job-531b27816fce7accb611e89fd581482adf207264a838ea26ba18c67c32220d09.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 13:10:46] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630240--1-58x8p_processing_lifecycle-eviction-job-f4b624c750efbb4b7da583be056d012c7f1618a19519d2adedddaf65edc42161.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 13:10:46] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630480--1-w47z4_processing_lifecycle-eviction-job-b1dd4ef070eb28ac312efe9e5fad0494e019971e9804c3a340d8cafd03449fbf.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 13:10:46] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630720--1-kdk92_processing_lifecycle-eviction-job-531b27816fce7accb611e89fd581482adf207264a838ea26ba18c67c32220d09.log requires a larger buffer size, lines are too long. Skipping file.
I also notice there are logs on Thursday (the day the S3 data are processed) but not after. Can you confirm whether there are logs afterwards nonetheless?
S3 production occurred on Tuesday between 00:00 and 07:59; some errors were restarted on Wednesday. In the last session of 560 S3 L0 productions, only 197/560 preparation traces were ingested in the catalog, only 192/507 OK execution traces were ingested in the catalog, and only 24/53 NOK execution traces were ingested in the catalog.
Have you restarted the daemonset once you have updated the configuration?
S3 production occurred on Tuesday between 00:00 and 07:59; some errors were restarted on Wednesday.
I meant Tuesday, sorry. I believe these sessions are visible on the Grafana dashboard. On Tuesday, there are logs from the production trigger to the preparation worker, and logs between the preparation worker and the execution worker. On Wednesday, for the sessions that were relaunched, there are only logs between the preparation worker and the execution worker. Before restarting the fluentbit daemonset, can you check whether the logs preceding the execution job mention any issue with the S3 logs?
Have you restarted the daemonset once you have updated the configuration?
I confirm all fluentbit pods were automatically restarted by the ansible-playbook after the configuration update.
I meant Tuesday, sorry. I believe these sessions are visible on the Grafana dashboard. On Tuesday, there are logs from the production trigger to the preparation worker, and logs between the preparation worker and the execution worker. On Wednesday, for the sessions that were relaunched, there are only logs between the preparation worker and the execution worker.
Sorry, I don't understand. I detected missing S3 production data in Elasticsearch. In the attached file, I show a trace-based count of the number of trigger / preparation / execution worker traces. For me the trigger data were OK: I confirm there were 560 trigger executions. We should then have 560 preparation job executions, but we count only 197. At the end we should also have 560 OK executions, but we see only 192. During the session I detected more than 53 failed executions, and I see only 24 of them.
To identify the issue, I focused on a missing trace:
The product S3A_SR_0_SRA__20220712T015840_20220712T020840_20220713T150144_0599_087_245____LN1_D_NR_002.SEN3 was successfully generated by S3 L0 on pod s1pro-s3-l0-execution-worker-0 on 2022-07-13 at about 15:02:10.651000.
I found the trace manually on the pod:
safescale gw-ops-cluster ~ kubectl logs -n processing s1pro-s3-l0-execution-worker-0 -c s1pro-s3-l0-execution-worker | grep "2022-07-13" | grep "End job processing" | grep "\"status\":\"OK\"" | grep S3A_SR_0_SRA____20220712T015840_20220712T020840_20220713T150144_0599_087_245______LN1_D_NR_002
{"header":{"type":"REPORT","timestamp":"2022-07-13T15:02:10.651000Z","level":"INFO","mission":"S3","workflow":"NOMINAL"},"message":{"content":"End job processing"},"task":{"uid":"6bdacebe-7b47-4c33-a03a-377dd37091fc","name":"JobProcessing","event":"END","status":"OK","output":{"debug":false,"filename_strings":["S3A_SR_0_SRA____20220712T015840_20220712T020840_20220713T150144_0599_087_245______LN1_D_NR_002.SEN3"]},"input":{"filename_strings":["S3A_SR_0_SRA__G_20220712T015858_20220712T020858_20220712T030708_0600______________SVL_O_NR_OPE.ISIP","S3A_AX___OSF_AX_20160216T192404_99991231T235959_20220330T090651___________________EUM_O_AL_001.SEN3","S3A_AX___FRO_AX_20220708T000000_20220718T000000_20220711T065229___________________EUM_O_AL_001.SEN3","JobOrder.9006.xml"],"job_order_id_string":"JobOrder.9006.xml","ipf_release_string":"06.12"},"quality":{},"error_code":0,"duration_in_seconds":49.554}}
But I could not find the production trace in Elasticsearch.
Before restarting the fluentbit daemonset, can you check whether the logs preceding the execution job mention any issue with the S3 logs?
As far as I remember, I did not see any specific error whose time period matches my S3 IPF executions.
Today, I restarted an ACQ in error, which generated some S3_L0 products. 59 new S3 L0 products were successfully generated, but there is no production trace.
Yet there are 59 productions done:
rs0:PRIMARY> db.mqiMessage.find({"lastSendDate":{"$gte": new ISODate("2022-07-17T23:59:59Z")},"group":"l0-s3-ipf-execution-worker","state":"ACK_OK"},{"dto.keyObjectStorage":1}).count();
59
rs0:PRIMARY> db.mqiMessage.find({"lastSendDate":{"$gte": new ISODate("2022-07-17T23:59:59Z")},"group":"l0-s3-ipf-execution-worker","state":"ACK_OK"},{"dto.keyObjectStorage":1,"lastSendDate":1});
{ "_id" : NumberLong(2320214), "lastSendDate" : ISODate("2022-07-18T11:35:44.532Z"), "dto" : { "keyObjectStorage" : "S3B_GN_0_GNS__G_20220712T034414_20220712T052638_20220718T112900_6144______________SVL_O_NR_OPE.ISIP" } }
{ "_id" : NumberLong(2320360), "lastSendDate" : ISODate("2022-07-18T11:36:22.934Z"), "dto" : { "keyObjectStorage" : "S3B_SR_0_SRA__G_20220712T035414_20220712T040414_20220718T112952_0600______________SVL_O_NR_OPE.ISIP" } }
[...]
There are no new logs on fluentbit (the last error dates from last Friday):
safescale gw-ops-cluster ~ kubectl logs -n logging fluent-bit-ww9tv -c fluent-bit | tail -5
[2022/07/15 15:31:47] [ warn] [output:kafka:kafka.7] message delivery failed: Local: Unknown partition
[2022/07/15 15:31:47] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 14ms in state APIVERSION_QUERY, 4 identical error(s) suppressed)
[2022/07/15 15:46:00] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/15 16:04:13] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 95ms in state UP)
[2022/07/15 16:32:11] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
There are indeed connection errors on the fluentbit pod for the execution worker (node 22). It seems fluentbit hasn't been able to connect to Kafka, which, as a result, has made it unable to relay the traces. Not all fluentbit pods have this issue. Could you restart the fluentbit daemonset? It should eliminate some possibilities.
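For example (a sketch, assuming the daemonset is named fluent-bit in the logging namespace, as the commands above suggest):
kubectl -n logging rollout restart daemonset/fluent-bit
kubectl -n logging rollout status daemonset/fluent-bit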
Some updates were stuck due to a lack of CPU on some nodes (maybe the configuration needs to be updated). After cordoning/uncordoning nodes and killing some pods, fluentbit was restarted on all nodes. Here are the errors found on the node which hosts the s3_l0 worker and the data-lifecycle job:
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/07/18 16:25:21] [error] [parser] parser named 'cri' already exists, skip.
[2022/07/18 16:25:24] [error] [output:kafka:kafka.0] fluent-bit#producer-1: [thrd:app]: fluent-bit#producer-1: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 1042ms in state UP)
[2022/07/18 16:25:26] [error] [output:kafka:kafka.0] fluent-bit#producer-1: [thrd:app]: fluent-bit#producer-1: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 310ms in state UP, 1 identical error(s) suppressed)
[2022/07/18 16:25:28] [error] [output:kafka:kafka.1] fluent-bit#producer-2: [thrd:app]: fluent-bit#producer-2: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 0ms in state UP)
[2022/07/18 16:25:29] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/07/18 16:25:29] [error] [output:kafka:kafka.1] fluent-bit#producer-2: [thrd:app]: fluent-bit#producer-2: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 576ms in state UP, 1 identical error(s) suppressed)
[2022/07/18 16:25:30] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/07/18 16:25:30] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/07/18 16:25:31] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/07/18 16:25:32] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/07/18 16:25:32] [ warn] [input] tail.0 paused (mem buf overlimit)
[...]
The warning is not an issue. Fluentbit pauses the tail while it flushes out the buffer then starts again.
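If the pauses become frequent, buffering to the filesystem instead of memory lets fluentbit absorb bursts without pausing the tail; a minimal sketch using the standard fluent-bit storage options (the path and limit values are assumptions):
[SERVICE]
storage.path /var/log/flb-storage/
storage.sync normal
storage.backlog.mem_limit 16M
[INPUT]
Name tail
# plus the existing tail options shown earlier
storage.type filesystem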
It seems traces are now reported again on Grafana; however, there is still some loss. Is there a way for you to identify the traces that are lost (using Kibana, for instance), to determine how the fluentbit configuration should be changed?
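As an illustration of how the lost traces could be identified, a count query of this kind against the trace index could be compared with the MongoDB counts; this is a sketch in which the Elasticsearch URL and the processing-traces-* index name are placeholders, while the message.content and header.timestamp fields are taken from the trace shown above:
curl -s "http://elasticsearch:9200/processing-traces-*/_count" -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "message.content": "End job processing" } },
        { "range": { "header.timestamp": { "gte": "2022-07-13T00:00:00Z", "lt": "2022-07-14T00:00:00Z" } } }
      ]
    }
  }
}'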
IVV_CCB_2022_w28: Moved to Under analysis. The root cause is not understood. A workaround is currently applied. We need more inputs to identify the scenario of the bug. @suberti-ads: Please remove the applied WA, which seems not to have any effect.
Here is a lost trace. Unfortunately, there is logrotate on the pod, so I can't retrieve the lost trace itself. This time it is on S1 AIO: there are 5 AIO OK, but only 4 traces have been ingested in the trace catalog:
rs0:PRIMARY> db.mqiMessage.find({"lastSendDate":{"$gte": new ISODate("2022-07-24T23:59:59Z")},"group":"l0-aio-ipf-execution-worker"},{"dto.keyObjectStorage":1,"state":1});
{ "_id" : NumberLong(2406590), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725010722044250" } }
{ "_id" : NumberLong(2407898), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725024554044251" } }
{ "_id" : NumberLong(2410955), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725042413044252" } }
{ "_id" : NumberLong(2411664), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725043407044252" } }
{ "_id" : NumberLong(2411899), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725044105044252" } }
The trace is missing for session S1A_20220725043407044252.
Here are the logs found for this execution:
bash-4.2$ cat logfile.log | grep S1A_20220725043407044252 | head -1
2022-07-25T05:52:55.268 | INFO | e.s.c.i.e.w.s.JobProcessor [pool-2-thread-1]: Initializing job processing {id: 2411664, inputKey: t-pdgs-aio-execution-jobs, body: IpfExecutionJob [productFamily=L0_JOB, keyObjectStorage=S1A_20220725043407044252, creationDate=Mon Jul 25 05:52:54 UTC 2022, hostname=s1pro-l0-aio-ipf-preparation-worker-0, productProcessMode=NOMINAL, workDirectory=/data/localWD/10002/, jobOrder=/data/localWD/10002/JobOrder.10002.xml, timeliness=, inputs=[{family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml, contentRef: S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF, contentRef: S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF, contentRef: S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF}, {family: JOB_ORDER, localPath: /data/localWD/10002/JobOrder.10002.xml, contentRef: <?xml version="1.0" encoding="UTF-8"?>
bash-4.2$ cat logfile.log | grep 2411664
2022-07-25T05:52:55.268 | INFO | e.s.c.i.e.w.s.JobProcessor [pool-2-thread-1]: Initializing job processing {id: 2411664, inputKey: t-pdgs-aio-execution-jobs, body: IpfExecutionJob [productFamily=L0_JOB, keyObjectStorage=S1A_20220725043407044252, creationDate=Mon Jul 25 05:52:54 UTC 2022, hostname=s1pro-l0-aio-ipf-preparation-worker-0, productProcessMode=NOMINAL, workDirectory=/data/localWD/10002/, jobOrder=/data/localWD/10002/JobOrder.10002.xml, timeliness=, inputs=[{family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml, contentRef: S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF, contentRef: S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF, contentRef: S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF}, {family: JOB_ORDER, localPath: /data/localWD/10002/JobOrder.10002.xml, contentRef: <?xml version="1.0" encoding="UTF-8"?>
2022-07-25T06:03:53.476 | INFO | e.s.c.m.c.MqiConsumer [pool-2-thread-1]: MqiConsumer [category=LEVEL_JOBS] handled 2411664 successfully, done!
The pod was hosted on the following node, and no error log is present on fluentbit:
safescale gw-ops-cluster ~ kubectl get po -n processing -o wide | grep aio
s1pro-l0-aio-ipf-execution-worker-0 3/3 Running 0 10d 10.244.9.26 ops-cluster-node-21 <none> <none>
s1pro-l0-aio-ipf-preparation-worker-0 3/3 Running 0 10d 10.244.240.42 ops-cluster-node-17 <none> <none>
s1pro-l0-aio-production-trigger-0 3/3 Running 0 10d 10.244.78.87 ops-cluster-node-3 <none> <none>
safescale gw-ops-cluster ~ kubectl get po -n logging -o wide | grep ops-cluster-node-21
fluent-bit-mhjpv 2/2 Running 0 4d7h 10.244.9.30 ops-cluster-node-21 <none> <none>
safescale gw-ops-cluster ~ kubectl logs -n logging fluent-bit-mhjpv -c fluent-bit
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/07/21 10:06:57] [error] [parser] parser named 'cri' already exists, skip.
Note: new occurrence. This time the behavior is the same as the one described in this issue. For the comment above I will create a specific issue later.
On S3 production, all processing traces from the s3_l0 preparation worker and the s3_l0 execution worker are currently missing.
This week there were 565 S3_L0 executions; all preparation and execution traces are missing.
Here is their state in mongo:
rs0:PRIMARY> db.appDataJob.find({"creationDate":{"$gte": new ISODate("2022-07-25T23:59:59Z")},"pod":"s1pro-s3-l0-ipf-preparation-worker-0","state" : "TERMINATED"}).count();
565
rs0:PRIMARY> db.mqiMessage.find({"lastSendDate":{"$gte": new ISODate("2022-07-25T23:59:59Z")},"group":"l0-s3-ipf-execution-worker","state":"ACK_OK"}).count();
565
Example for S3-L0: S3A_SR_0_SRA__20220726T042907_20220726T043907_20220726T054125_0599_088_061____LN1_D_NR_002.SEN3
Here is a trace that should have been ingested by the catalog:
safescale gw-ops-cluster ~ kubectl logs -n processing s1pro-s3-l0-execution-worker-1 -c s1pro-s3-l0-execution-worker | grep "End job processing" | grep S3A_SR_0_SRA____20220726T042907_20220726T043907_20220726T054125_0599_088_061______LN1_D_NR_002.SEN3
{"header":{"type":"REPORT","timestamp":"2022-07-26T05:41:57.280000Z","level":"INFO","mission":"S3","workflow":"NOMINAL"},"message":{"content":"End job processing"},"task":{"uid":"59addb12-1798-4b9b-be82-7107fa737259","name":"JobProcessing","event":"END","status":"OK","output":{"debug":false,"filename_strings":["S3A_SR_0_SRA____20220726T042907_20220726T043907_20220726T054125_0599_088_061______LN1_D_NR_002.SEN3"]},"input":{"filename_strings":["S3A_SR_0_SRA__G_20220726T042925_20220726T043925_20220726T053005_0600______________SVL_O_NR_OPE.ISIP","S3A_AX___OSF_AX_20160216T192404_99991231T235959_20220330T090651___________________EUM_O_AL_001.SEN3","S3A_AX___FRO_AX_20220722T000000_20220801T000000_20220725T065150___________________EUM_O_AL_001.SEN3","JobOrder.10386.xml"],"job_order_id_string":"JobOrder.10386.xml","ipf_release_string":"06.12"},"quality":{},"error_code":0,"duration_in_seconds":55.418999}}
No trace has been seen for the S3 execution worker since Jul 19, 2022 @ 08:45:27.109.
All impacted pods were hosted on the following node: ops-cluster-node-23
safescale gw-ops-cluster ~ kubectl get po -o wide -n processing | grep s3-l0
s1pro-s3-l0-execution-worker-0 3/3 Running 0 11d 10.244.140.144 ops-cluster-node-23 <none> <none>
s1pro-s3-l0-execution-worker-1 3/3 Running 0 11d 10.244.140.136 ops-cluster-node-23 <none> <none>
s1pro-s3-l0-ipf-preparation-worker-0 3/3 Running 0 8d 10.244.140.165 ops-cluster-node-23 <none> <none>
s1pro-s3-l0-production-trigger-0 3/3 Running 0 11d 10.244.136.77 ops-cluster-node-1 <none> <none>
Here is the fluentbit pod which should collect the logs and traces:
safescale gw-ops-cluster ~ kubectl get po -o wide -n logging | grep ops-cluster-node-23
fluent-bit-2qq4z 2/2 Running 0 5d6h 10.244.140.193 ops-cluster-node-23 <none> <none>
On this pod, apart from the following error repeating in a loop:
[2022/07/21 10:13:06] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27638880--1-xjwxh_processing_lifecycle-eviction-job-0f964a882f816acf8456c55456a8a9db5a89ee1baa5d03fe3a2ac837f1bc0d91.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/21 10:13:06] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27639360--1-tfj9q_processing_lifecycle-eviction-job-36fd27da2a1f65c319375a632aea83067921949cc80a81f7fddcabb098aaba30.log requires a larger buffer size, lines are too long. Skipping file.
[...]
we also had these issues:
safescale gw-ops-cluster ~ kubectl logs -n logging fluent-bit-2qq4z -c fluent-bit | grep -v "requires a larger buffer size"
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/07/21 10:12:06] [error] [parser] parser named 'cri' already exists, skip.
[2022/07/21 10:12:10] [error] [output:kafka:kafka.5] fluent-bit#producer-6: [thrd:app]: fluent-bit#producer-6: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 7ms in state UP)
[2022/07/21 10:12:11] [error] [output:kafka:kafka.5] fluent-bit#producer-6: [thrd:app]: fluent-bit#producer-6: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 142ms in state UP, 1 identical error(s) suppressed)
[2022/07/21 11:29:55] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 11:40:31] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 11:41:44] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 12:16:55] [error] [output:kafka:kafka.6] fluent-bit#producer-7: [thrd:app]: fluent-bit#producer-7: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 43ms in state UP)
[2022/07/21 12:55:20] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 13:11:36] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 15:30:17] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 16:24:20] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 16:44:39] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 16:47:07] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 16:58:09] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[...]
Note: there is a stable lag on fluentbit. I don't know whether traces are stuck in this lag or what its impact is. I will check tomorrow whether my traces are still not received.
IVV_CCB_2022_w30: Accepted CS, priority major. Investigations are longer, so the priority is between minor and major, but we chose major because the RS core monitoring development should be able to rely on all available traces.
@suberti-ads Could you try to use the Exclude_Path once again, with the path /var/log/containers/*lifecycle-eviction*.log?
Before applying the WA, I saw the following logs on the fluentbit pod:
[2022/08/03 10:17:04] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:05] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full
[2022/08/03 10:17:05] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:05] [ warn] [engine] failed to flush chunk '1-1659521825.253348542.flb', retry in 11 seconds: task_id=11, input=tail.4 > output=kafka.7 (out_id=7)
[2022/08/03 10:17:06] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full
[2022/08/03 10:17:06] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:06] [error] [input:tail:tail.0] file=/var/log/containers/lifecycle-eviction-job-27648240--1-csn9z_processing_lifecycle-eviction-job-ec8cb73044f8fdf3fdd7aaf11ea76e7231005c8b3aad1cebac37084b76bb5a6c.log requires a larger buffer size, lines are too long. Skipping file.
[2022/08/03 10:17:06] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27648240--1-csn9z_processing_lifecycle-eviction-job-ec8cb73044f8fdf3fdd7aaf11ea76e7231005c8b3aad1cebac37084b76bb5a6c.log requires a larger buffer size, lines are too long. Skipping file.
[2022/08/03 10:17:07] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full
[2022/08/03 10:17:07] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:08] [ warn] [engine] chunk '1-1659521807.11339428.flb' cannot be retried: task_id=10, input=tail.4 > output=kafka.7
[2022/08/03 10:17:08] [ warn] [engine] failed to flush chunk '1-1659521828.465721474.flb', retry in 9 seconds: task_id=15, input=tail.4 > output=kafka.7 (out_id=7)
[2022/08/03 10:17:08] [ warn] [engine] failed to flush chunk '1-1659521817.846000867.flb', retry in 9 seconds: task_id=13, input=tail.4 > output=kafka.7 (out_id=7)
[2022/08/03 10:17:11] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full
[2022/08/03 10:17:11] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:12] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full
[2022/08/03 10:17:12] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:12] [ warn] [engine] failed to flush chunk '1-1659521832.650005727.flb', retry in 8 seconds: task_id=10, input=tail.4 > output=kafka.7 (out_id=7)
[2022/08/03 10:17:13] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full
It seems these errors were provoked by fluentbit being stuck on a trace.
The WA has been reapplied. New configuration:
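The "Local: Queue full" errors come from the librdkafka producer queue inside fluent-bit filling up faster than it can be flushed to Kafka. If excluding the oversized files is not enough, one option would be to give this queue more headroom by passing librdkafka properties through the kafka output; a sketch with assumed values (Brokers, Match and Topics are taken from what the existing setup appears to use):
[OUTPUT]
Name kafka
Match processing.log.*
Brokers kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092
Topics fluentbit.ingress
# Assumptions: enlarge the producer queue and let it buffer slightly longer
rdkafka.queue.buffering.max.messages 200000
rdkafka.queue.buffering.max.ms 1000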
safescale gw-ops-cluster ../rs-processing-common kubectl get cm -n logging fluent-bit -o yaml | grep -A 7 -B 3 "Tag processing.log.*"
[INPUT]
Name tail
parser cri
Tag processing.log.*
Path /var/log/containers/*.log
Exclude_Path /var/log/containers/*fluent*.log,/var/log/containers/keycloak-?_iam_keycloak-*.log,/var/log/containers/apisix-*_networking_apisix-*.log,/var/log/containers/falco-?????_security_falco*.log,/var/log/containers/nmap-job-*.log,/var/log/containers/lifecycle-eviction-job-*processing_lifecycle-eviction-job*.log
DB /var/log/fluentbit_processing.db
DB.Sync Normal
Rotate_Wait 10
Buffer_Max_Size 1MB
Mem_Buf_Limit 10MB
We will check next week whether traces are still missing on this node.
@suberti-ads The pattern you used is /var/log/containers/lifecycle-eviction-job-*processing_lifecycle-eviction-job*.log, but the one we advised is /var/log/containers/*lifecycle-eviction*.log. Could you apply this last one?
The errors appearing in the logs are likely linked to the overly long logs of the data-lifecycle pod.
Yes, sorry, I had applied the old one. Done; the new configuration is deployed:
safescale gw-ops-cluster ~ kubectl get cm -n logging fluent-bit -o yaml | grep -A 7 -B 3 "Tag processing.log.*"
[INPUT]
Name tail
parser cri
Tag processing.log.*
Path /var/log/containers/*.log
Exclude_Path /var/log/containers/*fluent*.log,/var/log/containers/keycloak-?_iam_keycloak-*.log,/var/log/containers/apisix-*_networking_apisix-*.log,/var/log/containers/falco-?????_security_falco*.log,/var/log/containers/nmap-job-*.log,/var/log/containers/*lifecycle-eviction*.log
DB /var/log/fluentbit_processing.db
DB.Sync Normal
Rotate_Wait 10
Buffer_Max_Size 1MB
Mem_Buf_Limit 10MB
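To check that the exclusion is effective once the pods pick up this configuration, something like the following can be used (the pod name is a placeholder for the fluent-bit pod on the affected node):
kubectl logs -n logging <fluent-bit-pod> -c fluent-bit | grep -c "requires a larger buffer size"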
New occurrence: no traces received since 19/07/2022.
Still on node ops-cluster-node-23:
safescale gw-ops-cluster ~ kubectl get po -n processing -o wide | grep s3 | grep l0
s1pro-s3-l0-execution-worker-0 3/3 Running 0 13d 10.244.140.251 ops-cluster-node-23 <none> <none>
s1pro-s3-l0-execution-worker-1 3/3 Running 0 13d 10.244.140.252 ops-cluster-node-23 <none> <none>
s1pro-s3-l0-ipf-preparation-worker-0 3/3 Running 0 13d 10.244.38.158 ops-cluster-node-25 <none> <none>
s1pro-s3-l0-production-trigger-0 3/3 Running 0 13d 10.244.93.98 ops-cluster-node-5 <none> <none>
Here are the logs found on fluentbit:
safescale gw-ops-cluster ~ kubectl logs -n logging fluent-bit-ftw8m -c fluent-bit | grep "processing_lifecycle-eviction-job" | wc -l
3804
safescale gw-ops-cluster ~ kubectl logs -n logging fluent-bit-ftw8m -c fluent-bit | grep "processing_lifecycle-eviction-job" | head -1
[2022/08/05 11:57:10] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27648240--1-csn9z_processing_lifecycle-eviction-job-ec8cb73044f8fdf3fdd7aaf11ea76e7231005c8b3aad1cebac37084b76bb5a6c.log requires a larger buffer size, lines are too long. Skipping file.
safescale gw-ops-cluster ~ kubectl logs -n logging fluent-bit-ftw8m -c fluent-bit | grep -v "processing_lifecycle-eviction-job"
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/08/05 11:57:06] [error] [parser] parser named 'cri' already exists, skip.
[2022/08/05 11:57:09] [error] [output:kafka:kafka.0] fluent-bit#producer-1: [thrd:app]: fluent-bit#producer-1: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 115ms in state UP)
[2022/08/05 11:57:10] [error] [output:kafka:kafka.0] fluent-bit#producer-1: [thrd:app]: fluent-bit#producer-1: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 29ms in state UP)
[2022/08/05 11:59:04] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 11ms in state UP)
[2022/08/05 15:32:44] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 16ms in state UP)
[2022/08/05 17:39:08] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 0ms in state UP)
[2022/08/05 18:17:52] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 4ms in state UP)
[2022/08/05 20:28:35] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 11ms in state UP)
[2022/08/05 21:26:44] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 21ms in state UP)
[2022/08/06 03:18:54] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/06 12:50:08] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 7ms in state UP)
[2022/08/06 13:29:11] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 15ms in state UP)
[2022/08/06 13:30:11] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 0ms in state UP, 1 identical error(s) suppressed)
[2022/08/06 16:11:35] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 1ms in state UP)
[2022/08/06 17:45:50] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 24ms in state UP)
[2022/08/06 21:23:32] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 0ms in state UP)
[2022/08/07 03:29:23] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 7ms in state UP)
[2022/08/07 03:54:14] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 25ms in state UP)
[2022/08/07 14:36:06] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 130ms in state UP)
[2022/08/07 14:51:09] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 19ms in state UP)
[2022/08/07 15:17:13] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/07 15:28:39] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/07 15:53:30] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/07 19:54:17] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/07 20:45:35] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/07 20:52:11] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 24ms in state UP)
[2022/08/07 20:54:11] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 13ms in state UP, 1 identical error(s) suppressed)
[2022/08/07 23:33:36] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/08 02:03:23] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
@suberti-ads Much work has been done on the reliability and performance of fluent-bit since version 0.5.0, including buffer and resource tuning, output failure recovery, and an update of fluent-bit to the latest 1.9.3, which also includes many fixes and improvements in the fluent-bit core. I recommend that you upgrade your complete fluent-bit deployment to the latest validated release, 0.9.0, before debugging any further. Be aware that the latest release of fluent-bit depends on the loki deployment from at least version 0.6.0.
@suberti-ads @pcuq-ads @Woljtek As demonstrated by Etienne, is the approach of waiting for the 0.9.0 platform right for you? It's a shame to spend a lot of time working on 0.5.0 when the new OPS platform is coming. Regards
@edepremare-cs OK, we follow your recommendation.
SYS_CCB_2023_w12: no more fluent-bit issues.
Environment:
Platform: RS V1.0.1 OPS Orange Cloud
Configuration: rs-config branch op based on 0.5.0-rc1 MIS / rs-processing-config 0.4.0-rc2
Current Behavior: Some traces have been lost.
Expected Behavior: All processing traces are collected by fluentbit and ingested into the trace catalog.
Steps To Reproduce: We suspect this issue is due to the data-lifecycle component:
Use the following documentation to deploy the cron job which deletes products regularly: https://github.com/COPRS/reference-system-documentation/blob/release/0.4.0/components/production%20common/Installation%20Manual.md#data-lifecycle-management
Here is the job template used to configure data-lifecycle:
Whenever possible, first analysis of the root cause: this issue seems to be linked with the data-lifecycle job.
We observed trace loss on the following s3-l0 execution and s3-l0 preparation workers:
These pods were on the same node (ops-cluster-node-23)
We note the data-lifecycle jobs were hosted on the same node:
On fluentbit we only see errors due to the lifecycle-eviction-job:
On node ops-cluster-node-23 the log size is > 2 MB.
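A quick way to confirm which of these files contain lines longer than the 1MB tail buffer is to print the longest line length per file (assumes GNU coreutils is available on the node):
wc -L /var/log/containers/*lifecycle-eviction*.log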
Here is the fluentbit configuration:
The logs look like this:
Here is the fluentbit log configuration:
I tried the following 2 solutions (see the sketch after this list):
Raise Buffer_Max_Size
In both cases I still saw errors in the fluentbit log.
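For reference, a sketch of what the first attempt (raising the tail buffer) looked like; the values here are illustrative, not the exact ones tried:
[INPUT]
Name tail
parser cri
# Assumed values for the buffer increase
Buffer_Chunk_Size 512KB
Buffer_Max_Size 5MB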
Bug Generic Definition of Ready (DoR)
Bug Generic Definition of Done (DoD)