COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, ivv bugs, epics ...)

[BUG] [OPS] Logs from lifecycle-eviction-job were not well managed by fluentbit #455

Closed suberti-ads closed 1 year ago

suberti-ads commented 2 years ago

Environment:
Platform: RS V1.0.1, OPS Orange Cloud
Configuration: rs-config branch op based on 0.5.0-rc1, MIS / rs-processing-config 0.4.0-rc2

Current Behavior: Some traces have been lost.

Expected Behavior:

Steps To Reproduce: We suspect this issue is caused by the data-lifecycle component:

Use the following documentation to deploy the cron job which deletes products regularly: https://github.com/COPRS/reference-system-documentation/blob/release/0.4.0/components/production%20common/Installation%20Manual.md#data-lifecycle-management

Hereafter the job template used to configure data-lifecycle:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: lifecycle-eviction-job
spec:
  schedule: "00 */4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: lifecycle-eviction-job
            image: docker.io/rancher/curlimages-curl:7.73.0
            imagePullPolicy: IfNotPresent
            command: ["/bin/sh", "-c"]
            args:
            - "curl --verbose -X DELETE -H 'Content-Type: application/json' -H 'ApiKey:XXXXXXXXXXXX' 'http://s1pro-datalifecycle-trigger-svc:8080/v1/products?operatorName=Automatic';
              curl -X POST -d 'hello=shutdown' 'http://localhost:4191/shutdown';"
          restartPolicy: Never

Whenever possible, first analysis of the root cause: this issue seems to be linked with the data-lifecycle job.

We observe lost traces for the following s3-l0 execution and s3-l0 preparation workers:

(screenshot: MissingTraceS3l0)

These pods were on the same node (ops-cluster-node-23)

 safescale  gw-ops-cluster  ~  kubectl get po -o wide -n processing  | grep s3-l0
s1pro-s3-l0-execution-worker-0                            3/3     Running     0              6d23h   10.244.140.227   ops-cluster-node-23   <none>           <none>
s1pro-s3-l0-execution-worker-1                            3/3     Running     0              6d23h   10.244.140.1     ops-cluster-node-23   <none>           <none>
s1pro-s3-l0-ipf-preparation-worker-0                      3/3     Running     0              6d23h   10.244.140.24    ops-cluster-node-23   <none>           <none>
s1pro-s3-l0-production-trigger-0                          3/3     Running     0              6d23h   10.244.93.253    ops-cluster-node-5    <none>           <none>

We note that the data-lifecycle jobs were hosted on the same node:

 safescale  gw-ops-cluster  ~  kubectl get po -o wide -A | grep ops-cluster-node-23
database          elasticsearch-processing-repository-config--1-cn5c5               0/1     Completed   0               90d     10.244.140.2      ops-cluster-node-23    <none>           <none>
database          prometheus-postgres-exporter-keycloak-5f4c879bb5-drgmv            2/2     Running     0               52d     10.244.140.233    ops-cluster-node-23    <none>           <none>
database          prometheus-postgres-exporter-scdf-dataflow-6f9b55794-x7z6t        2/2     Running     0               90d     10.244.140.6      ops-cluster-node-23    <none>           <none>
database          prometheus-postgres-exporter-scdf-skipper-5d7c9f6f46-q2lpn        2/2     Running     0               45d     10.244.140.242    ops-cluster-node-23    <none>           <none>
iam               openldap-prom-exporter-55cbcf4d4c-vl6bg                           2/2     Running     0               90d     10.244.140.5      ops-cluster-node-23    <none>           <none>
infra             cert-manager-webhook-ccb765b4-4pjsw                               2/2     Running     0               52d     10.244.140.228    ops-cluster-node-23    <none>           <none>
kube-system       calico-kube-controllers-7ff98576c9-sg48l                          1/1     Running     0               90d     192.168.32.210    ops-cluster-node-23    <none>           <none>
kube-system       calico-node-n2tgp                                                 1/1     Running     0               91d     192.168.32.210    ops-cluster-node-23    <none>           <none>
kube-system       kube-proxy-spqhh                                                  1/1     Running     0               51d     192.168.32.210    ops-cluster-node-23    <none>           <none>
kube-system       nginx-proxy-ops-cluster-node-23                                   1/1     Running     1 (91d ago)     91d     192.168.32.210    ops-cluster-node-23    <none>           <none>
kube-system       nodelocaldns-6nj5z                                                1/1     Running     0               91d     192.168.32.210    ops-cluster-node-23    <none>           <none>
logging           fluent-bit-nh4xz                                                  2/2     Running     0               16d     10.244.140.125    ops-cluster-node-23    <none>           <none>
monitoring        prometheus-stack-prometheus-node-exporter-9lnbg                   1/1     Running     0               90d     192.168.32.210    ops-cluster-node-23    <none>           <none>
networking        apisix-66c76f9c5-4mndm                                            2/2     Running     0               15d     10.244.140.146    ops-cluster-node-23    <none>           <none>
networking        apisix-dashboard-f7d9658fd-kpnjw                                  2/2     Running     4 (21d ago)     21d     10.244.140.39     ops-cluster-node-23    <none>           <none>
networking        apisix-dashboard-test-connection                                  0/1     Completed   0               23d     10.244.140.240    ops-cluster-node-23    <none>           <none>
networking        apisix-etcd-0                                                     2/2     Running     0               21d     10.244.140.219    ops-cluster-node-23    <none>           <none>
networking        apisix-ingress-controller-866bdd7ff-4xkzc                         2/2     Running     0               21d     10.244.140.37     ops-cluster-node-23    <none>           <none>
networking        apisix-ingress-controller-866bdd7ff-bdvwz                         2/2     Running     0               21d     10.244.140.54     ops-cluster-node-23    <none>           <none>
networking        linkerd-destination-59c7946695-pzzxk                              4/4     Running     0               21d     10.244.140.29     ops-cluster-node-23    <none>           <none>
networking        linkerd-identity-56c8796446-thzk8                                 2/2     Running     0               21d     10.244.140.28     ops-cluster-node-23    <none>           <none>
networking        tap-injector-846b5895c9-xjcbc                                     2/2     Running     0               52d     10.244.140.229    ops-cluster-node-23    <none>           <none>
processing        lifecycle-eviction-job-27630000--1-hvr2j                          0/2     Completed   0               20h     10.244.140.107    ops-cluster-node-23    <none>           <none>
processing        lifecycle-eviction-job-27630240--1-58x8p                          0/2     Completed   0               16h     10.244.140.102    ops-cluster-node-23    <none>           <none>
processing        lifecycle-eviction-job-27630480--1-w47z4                          0/2     Completed   0               12h     10.244.140.108    ops-cluster-node-23    <none>           <none>
processing        lifecycle-eviction-job-27630720--1-kdk92                          2/2     Running     0               8h      10.244.140.110    ops-cluster-node-23    <none>           <none>
processing        lifecycle-eviction-job-27630960--1-wgjwg                          2/2     Running     0               4h38m   10.244.140.111    ops-cluster-node-23    <none>           <none>
processing        lifecycle-eviction-job-27631200--1-s4k8j                          2/2     Running     0               38m     10.244.140.112    ops-cluster-node-23    <none>           <none>
processing        s1pro-s3-l0-execution-worker-0                                    3/3     Running     0               6d23h   10.244.140.227    ops-cluster-node-23    <none>           <none>
processing        s1pro-s3-l0-execution-worker-1                                    3/3     Running     0               6d23h   10.244.140.1      ops-cluster-node-23    <none>           <none>
processing        s1pro-s3-l0-ipf-preparation-worker-0                              3/3     Running     0               6d23h   10.244.140.24     ops-cluster-node-23    <none>           <none>
processing        s1pro-user-web-client-0                                           2/2     Running     0               7d15h   10.244.140.13     ops-cluster-node-23    <none>           <none>
rook-ceph         csi-rbdplugin-kkz7d                                               3/3     Running     0               78d     192.168.32.210    ops-cluster-node-23    <none>           <none>
rook-ceph         rook-ceph-operator-77957456f7-cx5lw                               2/2     Running     0               78d     10.244.140.123    ops-cluster-node-23    <none>           <none>
security          falco-bspk7                                                       2/2     Running     0               83d     10.244.140.86     ops-cluster-node-23    <none>           <none>
security          falco-exporter-mvz2s                                              2/2     Running     0               83d     10.244.140.87     ops-cluster-node-23    <none>           <none>
security          kibana-security-kb-667bbc6c4-vpjdc                                2/2     Running     0               63d     10.244.140.81     ops-cluster-node-23    <none>           <none>
security          nmap-job-27631080--1-x5dhf                                        0/1     Completed   0               158m    10.244.140.117    ops-cluster-node-23    <none>           <none>

On fluentbit we only see errors due to the lifecycle-eviction-job:

 safescale  gw-ops-cluster  ~  kubectl logs -n logging fluent-bit-nh4xz -c fluent-bit | tail -5
[2022/07/15 08:41:45] [error] [input:tail:tail.0] file=/var/log/containers/lifecycle-eviction-job-27630240--1-58x8p_processing_lifecycle-eviction-job-f4b624c750efbb4b7da583be056d012c7f1618a19519d2adedddaf65edc42161.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 08:41:45] [error] [input:tail:tail.0] file=/var/log/containers/lifecycle-eviction-job-27630480--1-w47z4_processing_lifecycle-eviction-job-b1dd4ef070eb28ac312efe9e5fad0494e019971e9804c3a340d8cafd03449fbf.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 08:41:45] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630000--1-hvr2j_processing_lifecycle-eviction-job-45864ad566f12641516fb1f0e465cf75d297aa33c22e65cdee9933ca7efa5b7c.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 08:41:45] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630240--1-58x8p_processing_lifecycle-eviction-job-f4b624c750efbb4b7da583be056d012c7f1618a19519d2adedddaf65edc42161.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 08:41:45] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630480--1-w47z4_processing_lifecycle-eviction-job-b1dd4ef070eb28ac312efe9e5fad0494e019971e9804c3a340d8cafd03449fbf.log requires a larger buffer size, lines are too long. Skipping file.

On node ops-cluster-node-23 the log size is > 2 MB.

Hereafter the corresponding log file on the node:

root@ops-cluster-node-23:/var/log/containers# ls -al lifecycle-eviction-job-27630480--1-w47z4_processing_lifecycle-eviction-job-b1dd4ef070eb28ac312efe9e5fad0494e019971e9804c3a340d8cafd03449fbf.log 
lrwxrwxrwx 1 root root 131 Jul 14 20:00 lifecycle-eviction-job-27630480--1-w47z4_processing_lifecycle-eviction-job-b1dd4ef070eb28ac312efe9e5fad0494e019971e9804c3a340d8cafd03449fbf.log -> /var/log/pods/processing_lifecycle-eviction-job-27630480--1-w47z4_fe11e39d-3037-4e6e-b334-aeb348efe97f/lifecycle-eviction-job/0.log
root@ops-cluster-node-23:/var/log/containers# cd /var/log/pods/processing_lifecycle-eviction-job-27630480--1-w47z4_fe11e39d-3037-4e6e-b334-aeb348efe97f/lifecycle-eviction-job/
root@ops-cluster-node-23:/var/log/pods/processing_lifecycle-eviction-job-27630480--1-w47z4_fe11e39d-3037-4e6e-b334-aeb348efe97f/lifecycle-eviction-job# du -h 0.log
2.7M    0.log
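
For reference, a quick way to confirm that single lines (and not only the file as a whole) exceed fluent-bit's 1MB tail buffer; this is a hedged check, assuming GNU coreutils are available on the node:

# Print the length of the longest line in the container log
wc -L 0.log
# Portable alternative without the GNU -L extension
awk '{ if (length($0) > max) max = length($0) } END { print max }' 0.log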

The logs look like this:

2022-07-14T20:00:02.307339348Z stderr F   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2022-07-14T20:00:02.30744767Z stderr F                                  Dload  Upload   Total   Spent    Left  Speed
2022-07-14T20:00:02.308127413Z stderr F ^M  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 10.109.239.160:8080...
2022-07-14T20:00:02.308191237Z stderr F * Connected to s1pro-datalifecycle-trigger-svc (10.109.239.160) port 8080 (#0)
2022-07-14T20:00:02.308202724Z stderr F > DELETE /v1/products?operatorName=Automatic HTTP/1.1
2022-07-14T20:00:02.308210108Z stderr F > Host: s1pro-datalifecycle-trigger-svc:8080
2022-07-14T20:00:02.308214627Z stderr F > User-Agent: curl/7.73.0-DEV
2022-07-14T20:00:02.308218894Z stderr F > Accept: */*
2022-07-14T20:00:02.308229534Z stderr F > Content-Type: application/json
2022-07-14T20:00:02.30823275Z stderr F > ApiKey:LdbEo2020tffcEGS
2022-07-14T20:00:02.308236219Z stderr F > 
2022-07-15T05:45:14.951756375Z stdout F shutdown
2022-07-15T05:45:14.946748958Z stderr F ^M  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0^M  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0^M  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--     0^M  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0^M  0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0^M  0     0    0     0    0     0      0      0 --:--:--  0:00:06 --:--:--
[...]
^M  0     0    0     0    0     0      0      0 --:--:--  9:45:11 --:--:--     0* Mark bundle as not supporting multiuse
2022-07-15T05:45:14.956566512Z stderr F < HTTP/1.1 200 OK
2022-07-15T05:45:14.956576433Z stderr F < date: Thu, 14 Jul 2022 20:00:02 GMT
2022-07-15T05:45:14.956581253Z stderr F < content-length: 0
2022-07-15T05:45:14.956584785Z stderr F < 
2022-07-15T05:45:14.956614341Z stderr F ^M  0     0    0     0    0     0      0      0 --:--:--  9:45:12 --:--:--     0^M  0     0    0     0    0     0      0      0 --:--:--  9:45:12 --:--:--     0
2022-07-15T05:45:14.956619981Z stderr F * Connection #0 to host s1pro-datalifecycle-trigger-svc left intact
2022-07-15T05:45:14.956624509Z stderr F   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2022-07-15T05:45:14.956631174Z stderr F                                  Dload  Upload   Total   Spent    Left  Speed
2022-07-15T05:45:14.956634859Z stderr F ^M  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0^M100    23    0     9  100    14   9000  14000 --:--:-- --:--:-- --:--:-- 23000
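
The multi-megabyte line appears to come from curl's progress meter, which writes carriage-return-separated updates to stderr without any newline, so the CRI runtime records the whole 9h45 run as a single log line. As a hedged illustration only (not the job deployed on the platform), the CronJob's curl call could silence the progress meter:

# Sketch: same DELETE request with the progress meter disabled (-sS = silent but still show errors)
curl -sS -X DELETE -H 'Content-Type: application/json' -H 'ApiKey:XXXXXXXXXXXX' \
  'http://s1pro-datalifecycle-trigger-svc:8080/v1/products?operatorName=Automatic'
# curl >= 7.67 also offers --no-progress-meter, which keeps --verbose output but drops the meter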

Hereafter the fluentbit log configuration:

      [INPUT]
        Name             tail
        parser           cri
        Tag              processing.log.*
        Path             /var/log/containers/*.log
        Exclude_Path     /var/log/containers/*fluent*.log,/var/log/containers/keycloak-?_iam_keycloak-*.log,/var/log/containers/apisix-*_networking_apisix-*.log,/var/log/containers/falco-?????_security_falco*.log,/var/log/containers/nmap-job-*.log
        DB               /var/log/fluentbit_processing.db
        DB.Sync          Normal
        Rotate_Wait      10
        Buffer_Max_Size  1MB
        Mem_Buf_Limit    10MB
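
For context, a hedged sketch (not the deployed configuration) of the two tail options that, per the fluent-bit documentation, govern the "requires a larger buffer size" error; with Skip_Long_Lines On, only the oversized lines are dropped instead of the whole file being skipped:

      [INPUT]
        Name             tail
        parser           cri
        Tag              processing.log.*
        Path             /var/log/containers/*.log
        # Either raise the per-line buffer above the ~2.7 MB curl line ...
        Buffer_Max_Size  4MB
        # ... or drop only the oversized lines and keep tailing the file
        Skip_Long_Lines  On
        Mem_Buf_Limit    10MB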

I tried the following 2 solutions:

Bug Generic Definition of Done (DoD)

wruffine-csgroup commented 2 years ago

From what I understand, your theory is that the lifecycle logs are too long and as a result, the processing traces were not taken into account by fluentbit. Am I correct?

suberti-ads commented 2 years ago

Yes, and due to these overly long logs, other traces on the same node were lost.

wruffine-csgroup commented 2 years ago

With my current understanding of fluentbit, I would say that when a file is too large, fluentbit skips it, but doesn't skip the other files that it is supposed to process. I would say it is more likely that the S3 traces are not picked up by fluentbit at all. Could you try excluding the data-lifecycle logs from fluentbit to check whether the other logs are still skipped by fluentbit?

I also notice there are logs during Thursday (the day the S3 data are processed) and not after. Can you confirm there are logs afterwards nonetheless?

suberti-ads commented 2 years ago

I don't understand; I tried both workarounds (WA) but I still see errors in the fluentbit logs despite this.

Hereafter the current configuration to exclude the data-lifecycle logs:

      [INPUT]
        Name             tail
        parser           cri
        Tag              processing.log.*
        Path             /var/log/containers/*.log
        Exclude_Path     /var/log/containers/*fluent*.log,/var/log/containers/keycloak-?_iam_keycloak-*.log,/var/log/containers/apisix-*_networking_apisix-*.log,/var/log/containers/falco-?????_security_falco*.log,/var/log/containers/nmap-job-*.log,/var/log/containers/lifecycle-eviction-job-*processing_lifecycle-eviction-job*.log
        DB               /var/log/fluentbit_processing.db
        DB.Sync          Normal
        Rotate_Wait      10
        Buffer_Max_Size  1MB
        Mem_Buf_Limit    10MB

but according to the following logs it still tries to process the lifecycle data:

 safescale  gw-ops-cluster  ~  kubectl logs -n logging fluent-bit-ww9tv -c fluent-bit | tail -5
[2022/07/15 13:09:46] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630480--1-w47z4_processing_lifecycle-eviction-job-b1dd4ef070eb28ac312efe9e5fad0494e019971e9804c3a340d8cafd03449fbf.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 13:09:46] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630720--1-kdk92_processing_lifecycle-eviction-job-531b27816fce7accb611e89fd581482adf207264a838ea26ba18c67c32220d09.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 13:10:46] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630240--1-58x8p_processing_lifecycle-eviction-job-f4b624c750efbb4b7da583be056d012c7f1618a19519d2adedddaf65edc42161.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 13:10:46] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630480--1-w47z4_processing_lifecycle-eviction-job-b1dd4ef070eb28ac312efe9e5fad0494e019971e9804c3a340d8cafd03449fbf.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/15 13:10:46] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27630720--1-kdk92_processing_lifecycle-eviction-job-531b27816fce7accb611e89fd581482adf207264a838ea26ba18c67c32220d09.log requires a larger buffer size, lines are too long. Skipping file.
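
A hedged way to double-check, directly on the node, which files each Exclude_Path glob actually matches (both patterns are taken from this thread; the simpler one is suggested later):

# Pattern currently configured
ls /var/log/containers/lifecycle-eviction-job-*processing_lifecycle-eviction-job*.log
# Simpler pattern suggested later in this thread
ls /var/log/containers/*lifecycle-eviction*.log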

I also notice there are logs during Thursday (the day the S3 data are processed) and not after. Can you confirm there are logs afterwards nonetheless?

S3 production occurred on Tuesday between 00:00 and 07:59; some errors were restarted on Wednesday. For the last session of 560 S3 L0 productions, only 197/560 preparation traces were ingested in the catalog, only 192/507 OK execution traces were ingested, and only 24/53 NOK execution traces were ingested.

wruffine-csgroup commented 2 years ago

Have you restarted the daemonset once you have updated the configuration?

S3 production occurred on Tuesday between 00:00 and 07:59; some errors were restarted on Wednesday.

I meant, Tuesday, sorry. I believe these sessions are seen on the grafana dashboard. On Tuesday, there are logs from the production trigger to the preparation worker, and logs between the preparation worker and the execution worker. On Wednesday, for the sessions that were relaunched, there are only logs between the preparation worker and the execution worker. Before restarting the fluentbit daemonset, can you check whether the logs prior to the execution job show any issue with the S3 logs?

suberti-ads commented 2 years ago

Have you restarted the daemonset once you have updated the configuration?

I confirm all fluentbit pods were automatically restarted after the configuration update by the ansible playbook.
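
One possible way to verify that the running pods really carry the new Exclude_Path (the /fluent-bit/etc/ path is an assumption based on the default fluent-bit image layout):

kubectl -n logging exec daemonset/fluent-bit -c fluent-bit -- sh -c 'grep -r "Exclude_Path" /fluent-bit/etc/'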

I meant, Tuesday, sorry. I believe these sessions are seen on the grafana dashboard. On Tuesday, there are logs from the production trigger to the preparation worker, and logs between the preparation worker and the execution worker. On Wednesday, for the sessions that were relaunched, there are only logs between the preparation worker and the execution worker.

Sorry, I don't understand. I detected missing data in Elasticsearch for the S3 production. In the attached file, I show trace counts for the trigger, preparation and execution workers. For me the trigger data were OK: I confirm there were 560 trigger executions. After that we should have 560 preparation job executions, but we count only 197. At the end we should also have 560 OK executions, but we see only 192. During the session I detected more than 53 failed executions, and I see only 24 of them.

To identify the issue, I focus on a missing trace:

For product S3A_SR_0_SRA__20220712T015840_20220712T020840_20220713T150144_0599_087_245____LN1_D_NR_002.SEN3, successfully generated by S3 L0 on pod s1pro-s3-l0-execution-worker-0 on 2022-07-13 at about 15:02:10.651000:

I found the trace manually on the pod:

 safescale  gw-ops-cluster  ~  kubectl logs -n processing s1pro-s3-l0-execution-worker-0 -c s1pro-s3-l0-execution-worker  | grep "2022-07-13" | grep "End job processing" | grep "\"status\":\"OK\"" | grep S3A_SR_0_SRA____20220712T015840_20220712T020840_20220713T150144_0599_087_245______LN1_D_NR_002
{"header":{"type":"REPORT","timestamp":"2022-07-13T15:02:10.651000Z","level":"INFO","mission":"S3","workflow":"NOMINAL"},"message":{"content":"End job processing"},"task":{"uid":"6bdacebe-7b47-4c33-a03a-377dd37091fc","name":"JobProcessing","event":"END","status":"OK","output":{"debug":false,"filename_strings":["S3A_SR_0_SRA____20220712T015840_20220712T020840_20220713T150144_0599_087_245______LN1_D_NR_002.SEN3"]},"input":{"filename_strings":["S3A_SR_0_SRA__G_20220712T015858_20220712T020858_20220712T030708_0600______________SVL_O_NR_OPE.ISIP","S3A_AX___OSF_AX_20160216T192404_99991231T235959_20220330T090651___________________EUM_O_AL_001.SEN3","S3A_AX___FRO_AX_20220708T000000_20220718T000000_20220711T065229___________________EUM_O_AL_001.SEN3","JobOrder.9006.xml"],"job_order_id_string":"JobOrder.9006.xml","ipf_release_string":"06.12"},"quality":{},"error_code":0,"duration_in_seconds":49.554}}

But I did not find the production trace in Elasticsearch (screenshot: S3executionMissingtrace).

Before restarting the fluentbit daemonset, can you check whether the logs prior to the execution job show any issue with the S3 logs?

As far as I remember, I don't see any specific error whose time period matches my S3 IPF execution.

suberti-ads commented 2 years ago

Today, I restarted an ACQ error which generated some S3_L0 products. 59 new S3 L0 products were successfully generated but there is no production trace (screenshot: ACQ_L0Generation).

But there were 59 productions done:

rs0:PRIMARY> db.mqiMessage.find({"lastSendDate":{"$gte": new ISODate("2022-07-17T23:59:59Z")},"group":"l0-s3-ipf-execution-worker","state":"ACK_OK"},{"dto.keyObjectStorage":1}).count();
59
rs0:PRIMARY> db.mqiMessage.find({"lastSendDate":{"$gte": new ISODate("2022-07-17T23:59:59Z")},"group":"l0-s3-ipf-execution-worker","state":"ACK_OK"},{"dto.keyObjectStorage":1,"lastSendDate":1});
{ "_id" : NumberLong(2320214), "lastSendDate" : ISODate("2022-07-18T11:35:44.532Z"), "dto" : { "keyObjectStorage" : "S3B_GN_0_GNS__G_20220712T034414_20220712T052638_20220718T112900_6144______________SVL_O_NR_OPE.ISIP" } }
{ "_id" : NumberLong(2320360), "lastSendDate" : ISODate("2022-07-18T11:36:22.934Z"), "dto" : { "keyObjectStorage" : "S3B_SR_0_SRA__G_20220712T035414_20220712T040414_20220718T112952_0600______________SVL_O_NR_OPE.ISIP" } }
[...]

There are no new logs on fluentbit (the last errors date from last Friday):

  safescale  gw-ops-cluster  ~  kubectl logs -n logging fluent-bit-ww9tv -c fluent-bit | tail -5
[2022/07/15 15:31:47] [ warn] [output:kafka:kafka.7] message delivery failed: Local: Unknown partition
[2022/07/15 15:31:47] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 14ms in state APIVERSION_QUERY, 4 identical error(s) suppressed)
[2022/07/15 15:46:00] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/15 16:04:13] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 95ms in state UP)
[2022/07/15 16:32:11] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error

wruffine-csgroup commented 2 years ago

There are indeed connection errors on the fluentbit pod for the execution worker (node 22). It seems fluentbit hasn't been able to connect to kafka, which, as a result, has made it unable to relay the traces. Not all fluentbit pods have this issue. Could you restart the fluentbit daemonset? It should eliminate some possibilities.
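
A minimal sketch of such a restart, assuming the daemonset is named fluent-bit like its pods and configmap:

kubectl -n logging rollout restart daemonset fluent-bit
kubectl -n logging rollout status daemonset fluent-bit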

suberti-ads commented 2 years ago

Some updates were stuck due to a lack of CPU on some nodes (the configuration may need to be updated). After cordoning/uncordoning the node and killing some pods, fluentbit was restarted on all nodes. Hereafter the errors found on the node which hosts the s3_l0 worker and the data-lifecycle job:

Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/07/18 16:25:21] [error] [parser] parser named 'cri' already exists, skip.
[2022/07/18 16:25:24] [error] [output:kafka:kafka.0] fluent-bit#producer-1: [thrd:app]: fluent-bit#producer-1: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 1042ms in state UP)
[2022/07/18 16:25:26] [error] [output:kafka:kafka.0] fluent-bit#producer-1: [thrd:app]: fluent-bit#producer-1: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 310ms in state UP, 1 identical error(s) suppressed)
[2022/07/18 16:25:28] [error] [output:kafka:kafka.1] fluent-bit#producer-2: [thrd:app]: fluent-bit#producer-2: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 0ms in state UP)
[2022/07/18 16:25:29] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/07/18 16:25:29] [error] [output:kafka:kafka.1] fluent-bit#producer-2: [thrd:app]: fluent-bit#producer-2: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 576ms in state UP, 1 identical error(s) suppressed)
[2022/07/18 16:25:30] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/07/18 16:25:30] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/07/18 16:25:31] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/07/18 16:25:32] [ warn] [input] tail.0 paused (mem buf overlimit)
[2022/07/18 16:25:32] [ warn] [input] tail.0 paused (mem buf overlimit)
[...]

wruffine-csgroup commented 2 years ago

The warning is not an issue. Fluentbit pauses the tail while it flushes out the buffer then starts again.

It seems traces are now reported again on Grafana, however, there is still a loss. Is there a way for you to define the traces that are lost (by using Kibana for instance), to determine how the fluentbit configuration should be changed?

Woljtek commented 2 years ago

IVV_CCB_2022_w28: Move to Under analysis. The root cause is not understood. A workaround is currently applied. We need more inputs to identify the scenarios of the bug. @suberti-ads: Please remove the applied WA, which seems not to have any effect.

suberti-ads commented 2 years ago

Hereafter a lost trace. Unfortunately there is logrotate on the pod, so I can't retrieve the lost trace. This time on S1 AIO: there are 5 AIO OK but only 4 traces have been ingested in the trace catalog:

rs0:PRIMARY> db.mqiMessage.find({"lastSendDate":{"$gte": new ISODate("2022-07-24T23:59:59Z")},"group":"l0-aio-ipf-execution-worker"},{"dto.keyObjectStorage":1,"state":1});
{ "_id" : NumberLong(2406590), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725010722044250" } }
{ "_id" : NumberLong(2407898), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725024554044251" } }
{ "_id" : NumberLong(2410955), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725042413044252" } }
{ "_id" : NumberLong(2411664), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725043407044252" } }
{ "_id" : NumberLong(2411899), "state" : "ACK_OK", "dto" : { "keyObjectStorage" : "S1A_20220725044105044252" } }

Missing trace for session S1A_20220725043407044252

(screenshots: AIOTraceLost, AIOTraceLost2)

Hereafter the logs found for this execution:

bash-4.2$ cat  logfile.log  | grep S1A_20220725043407044252 | head -1
2022-07-25T05:52:55.268 | INFO  | e.s.c.i.e.w.s.JobProcessor [pool-2-thread-1]: Initializing job processing {id: 2411664, inputKey: t-pdgs-aio-execution-jobs, body: IpfExecutionJob [productFamily=L0_JOB, keyObjectStorage=S1A_20220725043407044252, creationDate=Mon Jul 25 05:52:54 UTC 2022, hostname=s1pro-l0-aio-ipf-preparation-worker-0, productProcessMode=NOMINAL, workDirectory=/data/localWD/10002/, jobOrder=/data/localWD/10002/JobOrder.10002.xml, timeliness=, inputs=[{family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml, contentRef: S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF, contentRef: S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF, contentRef: S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF}, {family: JOB_ORDER, localPath: /data/localWD/10002/JobOrder.10002.xml, contentRef: <?xml version="1.0" encoding="UTF-8"?>
bash-4.2$ cat  logfile.log  | grep 2411664                           
2022-07-25T05:52:55.268 | INFO  | e.s.c.i.e.w.s.JobProcessor [pool-2-thread-1]: Initializing job processing {id: 2411664, inputKey: t-pdgs-aio-execution-jobs, body: IpfExecutionJob [productFamily=L0_JOB, keyObjectStorage=S1A_20220725043407044252, creationDate=Mon Jul 25 05:52:54 UTC 2022, hostname=s1pro-l0-aio-ipf-preparation-worker-0, productProcessMode=NOMINAL, workDirectory=/data/localWD/10002/, jobOrder=/data/localWD/10002/JobOrder.10002.xml, timeliness=, inputs=[{family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml, contentRef: S1A_OPER_AUX_OBMEMC_PDMC_20140201T000000.xml}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF, contentRef: S1A_OPER_MPL_ORBPRE_20220721T021252_20220728T021252_0001.EOF}, {family: AUXILIARY_FILE, localPath: /data/localWD/10002/S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF, contentRef: S1A_OPER_MPL_ORBSCT_20140507T150704_99999999T999999_0024.EOF}, {family: JOB_ORDER, localPath: /data/localWD/10002/JobOrder.10002.xml, contentRef: <?xml version="1.0" encoding="UTF-8"?>
2022-07-25T06:03:53.476 | INFO  | e.s.c.m.c.MqiConsumer [pool-2-thread-1]: MqiConsumer [category=LEVEL_JOBS] handled 2411664 successfully, done! 

The pod was hosted on node ops-cluster-node-21 and there is no error log on its fluentbit pod:

 safescale  gw-ops-cluster  ~  kubectl get po -n processing -o wide | grep aio
s1pro-l0-aio-ipf-execution-worker-0                       3/3     Running     0             10d     10.244.9.26      ops-cluster-node-21   <none>           <none>
s1pro-l0-aio-ipf-preparation-worker-0                     3/3     Running     0             10d     10.244.240.42    ops-cluster-node-17   <none>           <none>
s1pro-l0-aio-production-trigger-0                         3/3     Running     0             10d     10.244.78.87     ops-cluster-node-3    <none>           <none>
 safescale  gw-ops-cluster  ~  kubectl get po -n logging -o wide | grep ops-cluster-node-21
fluent-bit-mhjpv   2/2     Running   0             4d7h   10.244.9.30      ops-cluster-node-21    <none>           <none>
 safescale  gw-ops-cluster  ~  kubectl logs -n logging fluent-bit-mhjpv -c fluent-bit 
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/07/21 10:06:57] [error] [parser] parser named 'cri' already exists, skip.

Note:

suberti-ads commented 2 years ago

New occurrence: this time the behavior is the same as the one described in this issue. For the above comment I will create a specific issue later.

On S3 production, all processing traces from the s3_l0 preparation and s3_l0 execution workers are currently missing:

This week there were 565 S3_L0 executions. All preparation and execution traces are missing:

(screenshot: S3_l0Trace_missing)

Hereafter the state for it in mongo:

rs0:PRIMARY> db.appDataJob.find({"creationDate":{"$gte": new ISODate("2022-07-25T23:59:59Z")},"pod":"s1pro-s3-l0-ipf-preparation-worker-0","state" : "TERMINATED"}).count();
565
rs0:PRIMARY> db.mqiMessage.find({"lastSendDate":{"$gte": new ISODate("2022-07-25T23:59:59Z")},"group":"l0-s3-ipf-execution-worker","state":"ACK_OK"}).count();
565

Example for S3-L0: S3A_SR_0_SRA__20220726T042907_20220726T043907_20220726T054125_0599_088_061____LN1_D_NR_002.SEN3

Hereafter the trace that should have been ingested by the catalog:

 safescale  gw-ops-cluster  ~  kubectl logs -n processing s1pro-s3-l0-execution-worker-1  -c s1pro-s3-l0-execution-worker | grep "End job processing" |  grep S3A_SR_0_SRA____20220726T042907_20220726T043907_20220726T054125_0599_088_061______LN1_D_NR_002.SEN3
{"header":{"type":"REPORT","timestamp":"2022-07-26T05:41:57.280000Z","level":"INFO","mission":"S3","workflow":"NOMINAL"},"message":{"content":"End job processing"},"task":{"uid":"59addb12-1798-4b9b-be82-7107fa737259","name":"JobProcessing","event":"END","status":"OK","output":{"debug":false,"filename_strings":["S3A_SR_0_SRA____20220726T042907_20220726T043907_20220726T054125_0599_088_061______LN1_D_NR_002.SEN3"]},"input":{"filename_strings":["S3A_SR_0_SRA__G_20220726T042925_20220726T043925_20220726T053005_0600______________SVL_O_NR_OPE.ISIP","S3A_AX___OSF_AX_20160216T192404_99991231T235959_20220330T090651___________________EUM_O_AL_001.SEN3","S3A_AX___FRO_AX_20220722T000000_20220801T000000_20220725T065150___________________EUM_O_AL_001.SEN3","JobOrder.10386.xml"],"job_order_id_string":"JobOrder.10386.xml","ipf_release_string":"06.12"},"quality":{},"error_code":0,"duration_in_seconds":55.418999}}

No trace has been seen for the s3 execution worker since Jul 19, 2022 @ 08:45:27.109 (screenshot: NotraceexecutionWorker).

All impacted pods were hosted on the following node: ops-cluster-node-23

 safescale  gw-ops-cluster  ~  kubectl get po -o wide -n processing | grep  s3-l0
s1pro-s3-l0-execution-worker-0                            3/3     Running     0             11d     10.244.140.144   ops-cluster-node-23   <none>           <none>
s1pro-s3-l0-execution-worker-1                            3/3     Running     0             11d     10.244.140.136   ops-cluster-node-23   <none>           <none>
s1pro-s3-l0-ipf-preparation-worker-0                      3/3     Running     0             8d      10.244.140.165   ops-cluster-node-23   <none>           <none>
s1pro-s3-l0-production-trigger-0                          3/3     Running     0             11d     10.244.136.77    ops-cluster-node-1    <none>           <none>

Hereafter the fluentbit pod which should collect these logs and traces:

safescale  gw-ops-cluster  ~  kubectl get po -o wide -n logging | grep ops-cluster-node-23
fluent-bit-2qq4z   2/2     Running   0             5d6h   10.244.140.193   ops-cluster-node-23    <none>           <none>

On this pod, apart from the following looping error:

[2022/07/21 10:13:06] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27638880--1-xjwxh_processing_lifecycle-eviction-job-0f964a882f816acf8456c55456a8a9db5a89ee1baa5d03fe3a2ac837f1bc0d91.log requires a larger buffer size, lines are too long. Skipping file.
[2022/07/21 10:13:06] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27639360--1-tfj9q_processing_lifecycle-eviction-job-36fd27da2a1f65c319375a632aea83067921949cc80a81f7fddcabb098aaba30.log requires a larger buffer size, lines are too long. Skipping file.
[...]

we had these issues:

  safescale  gw-ops-cluster  ~  kubectl logs -n logging fluent-bit-2qq4z -c fluent-bit | grep -v "requires a larger buffer size" 
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/07/21 10:12:06] [error] [parser] parser named 'cri' already exists, skip.
[2022/07/21 10:12:10] [error] [output:kafka:kafka.5] fluent-bit#producer-6: [thrd:app]: fluent-bit#producer-6: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 7ms in state UP)
[2022/07/21 10:12:11] [error] [output:kafka:kafka.5] fluent-bit#producer-6: [thrd:app]: fluent-bit#producer-6: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 142ms in state UP, 1 identical error(s) suppressed)
[2022/07/21 11:29:55] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 11:40:31] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 11:41:44] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 12:16:55] [error] [output:kafka:kafka.6] fluent-bit#producer-7: [thrd:app]: fluent-bit#producer-7: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 43ms in state UP)
[2022/07/21 12:55:20] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 13:11:36] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 15:30:17] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 16:24:20] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 16:44:39] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 16:47:07] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/07/21 16:58:09] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[...]

Note: there is a stable lag on fluentbit. I don't know whether traces are stuck in this lag or what its impact is. I will check tomorrow whether my traces are still not received (screenshot: Fluentbitlag).
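
A hedged way to inspect that lag from the Kafka side (the pod, container and tool path are assumptions based on a standard Strimzi deployment; the consumer group name is hypothetical and must be replaced):

kubectl -n infra exec kafka-cluster-kafka-0 -c kafka -- /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
kubectl -n infra exec kafka-cluster-kafka-0 -c kafka -- /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <trace-collector-group>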

LAQU156 commented 2 years ago

IVV_CCB_2022_w30: Accepted CS, priority major. Investigations are taking longer; the priority is between minor and major, but we chose major because RS core monitoring development should rely on all available traces.

wruffine-csgroup commented 2 years ago

@suberti-ads Could you try to use the Exclude_Path once again with the path /var/log/containers/*lifecycle-eviction*.log?

suberti-ads commented 2 years ago

Before applying the WA I see the following logs on the fluentbit pod:

[2022/08/03 10:17:04] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:05] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full

[2022/08/03 10:17:05] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:05] [ warn] [engine] failed to flush chunk '1-1659521825.253348542.flb', retry in 11 seconds: task_id=11, input=tail.4 > output=kafka.7 (out_id=7)
[2022/08/03 10:17:06] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full

[2022/08/03 10:17:06] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:06] [error] [input:tail:tail.0] file=/var/log/containers/lifecycle-eviction-job-27648240--1-csn9z_processing_lifecycle-eviction-job-ec8cb73044f8fdf3fdd7aaf11ea76e7231005c8b3aad1cebac37084b76bb5a6c.log requires a larger buffer size, lines are too long. Skipping file.
[2022/08/03 10:17:06] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27648240--1-csn9z_processing_lifecycle-eviction-job-ec8cb73044f8fdf3fdd7aaf11ea76e7231005c8b3aad1cebac37084b76bb5a6c.log requires a larger buffer size, lines are too long. Skipping file.
[2022/08/03 10:17:07] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full

[2022/08/03 10:17:07] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:08] [ warn] [engine] chunk '1-1659521807.11339428.flb' cannot be retried: task_id=10, input=tail.4 > output=kafka.7
[2022/08/03 10:17:08] [ warn] [engine] failed to flush chunk '1-1659521828.465721474.flb', retry in 9 seconds: task_id=15, input=tail.4 > output=kafka.7 (out_id=7)
[2022/08/03 10:17:08] [ warn] [engine] failed to flush chunk '1-1659521817.846000867.flb', retry in 9 seconds: task_id=13, input=tail.4 > output=kafka.7 (out_id=7)
[2022/08/03 10:17:11] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full

[2022/08/03 10:17:11] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:12] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full

[2022/08/03 10:17:12] [ warn] [output:kafka:kafka.7] internal queue is full, retrying in one second
[2022/08/03 10:17:12] [ warn] [engine] failed to flush chunk '1-1659521832.650005727.flb', retry in 8 seconds: task_id=10, input=tail.4 > output=kafka.7 (out_id=7)
[2022/08/03 10:17:13] [error] % Failed to produce to topic fluentbit.ingress: Local: Queue full

It seems these errors were provoked by fluentbit being stuck on a trace.
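
For reference, a hedged sketch (not the deployed configuration) of how the kafka output's internal queue could be enlarged: fluent-bit forwards any property prefixed with "rdkafka." to librdkafka, whose producer queue is what reports "Queue full":

      [OUTPUT]
        Name                                  kafka
        # Match tag, broker list and topic below are placeholders or values taken from the logs above
        Match                                 processing.log.*
        Brokers                               <kafka-bootstrap>:9092
        Topics                                fluentbit.ingress
        rdkafka.queue.buffering.max.messages  200000
        rdkafka.queue.buffering.max.kbytes    2097151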

The WA has been reapplied. New configuration:

 safescale  gw-ops-cluster  ../rs-processing-common  kubectl get cm -n logging fluent-bit -o yaml | grep -A 7 -B 3 "Tag              processing.log.*" 
    [INPUT]
      Name             tail
      parser           cri
      Tag              processing.log.*
      Path             /var/log/containers/*.log
      Exclude_Path     /var/log/containers/*fluent*.log,/var/log/containers/keycloak-?_iam_keycloak-*.log,/var/log/containers/apisix-*_networking_apisix-*.log,/var/log/containers/falco-?????_security_falco*.log,/var/log/containers/nmap-job-*.log,/var/log/containers/lifecycle-eviction-job-*processing_lifecycle-eviction-job*.log
      DB               /var/log/fluentbit_processing.db
      DB.Sync          Normal
      Rotate_Wait      10
      Buffer_Max_Size  1MB
      Mem_Buf_Limit    10MB

We will observe next week whether traces are still missing on this node.

wruffine-csgroup commented 2 years ago

@suberti-ads The pattern you used is /var/log/containers/lifecycle-eviction-job-*processing_lifecycle-eviction-job*.log, but the one we advised is /var/log/containers/*lifecycle-eviction*.log. Could you apply this last one?

The errors appearing in the logs are likely linked to the too long logs of the datalifecycle pod.

suberti-ads commented 2 years ago

Yes, sorry, I applied the old one. Done, the new configuration is deployed:

 safescale  gw-ops-cluster  ~  kubectl get cm -n logging fluent-bit -o yaml | grep -A 7 -B 3 "Tag              processing.log.*" 
    [INPUT]
      Name             tail
      parser           cri
      Tag              processing.log.*
      Path             /var/log/containers/*.log
      Exclude_Path     /var/log/containers/*fluent*.log,/var/log/containers/keycloak-?_iam_keycloak-*.log,/var/log/containers/apisix-*_networking_apisix-*.log,/var/log/containers/falco-?????_security_falco*.log,/var/log/containers/nmap-job-*.log,/var/log/containers/*lifecycle-eviction*.log
      DB               /var/log/fluentbit_processing.db
      DB.Sync          Normal
      Rotate_Wait      10
      Buffer_Max_Size  1MB
      Mem_Buf_Limit    10MB

suberti-ads commented 2 years ago

New occurrence: no trace received since 19/07/2022 (screenshot: LastTraceS3L0).

Still on node ops-cluster-node-23:

 safescale  gw-ops-cluster  ~  kubectl get po -n processing -o wide | grep s3 | grep l0
s1pro-s3-l0-execution-worker-0                            3/3     Running     0             13d     10.244.140.251   ops-cluster-node-23   <none>           <none>
s1pro-s3-l0-execution-worker-1                            3/3     Running     0             13d     10.244.140.252   ops-cluster-node-23   <none>           <none>
s1pro-s3-l0-ipf-preparation-worker-0                      3/3     Running     0             13d     10.244.38.158    ops-cluster-node-25   <none>           <none>
s1pro-s3-l0-production-trigger-0                          3/3     Running     0             13d     10.244.93.98     ops-cluster-node-5    <none>           <none>

Hereafter the logs found on fluentbit:

safescale  gw-ops-cluster  ~  kubectl logs -n logging fluent-bit-ftw8m -c fluent-bit | grep "processing_lifecycle-eviction-job" | wc -l
3804

 safescale  gw-ops-cluster  ~  kubectl logs -n logging fluent-bit-ftw8m -c fluent-bit | grep "processing_lifecycle-eviction-job" | head -1
[2022/08/05 11:57:10] [error] [input:tail:tail.1] file=/var/log/containers/lifecycle-eviction-job-27648240--1-csn9z_processing_lifecycle-eviction-job-ec8cb73044f8fdf3fdd7aaf11ea76e7231005c8b3aad1cebac37084b76bb5a6c.log requires a larger buffer size, lines are too long. Skipping file.

 safescale  gw-ops-cluster  ~  kubectl logs -n logging fluent-bit-ftw8m -c fluent-bit | grep -v "processing_lifecycle-eviction-job" 
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/08/05 11:57:06] [error] [parser] parser named 'cri' already exists, skip.
[2022/08/05 11:57:09] [error] [output:kafka:kafka.0] fluent-bit#producer-1: [thrd:app]: fluent-bit#producer-1: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 115ms in state UP)
[2022/08/05 11:57:10] [error] [output:kafka:kafka.0] fluent-bit#producer-1: [thrd:app]: fluent-bit#producer-1: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 29ms in state UP)
[2022/08/05 11:59:04] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 11ms in state UP)
[2022/08/05 15:32:44] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 16ms in state UP)
[2022/08/05 17:39:08] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 0ms in state UP)
[2022/08/05 18:17:52] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 4ms in state UP)
[2022/08/05 20:28:35] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 11ms in state UP)
[2022/08/05 21:26:44] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 21ms in state UP)
[2022/08/06 03:18:54] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/06 12:50:08] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 7ms in state UP)
[2022/08/06 13:29:11] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 15ms in state UP)
[2022/08/06 13:30:11] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 0ms in state UP, 1 identical error(s) suppressed)
[2022/08/06 16:11:35] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 1ms in state UP)
[2022/08/06 17:45:50] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 24ms in state UP)
[2022/08/06 21:23:32] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 0ms in state UP)
[2022/08/07 03:29:23] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 7ms in state UP)
[2022/08/07 03:54:14] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-2.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/2: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 25ms in state UP)
[2022/08/07 14:36:06] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 130ms in state UP)
[2022/08/07 14:51:09] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-0.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/0: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 19ms in state UP)
[2022/08/07 15:17:13] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/07 15:28:39] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/07 15:53:30] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/07 19:54:17] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/07 20:45:35] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/07 20:52:11] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 24ms in state UP)
[2022/08/07 20:54:11] [error] [output:kafka:kafka.7] fluent-bit#producer-8: [thrd:app]: fluent-bit#producer-8: kafka-cluster-kafka-1.kafka-cluster-kafka-brokers.infra.svc.cluster.local:9092/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after 13ms in state UP, 1 identical error(s) suppressed)
[2022/08/07 23:33:36] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error
[2022/08/08 02:03:23] [error] [filter:kubernetes:kubernetes.11] kubelet upstream connection error

edepremare-cs commented 2 years ago

@suberti-ads Much work has been done on the reliability and performance of fluent-bit since version 0.5.0, including buffer and resource tuning, output failure recovery and an update of fluent-bit to the latest 1.9.3, which also includes many fixes and improvements in the core of fluent-bit. I recommend that you upgrade your complete fluent-bit deployment to the latest validated release 0.9.0 before debugging any further. Be aware that the latest release of fluent-bit depends on the loki deployment from at least version 0.6.0.

cbouisson-csgroup commented 2 years ago

@suberti-ads @pcuq-ads @Woljtek As demonstrated by Etienne, is the approach of waiting for the 0.9.0 platform right for you? It's a shame to spend a lot of time working on 0.5.0 when the new OPS platform is coming. Regards

Woljtek commented 2 years ago

@edepremare-cs OK, we follow your recommendation.

pcuq-ads commented 1 year ago

SYS_CCB_2023_w12: no more fluent-bit issues.