fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0
5.85k stars 1.58k forks

Fluent-bit logs skip issue : scan_blog add(): dismissed #4014

Open mitesh-devops opened 3 years ago

mitesh-devops commented 3 years ago

I would like to request some help with the following Fluent Bit logs, which are causing log skips for many of our apps. It works correctly for a few hours after a restart of Fluent Bit.

logs 1:

[2021/08/27 08:01:29] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/container-name-1627776000-g52f7_prod-apps_container-name-4e462ab4ec40adb0c42bece9c054e16466ba444afd1f0c2634be03e3b86613d1.log, inode 533287
[2021/08/27 08:01:29] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/container-name-1630049400-w2hdn_prod-apps_container-name-8cea19f5e062b0d9b267f5ac3db0f8d74b0e690c62c5448a8b6dd16c2df1195f.log, inode 1605932

Frequency: around 15-20 entries every 5 minutes when a container starts; after a few hours these are the only logs Fluent Bit prints and all log collection stops.

logs 2:

[2021/08/27 08:01:42] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1606300 watch_fd=26
[2021/08/27 08:01:42] [debug] [input:tail:tail.0] inode=1605129 events: IN_ATTRIB 
[2021/08/27 08:01:42] [debug] [input:tail:tail.0] inode=1605129 file has been deleted: /var/log/containers/container-name-1630049400-g7mr8_prod-apps_container-name-7fcddb34d4164fbca29a342c75430d5f30ba0fb14e1ccdd183dbac82ef70e784.log
[2021/08/27 08:01:42] [debug] [input:tail:tail.0] db: file deleted from database: /var/log/containers/container-name-1630049400-g7mr8_prod-apps_container-name-7fcddb34d4164fbca29a342c75430d5f30ba0fb14e1ccdd183dbac82ef70e784.log

Frequency: very few times.

log 3:

[2021/08/27 12:30:34] [debug] [socket] could not validate socket status for #146 (don't worry)
[2021/08/27 12:30:57] [debug] [socket] could not validate socket status for #49 (don't worry)

Frequency: sometimes only.

The rest of the logs look normal, and data is being gathered and written to Elasticsearch correctly.

Cluster Information:

  1. Kubernetes: around 20 nodes running the fluent-bit DaemonSet.
  2. Elasticsearch: 5 master-slave nodes
     - Master nodes: 28 GB RAM and 3 CPU cores per node
     - Elasticsearch version: 7.9.2
  3. Fluent Bit version: 1.7

Expected behaviour: It shouldn't skip logs and should keep working without needing a restart after some time.

fluent-bit.conf

apiVersion: v1
data:
  filter-kubernetes.conf: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           Off
        Buffer_Size         0
        #Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
  fluent-bit.conf: |
    [SERVICE]
        Flush         3
        #Grace         120
        Log_Level     debug
        Daemon        off
        Parsers_File  parsers.conf
        #storage.path  /var/log/flb-storage/
        #storage.sync  normal
        #storage.checksum off
        #storage.backlog.mem_limit 20M
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    @INCLUDE input-kubernetes.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-elasticsearch.conf
  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Exclude_Path      /var/log/containers/*_kube-system_*.log,/var/log/containers/*_log_*.log,/var/log/containers/*_mog_*.log,/var/log/containers/*_pd_*.log,/var/log/containers/*_kubecost_*.log,/var/log/containers/*_elastic-system_*.log,/var/log/containers/*_istio-*.log,/var/log/containers/*_cs_*.log,/var/log/containers/*_kube-public_*.log,/var/log/containers/*_ss_*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     512MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        # Control the log line length
        Buffer_Chunk_Size 10MB
        Buffer_Max_Size   40MB
        # Using the docker mode to deal with multiline messages emitted by docker
        Docker_Mode       On
        #storage.type      filesystem
        Rotate_Wait      30
  output-elasticsearch.conf: |
    [OUTPUT]
        Name            es
        Match           *
        Host            ${FLUENT_ELASTICSEARCH_HOST}
        Port            ${FLUENT_ELASTICSEARCH_PORT}
        HTTP_User       ${FLUENT_ELASTICSEARCH_USER}
        HTTP_Passwd     ${FLUENT_ELASTICSEARCH_PASSWORD}
        Logstash_Format On
        Logstash_Prefix newprod
        Replace_Dots    On
        Retry_Limit     False
        tls             On
        tls.verify      Off
  parsers.conf: |
    [PARSER]
        Name   apache
        Format regex
        Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   apache2
        Format regex
        Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   apache_error
        Format regex
        Regex  ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$

    [PARSER]
        Name   nginx
        Format regex
        Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   json
        Format json
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

    [PARSER]
        # http://rubular.com/r/tjUt3Awgg4
        Name cri
        Format regex
        Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

    [PARSER]
        Name        syslog
        Format      regex
        Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key    time
        Time_Format %b %d %H:%M:%S

fluent-bit.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
  labels:
    k8s-app: fluent-bit-logging
    kubernetes.io/cluster-service: "true"
    version: v1
  name: fluent-bit
  namespace: logging
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: fluent-bit-logging
  template:
    metadata:
      annotations:
        prometheus.io/path: /api/v1/metrics/prometheus
        prometheus.io/port: "2020"
        prometheus.io/scrape: "true"
      labels:
        k8s-app: fluent-bit-logging
        kubernetes.io/cluster-service: "true"
        version: v1
    spec:
      containers:
      - env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: ******
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        - name: FLUENT_ELASTICSEARCH_USER
          value: *******
        - name: FLUENT_ELASTICSEARCH_PASSWORD
          value: ********
        - name: FLUENT_ELASTICSEARCH_SCHEME
          value: https
        image: fluent/fluent-bit:1.7
        imagePullPolicy: Always
        name: fluent-bit
        ports:
        - containerPort: 2020
          protocol: TCP
        resources:
          requests:
            cpu: 250m
            memory: 750Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log
          name: varlog
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
        - mountPath: /fluent-bit/etc/
          name: fluent-bit-config
      nodeSelector:
        logging: "true"
      restartPolicy: Always
      securityContext: {}
      serviceAccount: fluent-bit
      serviceAccountName: fluent-bit
      terminationGracePeriodSeconds: 10
      volumes:
      - hostPath:
          path: /var/log
          type: ""
        name: varlog
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: varlibdockercontainers
      - configMap:
          defaultMode: 420
          name: fluent-bit-config
        name: fluent-bit-config

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 3 years ago

This issue was closed because it has been stalled for 5 days with no activity.

rguske commented 3 years ago

Hi @mitesh-devops, were you able to solve this issue? I'm facing the same, see #4155

kc-dot-io commented 2 years ago

Same here.

soanni commented 2 years ago

same here

mitesh-devops commented 2 years ago

Hi, actually it turned out to be more of an issue with the output plugin. In my case the output plugin was Elasticsearch, so sizing the shards correctly and adding rollover ILM policies fixed it for me. Also, try adding more RAM while diagnosing.
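
If the root cause is on the Elasticsearch side, as described above, a hedged way to confirm it from Fluent Bit is to enable error tracing on the es output so that rejected bulk requests show up in the Fluent Bit log. A minimal sketch based on the output block from the ConfigMap above; Trace_Error is a standard es output option, but check that your Fluent Bit version supports it:

[OUTPUT]
    Name            es
    Match           *
    Host            ${FLUENT_ELASTICSEARCH_HOST}
    Port            ${FLUENT_ELASTICSEARCH_PORT}
    Logstash_Format On
    Logstash_Prefix newprod
    Replace_Dots    On
    # Print Elasticsearch API errors (e.g. bulk requests rejected because of
    # shard or ILM misconfiguration) to the Fluent Bit log for diagnosis.
    Trace_Error     On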

joebatish commented 2 years ago

Any solution regarding this?

joebatish commented 2 years ago

@kc-dot-io Did you find a solution for this?

tr2vil commented 2 years ago

I had the same issue, but I think I may have found the main factor: Refresh_Interval, Buffer_Max_Size and Buffer_Chunk_Size seem to be the key settings. After I adjusted that configuration, the same error no longer appeared.
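
For reference, a sketch of where those settings live in the tail input; the exact values that worked are not stated above, so the numbers below are placeholders only and have to be tuned to your own log volume and line lengths:

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Parser            docker
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     512MB
    Skip_Long_Lines   On
    # How often (in seconds) the path is rescanned for new or rotated files.
    Refresh_Interval  5
    # Keep the buffer sizes comfortably above the longest lines you expect,
    # otherwise lines are skipped or the file is dropped from monitoring.
    Buffer_Chunk_Size 1MB
    Buffer_Max_Size   8MB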

ankitkr commented 2 years ago

Has this been resolved? This seems to be related to the inode not being removed during the file removal process, i.e. the inode entry not being removed from files_static or files_event (https://github.com/fluent/fluent-bit/blob/master/plugins/in_tail/tail_config.h#L129, https://github.com/fluent/fluent-bit/blob/master/plugins/in_tail/tail_file.c#1108).

rguske commented 2 years ago

Hi all, it finally worked for our project after replacing Docker with containerd and applying the following config: https://github.com/vmware-samples/vcenter-event-broker-appliance/tree/development/files/configs/fluentbit/templates

https://github.com/fluent/fluent-bit/issues/4155#issuecomment-1046239927
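
For anyone taking the same route, a minimal sketch of what the tail input typically becomes after moving from Docker to containerd: CRI runtimes write plain-text CRI log lines rather than Docker JSON, so Docker_Mode and the docker parser no longer apply and the cri parser already defined in parsers.conf above is used instead. This is an assumption about the general containerd setup, not a copy of the linked config:

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    # containerd/CRI emits "<time> <stream> <logtag> <message>" lines,
    # matching the cri parser defined in parsers.conf.
    Parser            cri
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     512MB
    Skip_Long_Lines   On
    Refresh_Interval  10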

edsiper commented 2 years ago

There is a bug in v1.8.13. We have just published a new container image for v1.8.14, please upgrade:

fluent/fluent-bit:1.8.14

olastor commented 2 years ago

Same here (v1.9.1)

Lord-Y commented 2 years ago

Still having the issue with v1.9.2 too. any fix soon?

maiconbaum commented 2 years ago

Same in 1.9.3

ahvahsky2008 commented 2 years ago

Same issue guys

kaleyroy commented 2 years ago

Same in v1.9.3

kaleyroy commented 2 years ago

[screenshot omitted]

RayMeibergen commented 2 years ago

Same in 1.9.4

CallumDowling commented 2 years ago

Any resolution for this? It processes a few logs then falls over. Docker image: fluent/fluent-bit:1.9.3-debug

Jonathan-w6d commented 2 years ago

Has anyone tried turning Skip_Long_Lines off? Some of my logs were dismissed because they were too long, I guess; this did the trick for me.
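
If you want to try the same change, a hedged sketch of the relevant lines; note that with Skip_Long_Lines Off a single line larger than Buffer_Max_Size stops that file from being monitored instead of being skipped, so the buffer sizes may need to grow with it (the values here are placeholders):

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    # Off means long lines are no longer silently dropped, but any line that
    # exceeds Buffer_Max_Size halts monitoring of that file.
    Skip_Long_Lines   Off
    Buffer_Chunk_Size 10MB
    Buffer_Max_Size   64MB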

kc-dot-io commented 2 years ago

@Jonathan-w6d yes. This will help for a while, depending on how much data you are processing. The root issue is a memory leak. Increasing the buffers / skipping long lines helps get it working, but the issue will still reoccur every so often. I have to restart our Fluent Bit pods every 90-100 days or so.
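
One mitigation worth noting: the [SERVICE] section in the ConfigMap above has filesystem storage commented out; enabling it moves chunk buffering from memory to disk, which can reduce memory pressure between restarts. A sketch based on those commented-out lines, not a confirmed fix for the leak:

[SERVICE]
    Flush         3
    Log_Level     info
    Daemon        off
    Parsers_File  parsers.conf
    # Buffer chunks on disk instead of holding everything in memory.
    storage.path  /var/log/flb-storage/
    storage.sync  normal
    storage.backlog.mem_limit 20M

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    # Per-input switch to filesystem buffering.
    storage.type  filesystem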

ashutosh-maheshwari commented 1 year ago

The same issue is in v2.0.10.

[2023/04/04 21:06:11] [debug] [input:tail:tail.3] scan_blog add(): dismissed: /logs/server.out, inode 277548
[2023/04/04 21:06:11] [debug] [input:tail:tail.4] scan_blog add(): dismissed: /logs/server.FATAL, inode 262809

curvesoft commented 1 year ago

The same issue is in v2.1.1.

[2023/05/22 23:34:52] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /tomcat-logs/platform-2023-05-22.debug.log, inode 57744220
[2023/05/22 23:34:52] [debug] [input:tail:tail.0] 0 new files found on path '/tomcat-logs/platform*.log'

gorexlv commented 1 year ago

Same issue, has this been resolved?

IbrahimMCode commented 1 year ago

Same issue using latest version

LiadBerko commented 1 year ago

I faced the same problem. Since the tail input only reads from the last offset of each file, make sure you are actually writing to the files you are reading from, that your filters are correct, and that the lines are not too long (because you have Skip_Long_Lines turned on). You can check for yourself by setting Read_from_Head to true under [INPUT].
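
A minimal sketch of that check: Read_from_Head On makes the tail input read files from the beginning rather than only from the stored offset, which helps verify that the path, parser and filters are correct, at the cost of re-ingesting existing data:

[INPUT]
    Name            tail
    Path            /var/log/containers/*.log
    Parser          docker
    # Read from the start of each file instead of the saved offset; useful
    # for verifying the pipeline, but existing log lines will be re-ingested.
    Read_from_Head  On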

Varrkan82 commented 1 year ago

Any updates? Because the same thing happened on v2.1.10

ardole commented 1 year ago

Still same issue here

omidraha commented 1 year ago

I have the same issue, see https://github.com/fluent/helm-charts/issues/415:

kubectl logs  -n fluent-bit -f fluent-bit-6341f45a-j8h2x | grep -i volume-test
[2023/10/27 17:04:05] [debug] [input:tail:tail.0] inode=1062251 with offset=2026 appended as /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log
[2023/10/27 17:04:05] [debug] [input:tail:tail.0] scan_glob add(): /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log, inode 1062251
[2023/10/27 17:04:05] [debug] [input:tail:tail.0] inode=1062251 file=/var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log promote to TAIL_EVENT
[2023/10/27 17:04:05] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1062251 watch_fd=23 name=/var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log
[2023/10/27 17:05:04] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log, inode 1062251
[2023/10/27 17:06:04] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log, inode 1062251
[2023/10/27 17:07:04] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log, inode 1062251
[2023/10/27 17:08:04] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/volume-test_exp_volume-test-2c25a3b1342924cffab9fffb48b2f0a971fcf9f10009e4c382a37bc09075134b.log, inode 1062251

nagyzekkyandras commented 9 months ago

Same here with version 2.2.1. Is there any solution?

rnemeth90 commented 6 months ago

Is anyone actually going to fix this? This issue needs to be re-opened.

wdonne commented 6 months ago

I still have this with version 3.0.3.

daniela-cyera commented 5 months ago

Please re-open the issue; it seems to be persistent in all recent versions.

jisorondo commented 5 months ago

Can we please get an update on this issue? Thanks!

Varrkan82 commented 4 months ago

Switched to Filebeat, as Fluent Bit is unusable in production due to this issue. Sorry guys, but 3 years of nothing is a little strange... @edsiper maybe you need to look at this?

leonardo-albertovich commented 2 months ago

Hi folks, I have created a branch where I've added some trace-level debug messages for code paths that could slip through the cracks, and I'm currently creating a set of container images that you'll be able to access at ghcr.io/fluent/fluent-bit/test/leonardo-master-issue_4014_debug_addition once the packaging process is complete.

I'd really appreciate it if those of you who are currently able to reproduce the issue took the time to either build from source or run those containers (setting log_level to trace) and share those logs with me (You can find me in the public slack server as Leonardo Alminana if you'd rather share that information privately) to expedite the troubleshooting process.

leonardo-albertovich commented 2 months ago

Quick update: the container images are already available, so please run them and report the results; it's really important that at least someone does.

rnemeth90 commented 2 months ago

I'll try to find some time this week to deploy the image, @leonardo-albertovich. Where are the logs output to? Stdout/stderr? Or can I define a sink?

leonardo-albertovich commented 2 months ago

The logs are produced in the standard way so you don't need to do anything special (other than setting log_level to trace)
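
For anyone running the debug images mentioned above, the only configuration change needed is the log level in the [SERVICE] section of the ConfigMap; a sketch based on the original config in this issue:

[SERVICE]
    Flush         3
    # trace is more verbose than debug, so the extra messages added in the
    # debug branch become visible.
    Log_Level     trace
    Daemon        off
    Parsers_File  parsers.conf
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_Port     2020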

srajappa commented 1 month ago

I am seeing the same issue

mohannara commented 3 weeks ago

I'm still getting the error below. However, the issue appeared when multiline processing was enabled. What could be the issue?

Image: public.ecr.aws/aws-observability/aws-for-fluent-bit:2.32.2.20241008 (Fluent Bit v1.9.10)

[2024/10/16 04:03:58] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/master-ms-deployment-5db5876558-w5gjj_master-ms-e2927e4cfeb05a89f229d7622edaefc750e48568d65415e983980535031ddd41.log, inode 75503248