fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Fluent-bit input plugin tail doesn't process all logs: scan_blog add(): dismissed: #4155

Closed: rguske closed this issue 5 months ago

rguske commented 3 years ago

Bug Report

Describe the bug Fluent Bit is not processing all logs located in /var/log/containers/.

To Reproduce The following messages are displayed:

[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scanning path /var/log/containers/*.log
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/activator-85cd6f6f9-nrncf_knative-serving_activator-3b631f27f6667599ae940f94afe6a65a4d1d488e7979fced513fa910082a5ae1.log, inode 404768
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/activator-85cd6f6f9-nrncf_knative-serving_activator-ca32320178170fe4198ce1b0bd57d8ea031c7c886a7b0e3d66bb1b78b67613b8.log, inode 921337
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/antrea-agent-gql5r_kube-system_antrea-agent-63659cdc8e5ddba3eaf729be280661b45fd198e6d2c7195965be85cdca81f41a.log, inode 536837
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/antrea-agent-gql5r_kube-system_antrea-agent-8726abf73577f597e15716176cfcdce442b159d00ec12f59e439719d824a9585.log, inode 1190181
[2021/10/01 14:40:05] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/log/containers/antrea-agent-gql5r_kube-system_antrea-ovs-08045b767f2f8ee421b3b4d8d5b646b49b4e12199ae957cad178dd3d11670ec6.log, inode 663855

ServiceAccount:

rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  - pods
  verbs:
  - get
  - list
  - watch

ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: vmware-system
  labels:
    k8s-app: fluent-bit
data:
  filter-kubernetes.conf: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
    [FILTER]
        Name                  modify
        Match                 kube.*
        Copy                  kubernetes k8s

    [FILTER]
        Name                  nest
        Match                 kube.*
        Operation             lift
        Nested_Under          kubernetes
  filter-record.conf: |
    [FILTER]
        Name                record_modifier
        Match               *
        Record tkg_cluster veba-demo.jarvis.tanzu
        Record tkg_instance veba-demo.jarvis.tanzu
    [FILTER]
        Name                  nest
        Match                 kube.*
        Operation             nest
        Wildcard              tkg_instance*
        Nest_Under            tkg

    [FILTER]
        Name                  nest
        Match                 kube_systemd.*
        Operation             nest
        Wildcard              SYSTEMD*
        Nest_Under            systemd
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     debug
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    @INCLUDE input-kubernetes.conf
    @INCLUDE input-systemd.conf
    @INCLUDE input-kube-apiserver.conf
    @INCLUDE input-auditd.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE filter-record.conf
    @INCLUDE output-syslog.conf
  input-auditd.conf: |
    [INPUT]
        Name              tail
        Tag               audit.*
        Path              /var/log/audit/audit.log
        Parser            logfmt
        DB                /var/log/flb_system_audit.db
        Mem_Buf_Limit     50MB
        Refresh_Interval  10
        Skip_Long_Lines   On
  input-kube-apiserver.conf: |
    [INPUT]
        Name              tail
        Tag               apiserver_audit.*
        Path              /var/log/kubernetes/audit.log
        Parser            json
        DB                /var/log/flb_kube_audit.db
        Mem_Buf_Limit     50MB
        Refresh_Interval  10
        Skip_Long_Lines   On
  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On
        Refresh_Interval  10
  input-systemd.conf: |
    [INPUT]
        Name                systemd
        Tag                 kube_systemd.*
        Path                /var/log/journal
        DB                  /var/log/flb_kube_systemd.db
        Systemd_Filter      _SYSTEMD_UNIT=kubelet.service
        Systemd_Filter      _SYSTEMD_UNIT=containerd.service
        Read_From_Tail      On
        Strip_Underscores   On
  output-syslog.conf: |
    [OUTPUT]
        Name   syslog
        Match  kube.*
        Host   10.197.79.57
        Port   514
        Mode   tcp
        Syslog_Format        rfc5424
        Syslog_Hostname_key  tkg_cluster
        Syslog_Appname_key   pod_name
        Syslog_Procid_key    container_name
        Syslog_Message_key   message
        Syslog_SD_key        k8s
        Syslog_SD_key        labels
        Syslog_SD_key        annotations
        Syslog_SD_key        tkg

    [OUTPUT]
        Name   syslog
        Match  kube_systemd.*
        Host   10.197.79.57
        Port   514
        Mode   tcp
        Syslog_Format        rfc5424
        Syslog_Hostname_key  tkg_cluster
        Syslog_Appname_key   tkg_instance
        Syslog_Message_key   MESSAGE
        Syslog_SD_key        systemd
  parsers.conf: |
    [PARSER]
        Name   apache
        Format regex
        Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   apache2
        Format regex
        Regex  ^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   apache_error
        Format regex
        Regex  ^\[[^ ]* (?<time>[^\]]*)\] \[(?<level>[^\]]*)\](?: \[pid (?<pid>[^\]]*)\])?( \[client (?<client>[^\]]*)\])? (?<message>.*)$

    [PARSER]
        Name   nginx
        Format regex
        Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name   json
        Format json
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

    [PARSER]
        Name        docker-daemon
        Format      regex
        Regex       time="(?<time>[^ ]*)" level=(?<level>[^ ]*) msg="(?<msg>[^ ].*)"
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

    [PARSER]
        # http://rubular.com/r/tjUt3Awgg4
        Name cri
        Format regex
        Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

    [PARSER]
        Name        logfmt
        Format      logfmt

    [PARSER]
        Name        syslog-rfc5424
        Format      regex
        Regex       ^\<(?<pri>[0-9]{1,5})\>1 (?<time>[^ ]+) (?<host>[^ ]+) (?<ident>[^ ]+) (?<pid>[-0-9]+) (?<msgid>[^ ]+) (?<extradata>(\[(.*)\]|-)) (?<message>.+)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

    [PARSER]
        Name        syslog-rfc3164-local
        Format      regex
        Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key    time
        Time_Format %b %d %H:%M:%S
        Time_Keep   On

    [PARSER]
        Name        syslog-rfc3164
        Format      regex
        Regex       /^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$/
        Time_Key    time
        Time_Format %b %d %H:%M:%S
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On

    [PARSER]
        Name    kube-custom
        Format  regex
        Regex   (?<tag>[^.]+)?\.?(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$

DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: vmware-system
  labels:
    k8s-app: fluent-bit
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: fluent-bit
  template:
    metadata:
      labels:
        k8s-app: fluent-bit
    spec:
      containers:
      - image: projects.registry.vmware.com/tkg/fluent-bit:v1.6.9_vmware.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 2020
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: fluent-bit
        ports:
        - containerPort: 2020
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v1/metrics/prometheus
            port: 2020
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 80m
            memory: 200Mi
          requests:
            cpu: 50m
            memory: 100Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log
          name: var-log
        - mountPath: /var/log/pods
          name: var-log-pods
        - mountPath: /var/log/containers
          name: var-log-containers
        - mountPath: /var/lib/docker/containers
          name: var-lib-docker-containers
          readOnly: true
        - mountPath: /fluent-bit/etc/
          name: fluent-bit-config
        - mountPath: /run/log
          name: systemd-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: fluent-bit
      serviceAccountName: fluent-bit
      terminationGracePeriodSeconds: 10
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /var/log
          type: ""
        name: var-log
      - hostPath:
          path: /var/log/pods
          type: ""
        name: var-log-pods
      - hostPath:
          path: /var/log/containers
          type: ""
        name: var-log-containers
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: var-lib-docker-containers
      - hostPath:
          path: /run/log
          type: ""
        name: systemd-log
      - configMap:
          defaultMode: 420
          name: fluent-bit-config
        name: fluent-bit-config
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate

Expected behavior All logs in /var/log/containers/ should be processed.

Your Environment

Additional context Running tail -f manually from within the system on a specific pod log, which is writing to stdout, works.

{"log":"10/03/2021 14:47:13 - Handler Processing Completed ...\n","stream":"stdout","time":"2021-10-03T14:47:13.829672574Z"}
{"log":"\n","stream":"stdout","time":"2021-10-03T14:47:13.829772103Z"}

Examples of logs which aren't processed:

root@veba-kn [ /var/log/containers ]# ls -rtl
total 376
lrwxrwxrwx 1 root root 100 Sep 13 21:31 antrea-agent-gql5r_kube-system_antrea-agent-8726abf73577f597e15716176cfcdce442b159d00ec12f59e439719d824a9585.log -> /var/log/pods/kube-system_antrea-agent-gql5r_31aa406a-286c-495b-9dcf-e4036c2a4756/antrea-agent/3.log
lrwxrwxrwx 1 root root  98 Sep 13 21:31 antrea-agent-gql5r_kube-system_antrea-ovs-3f300f1d7b28c069df1f34cf37ff89be95d69fc3dc4ea0f269b5bd07ce5d56c1.log -> /var/log/pods/kube-system_antrea-agent-gql5r_31aa406a-286c-495b-9dcf-e4036c2a4756/antrea-ovs/3.log
lrwxrwxrwx 1 root root 102 Sep 13 21:31 envoy-89vct_contour-external_shutdown-manager-c8ed97927c25d465f31cce5ab8bd91d02742504f8cf73ad53e493738d0a17f74.log -> /var/log/pods/contour-external_envoy-89vct_1c947a55-2b86-48bd-b442-c6c51ec2dd3a/shutdown-manager/3.log
lrwxrwxrwx 1 root root  91 Sep 13 21:31 envoy-89vct_contour-external_envoy-0ea7a33d12105058f74eae9653dd0266ac99ef2ba7f6cb3a3b04a8ec3bc02525.log -> /var/log/pods/contour-external_envoy-89vct_1c947a55-2b86-48bd-b442-c6c51ec2dd3a/envoy/3.log
lrwxrwxrwx 1 root root 104 Sep 13 21:31 contour-5869594b-7jm89_contour-external_contour-803e6591f657fae9539b64ae4f86fa44cce99b409c5f92979c6045cf4b98b52c.log -> /var/log/pods/contour-external_contour-5869594b-7jm89_cc6cf243-7d3f-4839-91e8-741ab87f6488/contour/3.log
lrwxrwxrwx 1 root root 106 Sep 13 21:31 contour-5d47766fd8-n24mz_contour-internal_contour-ae34a8ae0b8398da294c5061ec5c0ef1e9be8cb2979f07077e5e9df12f2bab67.log -> /var/log/pods/contour-internal_contour-5d47766fd8-n24mz_a87131ad-d73a-4371-a47b-dcc410f3b6e4/contour/3.log
lrwxrwxrwx 1 root root 100 Sep 13 21:31 coredns-74ff55c5b-mjdlr_kube-system_coredns-60bd5f49def85a0ddc929e2c2da5c793a3c6de00cd6a81bdcfdb21f3d4f45129.log -> /var/log/pods/kube-system_coredns-74ff55c5b-mjdlr_7ef260c1-308e-4162-8a84-231d560f8023/coredns/3.log

I've also tried running the DaemonSet with:

securityContext:
  privileged: true

Similar issues I found, but which don't provide the solution for this issue:

#3857

#4014

Your help would be much appreciated. Thanks

zy286938348 commented 2 years ago

@rguske I also have this problem, how did you solve it?

rguske commented 2 years ago

@rguske I also have this problem, how did you solve it?

Hi @zy286938348, I have no clue at the moment, but I will follow some of the suggestions made here: https://fluent-all.slack.com/archives/C0CTQGHKJ/p1637005220493300 Please keep me posted if you have more luck finding the solution for this particular problem. Thanks

rguske commented 2 years ago

Hi @zy286938348, which container runtime are you using?

kc-dot-io commented 2 years ago

@rguske @zy286938348 I am also seeing this issue using containerd as my runtime. It came up ~16 days after my first deployment of Fluent Bit 1.8.1 on a fresh k8s 1.21 cluster on EKS. My Fluent Bit configuration is basically the default Helm chart install with the kubernetes parser, and that's about it. It looks like I can't access the Slack thread above, @rguske - let me know if I can share more info to help figure this out.

kc-dot-io commented 2 years ago

I think this is the RCA for me: https://github.com/fluent/fluent-bit/issues/3873

When I removed Ignore_Older the issues I was seeing went away quite quickly.

Please tag me if you are able to try this and confirm it works for you too.
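
For anyone comparing against their own config: below is a minimal sketch of a tail input with no Ignore_Older entry, mirroring the input-kubernetes.conf from the report above. A config affected by #3873 would additionally carry a line such as Ignore_Older 1d; that value is purely illustrative.

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Parser            docker
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    # note: no Ignore_Older here; when set, files whose modification time is
    # older than the given threshold are skipped by the scanner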

rguske commented 2 years ago

Hi @kc-dot-io, I appreciate your comment and help. I wanted to update this issue with my latest observations already last week, but it was a busy one. However, it's interesting that you mention seeing this issue with containerd, since I found that changing the runtime from Docker to containerd finally produced positive results. Consequently, I was fairly sure it was due to the old dockershim. Regarding the use of Ignore_Older - I'm sure I never added it to the ConfigMap, but I will double-check next week. Again, thanks a lot.

teru01 commented 2 years ago

I also have this issue. Upgrading Fargate 1.3 (using Docker as the runtime) to 1.4 (using containerd) did not help.

kc-dot-io commented 2 years ago

Circling back to this: I'm no longer certain that Ignore_Older is the root cause for me, and yes, I can confirm I'm using containerd.

The issue still appears to occur from time to time, and while it's usually not limited to a single node, I have observed cases where one node works while another doesn't. I'm now testing increasing the resources allocated to the Fluent Bit replicas, since the issue appears to come and go.

If anyone has any more ideas, I'm happy to troubleshoot it with my setup. Please do ping me in follow ups.

srideok2 commented 2 years ago

Are there any updates on this issue? I am facing the same on a K8s 1.21 deployment with Fluent Bit v1.8.12 and Elasticsearch.

kc-dot-io commented 2 years ago

@srideok2 I don't think there is a permanent solution yet but I've talked to a few other people who had issues and we've had luck with reconfiguring our fluentbits to get to a place where things were stable.

From what I can see the issue mostly happens when buffer sizes get exceeded due to large log lines.

The solution I used, and which worked for at least one other person, was to turn on the option for skipping long lines and then raise the buffer sizes on the inputs and filters.
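
As an illustration of that approach, a minimal sketch follows; the buffer values are examples only, not recommendations. Buffer_Chunk_Size and Buffer_Max_Size belong to the tail input, while Buffer_Size is the kubernetes filter's own option.

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Skip_Long_Lines   On
    # raise the per-file read buffers so long lines fit
    Buffer_Chunk_Size 1M
    Buffer_Max_Size   8M

[FILTER]
    Name        kubernetes
    Match       kube.*
    # HTTP buffer used when reading pod metadata from the Kubernetes API
    Buffer_Size 2M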

IIRC there is another open issue that speaks to this at a lower level, and I think there are multiple different symptoms of the same root cause here, which is basically that when the memory buffer is exceeded, fluent-bit just silently stops working until it can be reset.

Ping me on Slack community if you have any questions. I hope this helps.

rguske commented 2 years ago

Hi all, it finally worked for our project after replacing Docker with Containerd and by applying the following config: https://github.com/vmware-samples/vcenter-event-broker-appliance/tree/development/files/configs/fluentbit/templates

ankitkr commented 2 years ago

Has this been resolved? This seems to be related to the inode entry not being removed during the file removal process, i.e. the entry is not removed from files_static / files_event (see https://github.com/fluent/fluent-bit/blob/master/plugins/in_tail/tail_config.h#L129 and https://github.com/fluent/fluent-bit/blob/master/plugins/in_tail/tail_file.c#1108).

ramatudaan commented 2 years ago

@edsiper can you please give us some pointers to debug this issue? As I understand it, https://github.com/fluent/fluent-bit/blob/cd92cf36099e69c09b0448e6bc0032bd4ac81ac5/plugins/in_tail/tail_file.c#L861 this method is returning -1 and the files are getting dismissed. Can you please help with some pointers on understanding when this method returns -1?

edsiper commented 2 years ago

There is a bug in v1.8.13, we have just published the new container image for v1.8.14, please upgrade:

fluent/fluent-bit:1.8.14

danielserrao commented 2 years ago

I'm using version 1.9.0 and I'm still getting dismissed logs. This doesn't really seem fixed.

andutt commented 2 years ago

Same issue here: OpenShift 4, CoreOS and CRI-O. Any input regarding this would be great.

ramatudaan commented 2 years ago

@edsiper can you please give us some pointers to debug this issue? As I understand it,

https://github.com/fluent/fluent-bit/blob/cd92cf36099e69c09b0448e6bc0032bd4ac81ac5/plugins/in_tail/tail_file.c#L861

this method is returning -1 and the files are getting dismissed. Can you please help with some pointers on understanding when this method returns -1?

This got fixed when we started using 1.9.0. @danielserrao

andutt commented 2 years ago

@ramatudaan I switched my fluent-bit image to 1.9.0 and my issue now seems to be security/permission related, or a misconfiguration of the templates to match OpenShift 4 CoreOS/CRI-O. I created an issue on it: here

Lord-Y commented 2 years ago

Hello guys, is there any ETA on this, @edsiper? It sounds like a lot of issues have been opened about this topic, and the 1.8.x Debian packages are not available anymore. None of the 1.9.x packages fix the issue.

ahvahsky2008 commented 2 years ago

Same issue guys, tail logging does not work.

kashirindynatech commented 2 years ago

Having the same issue. Tried both the http output to Logstash and the es plugin to Elasticsearch.

sxyandapp commented 2 years ago

Fluent Bit version: 1.9.3. Problem: when collecting logs with the fluent-bit tail plugin, Fluent Bit stops collecting logs after running for a while. Cause: when merging logs with multiline.parser, cont_state can merge many log lines into a single line, which causes tail to stop collecting. Approach: when multiline.parser merges logs, it first uses start_state to decide whether a line is a start line and then uses cont_state to decide whether the following lines should be merged, so start_state and cont_state should use mutually exclusive logic.

Correct example:
rule "start_state" "/^(\d{4})(.)/" "cont"
rule "cont" "/^(?!\d{4})(.)/" "cont"

Wrong example:
rule "start_state" "/^(\d{4})(.)/" "cont"
rule "cont" "/(.)/" "cont"

Since making this change, I have not run into the problem of tail stopping collection.
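
For reference, the two rules quoted above would live in a multiline parser definition roughly like the following sketch. The parser name and the 4-digit-year start pattern are illustrative, not from the original comment; a tail input would then reference it via multiline.parser my_multiline.

[MULTILINE_PARSER]
    name          my_multiline
    type          regex
    flush_timeout 1000
    # start_state matches lines beginning with a 4-digit year;
    # cont only matches lines that do NOT, so the two states never overlap
    rule      "start_state"   "/^(\d{4})(.*)/"     "cont"
    rule      "cont"          "/^(?!\d{4})(.*)/"   "cont"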

Jonathan-w6d commented 2 years ago

Has anyone tried turning Skip_Long_Lines off? Some of my logs were dismissed because they were too long, I guess; this did the trick.

zulrang commented 2 years ago

I'm having this issue as well on 1.9.6. The problem is that I NEED the long lines to be processed.

kc-dot-io commented 2 years ago

@Jonathan-w6d yes, this helped for me, as did increasing the max buffer size. That said, you'll still run into an issue with a memory leak and eventually need to restart the process. I have to do this every 90-120 days.

@zulrang Try increasing your max buffer size to something very large - you'll probably still have to restart the processes every so often, but in theory, if the buffer is large enough, it should be able to process long lines until it can't again.

zulrang commented 2 years ago

@zulrang Try increasing your max buffer size to something very large - you'll probably still have to restart the processes every so often, but in theory, if the buffer is large enough, it should be able to process long lines until it can't again.

I have all buffers set to 2M and it has never successfully flushed chunks from one specific node - the one that is logging stack traces (about 8k long). The other 13 work fine.

scostelloppb commented 2 years ago

I'm having the same issue and have tried all suggestions in this thread bar containerd. Tail is dismissing the one and only log I want it to tail. I have tried downgrading to 1.6.10 and upgrading to various versions up to the latest. The problem occurs in every version I try.

EKS 1.21, Docker, Amazon Linux 2 node group, latest Helm install from Bitnami

kc-dot-io commented 2 years ago

I have all buffers set to 2M and it has never successfully flushed chunks from one specific node - the one that is logging stack traces (about 8k long). The other 13 work fine.

@zulrang did you try going higher? I went all the way up to ~25M using the tail input, k8s filter and skip long lines off before it was stable.

zulrang commented 2 years ago

Either there's something seriously wrong with Fluent Bit or I'm fundamentally misunderstanding how it works, if it requires a 25M buffer to process a few 10k-long lines.

Jonathan-w6d commented 2 years ago

@kc-dot-io I have a 100M buffer, Skip_Long_Lines is Off, and yesterday I had dismissed lines again; the buffer was full and I had to do a rollout restart of Fluent Bit. I don't like this.

kc-dot-io commented 2 years ago

@Jonathan-w6d Yeah, it's definitely not ideal, and that sounds quite similar to what I see. I still see the scan_blog add(): dismissed messages in the logs, but I feel like that might be a red herring, because I don't seem to be missing anything meaningful once I raise the buffers and turn skip long lines off. From the research I did, I think it all comes back to an issue buried in this repo that talks about a buffer overflow. I was thinking about creating a k8s cron that restarts the deployment every 60 days or something, as sketched below. Not ideal, but it should mitigate the problem. Thoughts on that?
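
A rough sketch of what such a periodic restart could look like, assuming Fluent Bit runs as the fluent-bit DaemonSet from the manifests above. The schedule, image, and ServiceAccount name are assumptions, and the ServiceAccount needs RBAC permission to patch the DaemonSet.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: fluent-bit-restart
  namespace: vmware-system
spec:
  schedule: "0 4 1 */2 *"   # roughly every two months
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: fluent-bit-restarter   # hypothetical SA with patch rights on the DaemonSet
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:1.21              # assumed image
            command: ["kubectl", "rollout", "restart", "daemonset/fluent-bit", "-n", "vmware-system"]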

Jonathan-w6d commented 2 years ago

@kc-dot-io I guess this would do the trick for you, but I really don't like this idea; with a stable, fully functioning tool you would never have to do that. It was my understanding that Fluent Bit was a stable tool, but I realise that's not the case. The logs are so verbose, yet the buffers keep filling up with no warning, and lines get dismissed with no explanation whatsoever... The only answer in this issue from a staff member was on March 18th...

I think I'm going to have to get rid of Fluent Bit

ozhankaraman commented 2 years ago

I have also lost my respect for this product because of this bug

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

Jonathan-w6d commented 1 year ago

Not stale: the issue is still here and is still the reason most of us do not use Fluent Bit anymore to this day

marratj commented 1 year ago

I have the same issue, both with 1.9.x and with 2.0.x. Setting Max_Buffer_Size to an arbitrarily high value doesn't help at all, and Skip_Long_Lines doesn't make any difference either.

Even with no filter in between and simply routing to stdout, there's nothing - just the scan_blog add(): dismissed debug message for the affected files and then nothing.

The lines in the affected files are not that long; the longest log line in those files is currently around 500 characters, a simple JSON string. I don't see how that should be a problem.

leonardo-albertovich commented 1 year ago

Do you think you could put together a repro case @marratj? That would make it much easier / faster to diagnose and fix.

marratj commented 1 year ago

It's a bit special in our case, as we are watching files on a filesystem mounted via CephFS inside a single Fluent-Bit Pod running on Kubernetes.

Read_from_Head outputs the initial content just fine, but then no updates to the files are processed.

However, I found a workaround that just works all of a sudden: when I set Inotify_Watcher to false in the tail input, it picks up all new lines written to the log files just fine. Debug logging still prints the scan_blog add(): dismissed messages for those files, but apart from that it now works fine with the file stat watcher instead of inotify.

So in our case I can only assume that the inotify watcher is not properly working in conjunction with CephFS.
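
A minimal sketch of that workaround (the path is hypothetical; Inotify_Watcher defaults to true, so it has to be disabled explicitly):

[INPUT]
    Name             tail
    Path             /mnt/cephfs/logs/*.log
    Read_from_Head   On
    # fall back to the stat-based file watcher instead of inotify
    Inotify_Watcher  false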

leonardo-albertovich commented 1 year ago

I don't know anything about CephFS but it doesn't surprise me that disabling inotify helps. I'll check the error message on Monday to see if it's related to it but I'm curious about the other users too, I wonder if they are using a similar setup.

ozhankaraman commented 1 year ago

CephFS is a network-mounted filesystem; AFAIK inotify will not work because the kernel is not aware of filesystem changes in a directory, so it can't notify any watching process - maybe that's the issue. If the other users are also on a network filesystem, this could be the case.

leonardo-albertovich commented 1 year ago

If there are people who are still struggling with this issue, please let us know so we can try to put together a reproduction case or have those of you who are able to run a custom build with additional logging.

From what I could see at first glance, the error string scan_blog add(): dismissed is only printed when flb_tail_file_append fails, so:

  1. Most of the exit paths of this function produce logs either directly (for most cases) or indirectly when set_file_pos fails
  2. The only two paths that don't log any errors are the stat type check and the already tailed hash check.

The stat type check is also performed by the caller tail_scan_path, and the stat instance is the same, so I couldn't get the code to take this path even by artificially injecting delays and replacing what the link pointed to.

As for the hash check, I took the failure log lines provided in the original report, and since the function that creates the hash uses a combination of the device id and the file inode, I brute-forced the combinations of device ids 1-255 with the inodes he provided but found no collisions (obviously this is an oversimplification, but it's a decent enough first approach).

thusharajayakody1 commented 1 year ago

Having the same issue; logs are being dropped entirely on one node.

patrick-stephens commented 1 year ago

There is a possibly related PR to improve collection of logs: https://github.com/fluent/fluent-bit/pull/6552 It doesn't necessarily resolve this (the symptoms there are general missing logs rather than the message above), and it is in current master builds, so it should be in 2.0.9+ (for now you will have to use a non-production-ready master build).

john-yan commented 1 year ago

Hello all,

I am facing the same issue here using v2.0.8.

Running fluent-bit on k3s on Debian 11. All files are on a native SATA SSD with an EXT4 filesystem.

I was able to trace the root cause to this readlink call: https://github.com/fluent/fluent-bit/blob/v2.0.8/plugins/in_tail/tail_file.c#L1543

I wrote a small C program to demonstrate the behaviour, as follows:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/limits.h>
#include <errno.h>

int main() {
  char * path = "/var/log/containers/traefik-5758f4779c-g57zc_traefik_traefik-ad1fd81239e26f0cf1713977f526bb8b94be95062830d361ef0ff09ef9c22db3.log";
  FILE * f = fopen(path, "r");
  int fd = fileno(f);
  printf("file descriptor is %d\n", fd);
  char tmp[128];
  char * buf = malloc(PATH_MAX);
  if (!buf) {
    printf("can't malloc\n");
    return 0;
  }
  int ret = snprintf(tmp, sizeof(tmp) - 1, "/proc/%i/fd/%i", getpid(), fd);
  printf("ret=%d, tmp=%s\n", ret, tmp);

  // fluent bit's method of using /proc will fail
  int s = readlink(tmp, buf, PATH_MAX);
  printf("s=%d, err=%d\n", s, errno);

  // use the real path will work
  s = readlink(path, buf, PATH_MAX);
  printf("s=%d, err=%d\n", s, errno);
  return 0;

}

Here's the result:

root@fluent-bit-4gjpq:/# readlink /var/log/containers/traefik-5758f4779c-g57zc_traefik_traefik-ad1fd81239e26f0cf1713977f526bb8b94be95062830d361ef0ff09ef9c22db3.log
/var/log/pods/traefik_traefik-5758f4779c-g57zc_dd7776dd-b8a0-4acf-a130-ec10427ab19a/traefik/0.log
root@fluent-bit-4gjpq:/# ls -l /var/log/containers/traefik-5758f4779c-g57zc_traefik_traefik-ad1fd81239e26f0cf1713977f526bb8b94be95062830d361ef0ff09ef9c22db3.log
lrwxrwxrwx 1 root root 97 Jan 31 17:36 /var/log/containers/traefik-5758f4779c-g57zc_traefik_traefik-ad1fd81239e26f0cf1713977f526bb8b94be95062830d361ef0ff09ef9c22db3.log -> /var/log/pods/traefik_traefik-5758f4779c-g57zc_dd7776dd-b8a0-4acf-a130-ec10427ab19a/traefik/0.log
root@fluent-bit-4gjpq:/# gcc -o test test.c && ./test 
file descriptor is 3
ret=13, tmp=/proc/91/fd/3
s=-1, err=2
s=97, err=2

Investigating the issue further, I found that removing the /proc mount allows the readlink call to work properly. But the issue is that some other plugins need /proc to be mounted, e.g. https://github.com/fluent/fluent-bit/blob/v2.0.8/plugins/in_cpu/cpu.c#L95.

daemonSetVolumes:
  - name: varlog
    hostPath:
      path: /var/log
  # - name: proc
  #   hostPath:
  #     path: /proc
...
daemonSetVolumeMounts:
  - name: varlog
    mountPath: /var/log
  # - name: proc
  #   mountPath: /proc

Results:

root@fluent-bit-pb4gx:/# gcc -o test test.c && ./test  
file descriptor is 3
ret=13, tmp=/proc/28/fd/3
s=97, err=0
s=97, err=0

So in order to allow the tail plugin to work properly, we can't have other plugins that need /proc to be mounted. It would be nice if we could mount /proc to a different path like /host_proc to avoid this issue. Maybe the temporary workaround is to have separate fluent-bit pods collect the different metrics?

ashutosh-maheshwari commented 1 year ago

Same issue in v2.0.10.

[2023/04/04 21:06:11] [debug] [input:tail:tail.3] scan_blog add(): dismissed: /logs/server.out, inode 277548
[2023/04/04 21:06:11] [debug] [input:tail:tail.4] scan_blog add(): dismissed: /logs/server.FATAL, inode 262809

Configuration:

    Buffer_Chunk_Size 10M
    Buffer_Max_Size 10M
    Skip_Long_Lines On
    Mem_Buf_Limit 30M

@edsiper, please let me know which stable version to use for the tail plugin. All the logs are getting dropped. So many people are looking for a solution to this bug.

ashutosh-maheshwari commented 1 year ago

Most versions of Fluent-bit are affected by this bug. I have tested versions 1.2.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0, 2.0.6, 2.0.8, and 2.0.10, and they all exhibited the same issue. The only versions that worked for me were 1.2.x, and 1.3.x. @edsiper @patrick-stephens Can you please review the code for the tail plugin in version 1.3.7? You might find the bug.

neugeeug commented 1 year ago

Hello,

We are facing the same issue using v2.1.1. Unfortunately it is a critical blocker for us :(

patrick-stephens commented 1 year ago

Is it the same issue identified here? https://github.com/fluent/fluent-bit/issues/4155#issuecomment-1416150129

The gist is a readlink failure with /proc mounts - is that what you have @neugeeug ?

neugeeug commented 1 year ago

Is it the same issue identified here? #4155 (comment)

The gist is a readlink failure with /proc mounts - is that what you have @neugeeug ?

No, we have just a simple setup: fluent-bit runs as a sidecar in an AWS ECS cluster. After some time, we realized that some long log entries were missing. Debug mode showed lots of messages:

2023-04-28T17:12:10.580+02:00 | [2023/04/28 15:12:10] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /jcore/logs/config.log, inode 1748378
2023-04-28T17:12:10.580+02:00 | [2023/04/28 15:12:10] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /jcore/logs/connector.log, inode 1748381

We have been trying to play with the memory configuration (Buffer_Max_Size, Buffer_Chunk_Size, Static_Batch_Size, Mem_Buf_Limit) and to use an older version, 1.8.14. Nothing has helped so far.

ashutosh-maheshwari commented 1 year ago

Is it the same issue identified here? #4155 (comment) The gist is a readlink failure with /proc mounts - is that what you have @neugeeug ?

No, we have just a simple setup: fluent-bit runs as a sidecar in an AWS ECS cluster. After some time, we realized that some long log entries were missing. Debug mode showed lots of messages:

2023-04-28T17:12:10.580+02:00 | [2023/04/28 15:12:10] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /jcore/logs/config.log, inode 1748378
2023-04-28T17:12:10.580+02:00 | [2023/04/28 15:12:10] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /jcore/logs/connector.log, inode 1748381

We have been trying to play with the memory configuration (Buffer_Max_Size, Buffer_Chunk_Size, Static_Batch_Size, Mem_Buf_Limit) and to use an older version, 1.8.14. Nothing has helped so far.

@neugeeug Can you try using v1.3.7?