fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Tail input plugin not fetching all logs when files are rotated quickly under heavy load #2110

Open thepythonista opened 4 years ago

thepythonista commented 4 years ago

Bug Report

Describe the bug: The tail input plugin is not able to tail files when file rotation happens. I used a container that generates 1,000,000 lines and logs them to stdout, and configured fluent-bit to tail the log files and print them to standard output. The tail plugin only picks up ~200,000 lines (only from the first file); it is not able to fetch lines after rotation happens under heavy load.

To Reproduce

Expected behavior: Generate 1,000,000 log lines, write them to standard output, and expect fluent-bit to tail and fetch all 1,000,000 lines.

Your Environment

Additional context

fluent-bit configuration

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: default
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config-ds
  namespace: default
  labels:
    k8s-app: fluent-bit
data:
  # Configuration files: server, input, filters and output
  # ======================================================
  # 

  fluent-bit.conf: |
    [SERVICE]
        Flush                     1
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               On
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        storage.path              /var/log/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 500M

    @INCLUDE input-kubernetes.conf
    @INCLUDE output-stdout.conf

  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/plaintext-*.log
        DB                /var/log/flb_kube.db
        Parser            docker
        Mem_Buf_Limit     500MB
        storage.type      filesystem
        Skip_Long_Lines   On
        Refresh_Interval  1
        Rotate_Wait       10

  output-stdout.conf: |
    [OUTPUT]
        Name                          stdout
        Match                         kube.*
        Format                        json_lines
        json_date_format              iso8601

  parsers.conf: |
    [PARSER]
        Name          stdout
        Format        json
        Time_Key      time
        Time_Format   %Y-%m-%dT%H:%M:%S.%L
        Time_Keep     On
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fl-exp
  namespace: default
  labels:
    k8s-app: fluent-bit-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit-logging
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
        version: v1
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:1.4.2
        imagePullPolicy: IfNotPresent
        ports:
          - containerPort: 2020
        resources:
          requests:
            cpu: 1000m
            memory: 1024Mi
          limits:
            cpu: 1000m
            memory: 1024Mi
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config-ds
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"

Metrics from one of the run

# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="storage_backlog.1"} 0 1587153190798
fluentbit_input_bytes_total{name="tail.0"} 109649569 1587153190798
# TYPE fluentbit_input_files_closed_total counter
fluentbit_input_files_closed_total{name="tail.0"} 1 1587153190798
# TYPE fluentbit_input_files_opened_total counter
fluentbit_input_files_opened_total{name="tail.0"} 2 1587153190798
# TYPE fluentbit_input_files_rotated_total counter
fluentbit_input_files_rotated_total{name="tail.0"} 0 1587153190798
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="storage_backlog.1"} 0 1587153190798
fluentbit_input_records_total{name="tail.0"} 567607 1587153190798
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="stdout.0"} 0 1587153190798
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="stdout.0"} 109649569 1587153190798
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="stdout.0"} 567607 1587153190798
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="stdout.0"} 0 1587153190798
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="stdout.0"} 0 1587153190798
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1587151599

The plugin is not able to keep up if the fluentbit_input_bytes_total rate goes beyond 100 KB per second. I used the query rate(fluentbit_input_bytes_total{name="tail.0"}[5m]).

sneben commented 4 years ago

We encounter the same behaviour. If logs are written to STDOUT under high load, fluent-bit stops working on Docker log rotation and resumes on the next rotation. So for one rotation interval it works, for the next interval it does not, and so on.

Further, we did some small tests, which lead us to conclude that the problem occurs between 1000 and 1250 log lines per second per node. The log size does not seem to matter: we tested with 132 and 446 characters and saw the same behaviour.

Environment:

Tests

It seems that the ~1000 msg/s threshold applies per host, regardless of the number of pods producing the logs. We also played around with the options for the tail plugin in every combination, with no effect.

edsiper commented 4 years ago

How often is rotation happening? What are the chances the file gets rotated in between a refresh_interval?

sneben commented 4 years ago

I did 3 tests. The first 2 with 32MB rotation size:

And one with 128MB rotation size:

I set Refresh_Interval to 1s for some tests as well, which had no effect.

thepythonista commented 4 years ago

Please note this happens only if the tail plugin is configured to tail files from /var/log/containers, which contains symlinks.

In one of my other test runs in minikube, the Docker daemon.json was configured to rotate every 50 MB and keep the latest 5 files. 1,000,000 lines were generated, which took more than 10 seconds, and my refresh interval was 1 second.

ls -lh --time-style=full-iso 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b/
-rw-r----- 1 root root  25M 2020-04-24 19:46:31.559009075 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log
-rw-r----- 1 root root  48M 2020-04-24 19:46:16.625763749 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.1
-rw-r----- 1 root root  48M 2020-04-24 19:45:49.753334243 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.2
-rw-r----- 1 root root  48M 2020-04-24 19:45:19.867398737 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.3

There is a reasonable gap of time between when the files are rotated. Only the contents of 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.3 were accessible. The rest of the files were not tailed.

Input Configuration:

[INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/plaintext-*.log
        DB                /var/log/flb_kube.db
        Parser            docker
        Mem_Buf_Limit     500MB
        Skip_Long_Lines   On
        Refresh_Interval  1
        Rotate_Wait       10
v-kay commented 4 years ago

Hey @thepythonista, @sneben ,

I am facing a similar kind of issue. I am running fluent-bit to print logs to the stdout of a k8s pod and have noticed that if a file is complete (i.e. the application is done writing logs to it), fluent-bit parses the complete file and prints it to stdout. But if the application is still writing logs to the file, fluent-bit reads only a few lines in one go when it scans the path; after that, it never prints the logs that are newly added to the file.

Is this related to the explanation above, where the pod is writing more than 1000 msg/s?

Any suggestion ?

rashmichandrashekar commented 3 years ago

I am seeing the same issue under high scale. Any workarounds/updates?

fkoehler commented 3 years ago

We are experiencing the same issue.

I sent in 200k log messages. We are using Docker in Kubernetes. The Docker log rotation rotates every 50 MB with 5 total files max. I have increased Rotate_Wait to 240, and Refresh_Interval is 10.

Some file rotations are detected, others not. Later the refresh finds rotated files, which also feels wrong, because all of them were rotated.

I think the problem could be that Docker rotates by renaming all the other files: when a file is rotated it gets index 1, and all the other files are renamed with their index increased by one. Also, fluent-bit should not treat these renamed files as new files, since they were just rotated.

Not sure how to reproduce yet in a simple manner.

dewaldmienie commented 3 years ago

I have a similar issue (not sure if it is related or a different issue).

I have a PoC with low volumes (200 events), but Fluent Bit's tail input only fetches the first few log entries in each log file and then hangs. Then, when I restart Fluent Bit, it fetches the rest of the logs from the files. See this stack overflow entry for more details.

ganga1980 commented 3 years ago

/watch

pboennig commented 3 years ago

I am also running into a similar issue in which log files that are appended to are not captured by tail and sent to Elasticsearch. Decreasing the refresh_interval, if anything, makes things worse, as td-agent-bit sees the file before anything is added to it and then promptly ignores any future updates.

Has anyone in this thread found a workaround?

pboennig commented 3 years ago

For my specific issue, I found it was actually a case of #3111, as my logs were stored in AWS EFS. Moving the logs outside of EFS solved the issue.

wlawton commented 3 years ago

I've got the same problem as described by @thepythonista. We only want fluent-bit to forward our application-specific logs, so we have a tail path configured like this:

/var/log/containers/*_our-app-*.log

The /var/log/containers dir contains a symlink for each container e.g. xyz-blue-6bc9c664f4-sg8b4_our-app-blue_xyz-blue-5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb.log

...which points to another symlink, e.g. /var/log/pods/our-app-blue_xyz-blue-6bc9c664f4-sg8b4_9b5687c0-d62c-4e82-af28-c032069eb84c/xyz-blue/0.log

...which in turn points to the container log file currently being written to, e.g. /var/lib/docker/containers/5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb/5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb-json.log.

When Docker (or json-file) rotates the log file, it suffixes the current log file name with .1 and renames the previous .1 file to .2, and so on. I'm wondering if the chain of symlinks is preventing fluent-bit from accessing, or being aware of, the .log.1 file. Under high load we can see that the logs at the end of the *.log.1, *.log.2 etc. files have not been forwarded. Our assumption is that fluent-bit has fallen behind reading logs from the current log file, so when the file is renamed, fluent-bit can no longer access it or is unaware of it, due to the use of symlinks that only direct it to the current log file.

For what it's worth, we are using fluent-bit image 1.7.4. Here is my service and input config:

    [SERVICE]
        Flush 1
        Daemon Off
        Log_Level debug
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        #
        # HTTP server for providing metrics endpoint
        #
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port {{ .Values.service.port }}
        storage.path /var/log/flb-storage/
        storage.sync normal
        storage.checksum off
        storage.backlog.mem.limit 20M
     [INPUT]
        Name             tail
        Path             "/var/log/containers/*_our-app-*.log"
        Exclude_Path     "fluent*.log"
        Parser           docker
        Tag_Regex        (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
        Tag              ourcompany.k8s.<namespace_name>.<container_name>.<pod_name>.<docker_id>-
        Refresh_Interval 5
        Mem_Buf_Limit    500MB
        Skip_Long_Lines  On
        DB               /tmp/tail-containers-state.db
        DB.Sync          Normal
        Docker_Mode      On
        storage.type     filesystem
        Rotate_Wait      15
srideok2 commented 2 years ago

Hi, is there any specific solution for this issue? I am also facing a similar issue where fluent-bit is trying to scrape logs from a mounted NFS dir. If I restart fluent-bit, it will tail the logs again but stops after a few minutes.

AlessioCasco commented 2 years ago

Hello

We have the same issue on 1.8.13

srideok2 commented 2 years ago

@AlessioCasco, for me this was resolved after setting the inotify watcher to false. Check this link for more details.

        [INPUT]
            Name              tail
            Inotify_Watcher   false
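
(For reference, with Inotify_Watcher set to false the tail input stops relying on inotify events and picks files up through its periodic scan on Refresh_Interval. A fuller sketch of such an input is shown below; the Path, Tag, DB and interval values are illustrative placeholders, not the poster's actual settings.)

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        DB                /var/log/flb_kube.db
        Skip_Long_Lines   On
        Refresh_Interval  5
        # disable inotify; files are picked up by the periodic scan instead
        Inotify_Watcher   false
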
Himenon commented 2 years ago

I have almost the same problem. What should the fluent-bit settings be for high loads?

maheskrishnan commented 2 years ago

Yes, this happens even for moderate loads (around 400 logs per second). Could anybody help here?

miksercz commented 2 years ago

We are also affected by this issue for pods as soon as they start logging over 1k messages per second. As soon as the load calms down, log messages are sent as expected. We use 10 MB Docker rotation and the following input conf:

[INPUT]
    Name                tail
    Tag                 kube.<namespace_name>.<pod_name>_<container_name>-<docker_id>
    Tag_Regex           ^/var/log/containers/(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
    Path                /var/log/containers/*.log
    Parser              docker
    DB                  /var/log/flb_kube.db
    Mem_Buf_Limit       10MB
    Skip_Long_Lines     On
    Refresh_Interval    10
    Docker_Mode         On
github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

aslafy-z commented 2 years ago

Not stale

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

aslafy-z commented 2 years ago

No stale

patrick-stephens commented 1 year ago

I wonder if it may be related to https://github.com/fluent/fluent-bit/pull/6552

pwhelan commented 1 year ago

I wonder if it may be related to #6552

Any chance you could take a look at it @lecaros or give your perspective here? I'll also take a deeper look later. Any chance you can test again @thepythonista to see if it still happens with the latest version?

patrick-stephens commented 1 year ago

@pwhelan I don't think this is in a release yet but @thepythonista you can test with the latest version of master container built overnight: ghcr.io/fluent/fluent-bit/master:x86_64 (don't use in production obviously).

lecaros commented 1 year ago

Agree with @patrick-stephens. Please test this with the latest image (this was initially reported in 1.4.2) and share the repro in case you still see the issue.

romanfurst commented 1 year ago

We are hitting the same issue using fluent-bit 2.0.9 on Kubernetes 1.24.

Originally our configuration was:

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Tag kube.*
        multiline.parser cri
        Mem_Buf_Limit 200MB
        Skip_Long_Lines Off
        DB /var/log/flb-storage/offset.db
        Buffer_Max_Size 1M

but we ended up with the following, which helps a little to reduce the issue:

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Tag kube.*
        multiline.parser cri
        Mem_Buf_Limit 200MB
        Skip_Long_Lines Off
        DB /var/log/flb-storage/offset.db
        Buffer_Max_Size 1M
        Progress_Check_Interval 1ms
        Inotify_Watcher false
        Rotate_Wait 15
        Refresh_Interval 1
        Buffer_Chunk_Size 2k


We made a small test case which might help pin down the issue:

At first we put our cluster node under a little stress to generate some log traffic on it. Then we deployed our control container, which emits a sample log line every 10 ms with a unique, incrementing ID, so at the end we could tell exactly which logs were missing. After the test run finished, we simply checked the contents of both the '0.log' (current) and '0.log.<timestamp>' (last rotated) files on the filesystem for our control container. The '0.log.<timestamp>' file ends with log number "76812" and '0.log' starts with log number "76813". Up to this point everything looks fine; the CRI runtime rotates the logs correctly. But if we compare this with what arrived in Elastic, records between 76803 and 76825 are missing. We assume that every time the log file is rotated, fluent-bit fails to process some number of records (at the end and at the beginning of the rotated file). In our test, this resulted in only about 74,000 of the 80,000 total logs for the control container being delivered (6,000 lost).

patrick-stephens commented 1 year ago

2.0.10 came out last week as well, which has some further improvements in it. Threaded input is now possible too, I believe, by setting threaded on.
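
(For anyone who wants to try this: a minimal sketch of a tail input with threading enabled is shown below. The Path, Tag and parser values are placeholders, not a recommended configuration.)

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        multiline.parser  cri
        # run this input in its own thread so heavy tailing does not
        # compete with the main event loop
        threaded          on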

romanfurst commented 1 year ago

With threaded on we are hitting 7071. By the way, right now we are running on v2.1.2 and, so far, it looks like there is no improvement. The issue with quick file rotation still seems to be relevant.

liguangcheng commented 1 year ago

Logs may be lost under high load. Fluent-bit relies on inotify to passively receive file-change notifications, and for each notification it only reads a few lines. Therefore, under high load, the data source keeps writing to the files and generating change events, while fluent-bit only processes a few lines per notification; obviously it cannot keep up with the writing speed.

kramreddy20 commented 1 year ago

Here is a snippet with the file sizes and inode numbers, with a sleep time of 2 minutes (a custom pod which writes logs every 2 minutes) between [A] and [B]:

total 27M--------->[A]
131256242 -rw-r----- 1 root root  15M Aug 10 08:20 0.log.20230810-082034
131256244 -rw-r----- 1 root root 8.8M Aug 10 08:24 0.log
 93354974 drwxr-xr-x 4 root root   44 Aug 10 08:14 ..
131256234 drwxr-xr-x 2 root root   48 Aug 10 08:20 .
Thu Aug 10 08:25:53 UTC 2023
===========sleep 2m===============
total 14M--------->[B]
131256244 -rw-r----- 1 root root  13M Aug 10 08:26 0.log.20230810-082646
131256245 -rw-r--r-- 1 root root 131K Aug 10 08:26 0.log.20230810-082034.gz
131256242 -rw-r----- 1 root root 575K Aug 10 08:26 0.log
 93354974 drwxr-xr-x 4 root root   44 Aug 10 08:14 ..
131256234 drwxr-xr-x 2 root root   80 Aug 10 08:26 .
Thu Aug 10 08:27:53 UTC 2023

Here are the fluentbit pod logs

[2023/08/10 08:26:46] [debug] [input:tail:tail.0] inode=131256244 rotated IN_MOVE SELF '/var/log/containers/ehl-logging-cloudwatch-proxy-0_edison-system_aws-boto-client-9cca8a79493ebe0b085f68aaa58a0e32db3df33353d936b4356c5c48e4d8386f.log'
[2023/08/10 08:26:46] [debug] [input:tail:tail.0] inode=131256244 rotated /var/log/containers/ehl-logging-cloudwatch-proxy-0_edison-system_aws-boto-client-9cca8a79493ebe0b085f68aaa58a0e32db3df33353d936b4356c5c48e4d8386f.log -> /var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646
[2023/08/10 08:26:46] [ info] [input:tail:tail.0] inode=131256244 handle rotation(): /var/log/containers/ehl-logging-cloudwatch-proxy-0_edison-system_aws-boto-client-9cca8a79493ebe0b085f68aaa58a0e32db3df33353d936b4356c5c48e4d8386f.log => /var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646
[2023/08/10 08:26:46] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=131256244 watch_fd=98
[2023/08/10 08:26:46] [ info] [input:tail:tail.0] inotify_fs_add(): inode=131256244 watch_fd=99 name=/var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646

[2023/08/10 08:26:54] [debug] [input chunk] update output instances with new chunk size diff=10594
[2023/08/10 08:26:55] [debug] [input:tail:tail.0] inode=131256244 purge rotated file /var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646 (offset=13180488 / size = 13180488)
[2023/08/10 08:26:55] [debug] [input:tail:tail.0] inode=131256244 removing file name /var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646
[2023/08/10 08:26:55] [debug] [input:tail:tail.0] db: file deleted from database: /var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646
[2023/08/10 08:26:55] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=131256244 watch_fd=99
[2023/08/10 08:26:55] [debug] [input:tail:tail.0] inode=44146728 events: IN_MODIFY

=================================

From the above logs it is clear that fluent-bit removes the inode entry for the file that was rotated. Logs stop at the exact moment the inode number of the already-archived file is reused for 0.log [B]. If 0.log were assigned a brand-new inode number after rotation, then fluent-bit would pick the file up again and send the logs to the destination, which is what I am expecting.

marlovil commented 1 year ago

Same problem here. In our case, with these settings we've solved the issue:

        Buffer_Max_Size 1M
        Refresh_Interval 1
        Rotate_Wait      15
        Inotify_Watcher false
liguangcheng commented 1 year ago

The default value of Buffer_Chunk_Size and Buffer_Max_Size is 32K. If you find logs missing under heavy load, you can try increasing these two parameters.

After going through the fluent-bit code, I found that these two parameters take effect per monitored file. If their values are too small while the traffic is high, the rate at which file data is read cannot catch up with the traffic rate, and to a certain extent this will cause logs to be lost.

So, size these two parameters according to the traffic rate; with larger values, fluent-bit can consume more file content per file from the buffer. Since fluent-bit sends data out in batches, the larger the buffer, the greater the sending capacity.
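
(As an illustration of this suggestion, the sketch below raises both buffer limits well above the 32K default. The 512k/5M values are examples only and would need to be sized against the actual log rate; the rest of the input is a placeholder, not taken from this thread.)

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        DB                /var/log/flb_kube.db
        Skip_Long_Lines   On
        # example values only: larger per-file buffers let each read
        # consume more data before the next rotation
        Buffer_Chunk_Size 512k
        Buffer_Max_Size   5M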

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

metost commented 10 months ago

Not stale

Aslan-Liu commented 9 months ago

Doesn't anyone think this is basic functionality for fluent-bit? Why has this big issue been open for 4 years?

liguangcheng commented 8 months ago

@Aslan-Liu

Doesn't anyone think this is basic functionality for fluent-bit? Why has this big issue been open for 4 years?

I think everyone's actual environment is different, and the reasons for log loss may be different. The loss problem also occurred in my environment before; when I increased the two configuration values Buffer_Chunk_Size and Buffer_Max_Size, log loss almost no longer occurred. You can try changing these two values and then verify.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

metost commented 5 months ago

Not stale

edsiper commented 3 months ago

I've verified with our partners (cloud providers) and they are not seeing this issue, even at high load. Does anyone have a reproducible case?

jarmd commented 2 months ago

We have an IIS server producing around 40 MB of logs every hour, which are rotated once every hour. This makes fluent-bit 3.1.3 stop logging. The only thing we are seeing in the fluent-bit logs is the following messages:

[2024/08/30 08:57:39] [ warn] [engine] failed to flush chunk '9156-1724942194.113089100.flb', retry in 629 seconds: task_id=511, input=tail.0 > output=forward.0 (out_id=0)
[2024/08/30 08:57:39] [ warn] [engine] failed to flush chunk '9156-1724941916.543275700.flb', retry in 1604 seconds: task_id=337, input=tail.0 > output=forward.0 (out_id=0)
[2024/08/30 08:57:40] [error] [output:forward:forward.0] could not write forward header
[2024/08/30 08:57:40] [error] [output:forward:forward.0] could not write forward header
[2024/08/30 08:57:40] [ warn] [engine] failed to flush chunk '9156-1724943257.254868000.flb', retry in 1981 seconds: task_id=1545, input=tail.0 > output=forward.0 (out_id=0)
[2024/08/30 08:57:40] [ warn] [engine] failed to flush chunk '9156-1724941699.294542600.flb', retry in 557 seconds: task_id=123, input=tail.0 > output=forward.0 (out_id=0)
[2024/08/30 08:57:41] [error] [output:forward:forward.0] could not write forward header
[2024/08/30 08:57:41] [ warn] [engine] failed to flush chunk '9156-1724942745.232207800.flb', retry in 1296 seconds: task_id=1042, input=tail.0 > output=forward.0 (out_id=0)
[2024/08/30 08:57:42] [error] [output:forward:forward.0] could not write forward header
[2024/08/30 08:57:42] [ warn] [engine] failed to flush chunk '9156-1724943111.488111100.flb', retry in 1172 seconds: task_id=1400, input=tail.0 > output=forward.0 (out_id=0)

These are the only messages we have to go on, and we see nothing else. We also tried several tuning changes, but with no luck.

Fluent-bit 3.1.3 (on Windows) uses the forward protocol to send messages to a fluentd instance running on Kubernetes. Let me know if you need more info regarding this.
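
(For context, a minimal forward output of the kind described here might look like the sketch below. The host name, port and retry limit are placeholders, not values from the attached configuration.)

    [OUTPUT]
        Name        forward
        Match       *
        # placeholder endpoint for the fluentd forward input
        Host        fluentd.example.internal
        Port        24224
        # cap retries so failed chunks do not accumulate indefinitely
        Retry_Limit 5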

lecaros commented 2 months ago

Hi @jarmd, can you share repro steps? Also, I'd expect to see some other errors previous to the ones you shared in that log excerpt.

jarmd commented 2 months ago

Hi @jarmd, can you share repro steps? Also, I'd expect to see some other errors previous to the ones you shared in that log excerpt.

Hi, sure thing. I have shared the fluent-bit config and the fluent-bit logs as attached files here. It's actually quite easy to reproduce:

  1. Have a busy webpage running on Windows IIS logging between 39MB and 41MB each hour
  2. It logs to a folder where fluentbit tails all *.log files
  3. IIS rotates the logs once every hour, the format looks like:

[attached screenshot of the rotated IIS log file name format]

Unfortunately, I'm not allowed to share the log files generated by the DispatchAPI on a public site, due to company security rules. But the log files include the status codes of the DispatchAPI calls, stack traces of errors, etc.; the usual things an IIS server generates in its logs.

In the currently running fluent-bit setup we have changed the configuration to:

    Mem_Buf_Limit     100MB
    Ignore_Older      2h
    Refresh_Interval  1
    Buffer_Max_Size   40m
    Buffer_Chunk_Size 40m

These values were previously configured to:

    Mem_Buf_Limit     1000MB
    Refresh_Interval  10
    Ignore_Older      90d
    Buffer_Chunk_Size 1Mb
    Buffer_Max_Size   5Mb

For now it has been running for 24 hours with no interruptions, but let's see what happens; sometimes it takes a few days before it stops sending data.

Attached are the fluent-bit conf file and log file: fluent-bit-win-service-config.err 1.log, fluent-bit-updated.conf.zip

Best regards Jan

jarmd commented 2 months ago

After 4-6 hours of running, the fluent-bit logs start to complain about:

[2024/08/30 08:57:42] [error] [output:forward:forward.0] could not write forward header

And then it keeps spamming. Logs still flow, though, but the question is whether more tuning is needed on the fluent-bit side or on the fluentd side; I see no error logs on the fluentd side. I have attached the latest log file from fluent-bit, where you can see when the issue started and how it keeps spamming the fluent-bit logs: fluent-bit-win-service-config_03-09.err.zip

jarmd commented 2 months ago

We configured fluent-bit to generate debug logs. We are seeing the following happening on the newest version, 3.1.6:

[2024/09/11 06:53:35] [debug] [upstream] KA connection #2596 to 10.76.5.60:24224 has been assigned (recycled)
[2024/09/11 06:53:35] [error] [output:forward:forward.0] could not write forward header
[2024/09/11 06:53:35] [debug] [upstream] KA connection #2596 to 10.76.5.60:24224 is now available

I added a snippet of the logs showing this. fluent-bit-win-service-config.err.log-debug.zip

Any idea how to fix this error message?