thepythonista opened this issue 4 years ago
We encounter the same behaviour. If logs are written to STDOUT under high load, fluent-bit stops working after a docker log rotation and resumes on the next rotation. So it works for one rotation interval, not for the next, and so on.
Further, we ran some small tests, which lead us to conclude that the problem occurs between 1000-1250 log lines per second per node. The log size does not seem to matter: we tested with 132 and 446 characters and saw the same behaviour.
Environment:
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_apps.db
Mem_Buf_Limit 64MB
Skip_Long_Lines On
Refresh_Interval 10
Tests
It seems that ~1000 msg/s is the limit for one host, regardless of how many pods produce the logs. We also played around with the tail plugin options in every combination, with no effect.
How often is rotation happening? What are the chances the file gets rotated within a Refresh_Interval?
I did 3 tests. The first 2 with 32MB rotation size:
And one with 128MB rotation size:
I set Refresh_Interval to 1s for some tests as well, which had no effect.
Please note this happens only if the tail plugin is configured to tail files from /var/log/containers, which are symlinks.
In one of my other test runs in minikube, docker's daemon.json was configured to rotate every 50mb and keep the latest 5 files. 1,000,000 lines were generated, which took more than 10 seconds, and my refresh interval was 1 second.
ls -lh --time-style=full-iso 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b/
-rw-r----- 1 root root 25M 2020-04-24 19:46:31.559009075 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log
-rw-r----- 1 root root 48M 2020-04-24 19:46:16.625763749 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.1
-rw-r----- 1 root root 48M 2020-04-24 19:45:49.753334243 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.2
-rw-r----- 1 root root 48M 2020-04-24 19:45:19.867398737 +0000 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.3
There is a reasonable gap of time between when the files are rotated. Only the contents of 842e15a7c29423e29ce420bf185e547ab5adcf7d1bd7758eb87fca46b736094b-json.log.3 were accessible. The rest of the files were not tailed.
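For reference, the Docker daemon.json rotation settings described above (rotate at 50mb, keep the latest 5 files) would correspond roughly to the following; this is a sketch based on the values mentioned, not the exact file used:
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}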
Input Configuration:
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/plaintext-*.log
DB /var/log/flb_kube.db
Parser docker
Mem_Buf_Limit 500MB
Skip_Long_Lines On
Refresh_Interval 1
Rotate_Wait 10
Hey @thepythonista, @sneben ,
I am facing a similar kind of issue. I run Fluent Bit to print these logs to the stdout of a k8s pod, and I have noticed that if the files are complete (i.e. the application is done writing logs to them), Fluent Bit parses the complete file and prints it to stdout. But if the application is still writing logs to the file, Fluent Bit reads only a few lines in one go when it scans the path, and after that it never prints the logs that are newly added to the file.
Is this related to the explanation above, where the pod is writing more than 1000 msg/s?
Any suggestions?
I am seeing the same issue under high scale. Any workarounds/updates?
We are experiencing the same issue.
I sent in 200k log messages. We are using docker in Kubernetes; docker's log rotation rotates every 50 MB with 5 files max. I have increased Rotate_Wait to 240 and Refresh_Interval is 10.
Some file rotations are detected, others are not. Later, the refresh finds rotated files, which also feels wrong, because all of them were rotated.
I think the problem could be that docker rotates by renaming all the other files: when a file is rotated it gets index 1, and all other files are renamed with their index increased by one. Also, fluentbit should not treat these renamed files as new files, since they were already rotated.
Not sure how to reproduce yet in a simple manner.
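For illustration, the rename cascade described above behaves roughly like this sketch (my understanding of the json-file driver's behaviour, not Docker's actual code):
import os

def rotate(path, max_files=5):
    # Shift the old rotated files up by one index; the oldest one is overwritten.
    for i in range(max_files - 1, 1, -1):
        src, dst = f"{path}.{i - 1}", f"{path}.{i}"
        if os.path.exists(src):
            os.rename(src, dst)
    # The file currently being tailed becomes .1 ...
    os.rename(path, f"{path}.1")
    # ... and a brand-new file (new inode) is created under the original name.
    open(path, "w").close()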
I have a similar issue (not sure if it is related or a different issue).
I have a PoC with low volumes (200 events), but Fluent Bit's tail input only fetches the first few log entries in each log file and then hangs. Then, when I restart Fluent Bit, it fetches the rest of the logs from the files. See this stack overflow entry for more details.
I am also running into a similar issue in which log files that are appended to are not captured by tail and sent to Elasticsearch. Decreasing the refresh_interval, if anything, makes things worse, as td-agent-bit sees the file before anything is added to it and then promptly ignores any future updates.
Has anyone in this thread found a workaround?
For my specific issue, I found it was actually a case of #3111, as my logs were stored in AWS EFS. Moving the logs outside of EFS solved the issue.
I've got the same problem as described by @thepythonista. We only want fluentbit to forward our application-specific logs, so we have a tail path configured like this:
/var/log/containers/*_our-app-*.log
The /var/log/containers dir contains a symlink for each container e.g. xyz-blue-6bc9c664f4-sg8b4_our-app-blue_xyz-blue-5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb.log
..which points to another symlink e.g. /var/log/pods/our-app-blue_xyz-blue-6bc9c664f4-sg8b4_9b5687c0-d62c-4e82-af28-c032069eb84c/xyz-blue/0.log
which points to the container log file currently being written to e.g. /var/lib/docker/containers/5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb/5187b73bf7806eb02e9c18b4f8ff9a4fe638d70940e148acb9fab24c30f403cb-json.log.
When docker (or json-file) rotates the log file, it suffixes the current log file name with .1 and renames the previous .1 file to .2, etc. I'm wondering if the chain of symlinks is preventing fluentbit from accessing or being aware of the *.log.1 file. Under high load we can see that the logs at the end of the *.log.1, *.log.2, etc. files have not been forwarded. Our assumption is that fluentbit has fallen behind reading logs from the current log file, so when the file is renamed, fluentbit can no longer access it or is unaware of it, because the symlinks only direct it to the current log file.
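A quick way to confirm what the symlink chain resolves to is something like this sketch (the path is hypothetical and shortened, standing in for the long generated name):
import os

link = "/var/log/containers/our-app-example.log"  # hypothetical symlink name

# readlink follows only one hop; realpath resolves the whole chain.
print(os.readlink(link))       # e.g. /var/log/pods/.../our-app/0.log
print(os.path.realpath(link))  # e.g. /var/lib/docker/containers/<id>/<id>-json.log

# The chain always ends at the file currently being written to; nothing under
# /var/log/containers points at <id>-json.log.1, .2, ..., so a glob on
# /var/log/containers/*.log never matches the rotated files.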
For what it's worth, we are using fluent-bit image 1.7.4; here is my service and input config:
[SERVICE]
Flush 1
Daemon Off
Log_Level debug
Parsers_File parsers.conf
Parsers_File custom_parsers.conf
#
# HTTP server for providing metrics endpoint
#
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port {{ .Values.service.port }}
storage.path /var/log/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem.limit 20M
[INPUT]
Name tail
Path "/var/log/containers/*_our-app-*.log"
Exclude_Path "fluent*.log"
Parser docker
Tag_Regex (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
Tag ourcompany.k8s.<namespace_name>.<container_name>.<pod_name>.<docker_id>-
Refresh_Interval 5
Mem_Buf_Limit 500MB
Skip_Long_Lines On
DB /tmp/tail-containers-state.db
DB.Sync Normal
Docker_Mode On
storage.type filesystem
Rotate_Wait 15
Hi, is there any specific solution for this issue? I am also facing a similar issue, where fluentbit is scraping logs from a mounted NFS dir. If I restart fluentbit, it will tail the logs again, but it stops after a few minutes.
Hello
We have the same issue on 1.8.13
@AlessioCasco, for me this was resolved after setting the inotify watcher to false. Check this link for more details:
[INPUT]
Name tail
Inotify_Watcher false
I have almost the same problem. What should the fluent-bit settings be for high loads?
Yes, this happens even for moderate loads (around 400 logs per second). Can anybody help here?
We are also affected by this issue for pods as soon as they start logging over 1k messages per second. As soon as the load calms down, log messages are sent as expected. We use 10MB docker rotation and the following input conf:
[INPUT]
Name tail
Tag kube.<namespace_name>.<pod_name>_<container_name>-<docker_id>
Tag_Regex ^/var/log/containers/(?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 10MB
Skip_Long_Lines On
Refresh_Interval 10
Docker_Mode On
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Not stale
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Not stale
I wonder if it may be related to https://github.com/fluent/fluent-bit/pull/6552
Any chance you could take a look at it @lecaros or give your perspective here? I'll also take a deeper look later. Any chance you can test again @thepythonista to see if it still happens with the latest version?
@pwhelan I don't think this is in a release yet but @thepythonista you can test with the latest version of master
container built overnight: ghcr.io/fluent/fluent-bit/master:x86_64 (don't use in production obviously).
Agree with @patrick-stephens. Please test this with the latest image (this was initially reported in 1.4.2) and share the repro in case you still see the issue.
We are hitting the same issue using fluent-bit 2.0.9 on Kubernetes 1.24.
Originally our configuration was:
[INPUT]
Name tail
Path /var/log/containers/*.log
Tag kube.*
multiline.parser cri
Mem_Buf_Limit 200MB
Skip_Long_Lines Off
DB /var/log/flb-storage/offset.db
Buffer_Max_Size 1M
but we ended up with this, which helps reduce the issue a little:
[INPUT]
Name tail
Path /var/log/containers/*.log
Tag kube.*
multiline.parser cri
Mem_Buf_Limit 200MB
Skip_Long_Lines Off
DB /var/log/flb-storage/offset.db
Buffer_Max_Size 1M
Progress_Check_Interval 1ms
Inotify_Watcher false
Rotate_Wait 15
Refresh_Interval 1
Buffer_Chunk_Size 2k
We made a small test case which might help pin down the issue:
First, we put our cluster node under a little stress to generate some log traffic on it. Then we deployed our control container, which emits a sample log line every 10ms with a unique, incrementing ID. So at the end we could tell exactly which logs were missing.
After the test run finished, we simply checked the contents of both the '0.log' (current) and '0.log.<timestamp>' (last rotated) files on the filesystem for our control container. The '0.log.<timestamp>' file ends with log number "76812" and '0.log' starts with log number "76813". Up to this point everything looks fine; the CRI runtime rotates the logs correctly. But if we compare with Elastic, the records between 76803 and 76825 are missing. It appears that every time the log file is rotated, fluent-bit fails to process some records (at the end of the rotated file and at the beginning of the new one). In our test, this resulted in only about 74,000 of the control container's 80,000 total logs being delivered (6,000 lost).
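A minimal sketch of the control container and the completeness check described above might look like this (the log format and the export file path are assumptions, not our exact setup):
import sys
import time

def generate(total=80000, interval=0.01):
    # Control container: one log line every 10ms with a unique incrementing ID.
    for i in range(1, total + 1):
        print(f"control-log id={i}", flush=True)
        time.sleep(interval)

def find_gaps(path):
    # Point this at an export of the records that actually reached the backend;
    # it prints how many IDs arrived and which ones are missing.
    seen = set()
    with open(path) as f:
        for line in f:
            if "id=" in line:
                seen.add(int(line.rsplit("id=", 1)[1].split()[0]))
    expected = set(range(1, max(seen) + 1)) if seen else set()
    missing = sorted(expected - seen)
    print(f"{len(seen)} delivered, {len(missing)} missing, first gaps: {missing[:20]}")

if __name__ == "__main__":
    generate() if len(sys.argv) == 1 else find_gaps(sys.argv[1])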
2.0.10 was out as well last week which has some further improvements in it.
Threaded input is now possible as well I believe: threaded on
With threaded on we are hitting 7071. By the way, right now we are running on v2.1.2, and so far it looks like there is no improvement; the issue with quick file rotation still seems to be relevant.
Logs may be lost under high load. Fluent Bit relies on inotify to passively receive file-change events, and it only reads a limited number of lines per notification. So under high load the data source keeps appending to the file, Fluent Bit processes only a few lines each time it is notified, and it obviously cannot keep up with the write rate.
Here is a snippet with file sizes and inode numbers, taken with a sleep of 2 minutes (a custom pod which writes logs every 2m) between [A] and [B]:
total 27M --------->[A]
131256242 -rw-r----- 1 root root  15M Aug 10 08:20 0.log.20230810-082034
131256244 -rw-r----- 1 root root 8.8M Aug 10 08:24 0.log
 93354974 drwxr-xr-x 4 root root   44 Aug 10 08:14 ..
131256234 drwxr-xr-x 2 root root   48 Aug 10 08:20 .
Thu Aug 10 08:25:53 UTC 2023
===========sleep 2m===============
total 14M --------->[B]
131256244 -rw-r----- 1 root root  13M Aug 10 08:26 0.log.20230810-082646
131256245 -rw-r--r-- 1 root root 131K Aug 10 08:26 0.log.20230810-082034.gz
131256242 -rw-r----- 1 root root 575K Aug 10 08:26 0.log
 93354974 drwxr-xr-x 4 root root   44 Aug 10 08:14 ..
131256234 drwxr-xr-x 2 root root   80 Aug 10 08:26 .
Thu Aug 10 08:27:53 UTC 2023
[2023/08/10 08:26:46] [debug] [input:tail:tail.0] inode=131256244 rotated IN_MOVE_SELF '/var/log/containers/ehl-logging-cloudwatch-proxy-0_edison-system_aws-boto-client-9cca8a79493ebe0b085f68aaa58a0e32db3df33353d936b4356c5c48e4d8386f.log'
[2023/08/10 08:26:46] [debug] [input:tail:tail.0] inode=131256244 rotated /var/log/containers/ehl-logging-cloudwatch-proxy-0_edison-system_aws-boto-client-9cca8a79493ebe0b085f68aaa58a0e32db3df33353d936b4356c5c48e4d8386f.log -> /var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646
[2023/08/10 08:26:46] [ info] [input:tail:tail.0] inode=131256244 handle rotation(): /var/log/containers/ehl-logging-cloudwatch-proxy-0_edison-system_aws-boto-client-9cca8a79493ebe0b085f68aaa58a0e32db3df33353d936b4356c5c48e4d8386f.log => /var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646
[2023/08/10 08:26:46] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=131256244 watch_fd=98
[2023/08/10 08:26:46] [ info] [input:tail:tail.0] inotify_fs_add(): inode=131256244 watch_fd=99 name=/var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646
[2023/08/10 08:26:54] [debug] [input chunk] update output instances with new chunk size diff=10594
[2023/08/10 08:26:55] [debug] [input:tail:tail.0] inode=131256244 purge rotated file /var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646 (offset=13180488 / size = 13180488)
[2023/08/10 08:26:55] [debug] [input:tail:tail.0] inode=131256244 removing file name /var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646
[2023/08/10 08:26:55] [debug] [input:tail:tail.0] db: file deleted from database: /var/log/pods/edison-system_ehl-logging-cloudwatch-proxy-0_e50a7d92-fc78-40e9-9a09-1a5e246884be/aws-boto-client/0.log.20230810-082646
[2023/08/10 08:26:55] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=131256244 watch_fd=99
[2023/08/10 08:26:55] [debug] [input:tail:tail.0] inode=44146728 events: IN_MODIFY
=================================
From the above logs it is clear that fluentbit removes the inode of the file that was rotated. The logs stop at the exact moment the inode number of the already-archived file is assigned to the new 0.log [B]. If the new 0.log were assigned a fresh inode number after rotation, I would expect fluentbit to pick the file up again and send the logs to the destination.
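To double-check the inode-reuse theory, a small watcher like this could be run against the pod's log directory (the path is a placeholder):
import os
import time

path = "/var/log/pods/<namespace>_<pod>_<uid>/<container>/0.log"  # placeholder

# Prints the inode of 0.log once per second. If, right after a rotation, the new
# 0.log shows up with an inode that Fluent Bit has already purged (see the
# inotify_fs_remove() lines above), that matches the behaviour described here.
while True:
    try:
        print(time.strftime("%H:%M:%S"), os.stat(path).st_ino)
    except FileNotFoundError:
        print(time.strftime("%H:%M:%S"), "0.log missing (mid-rotation?)")
    time.sleep(1)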
Same problem here. In our case, with these settings we've solved the issue:
Buffer_Max_Size 1M
Refresh_Interval 1
Rotate_Wait 15
Inotify_Watcher false
The default value of Buffer_Chunk_Size and Buffer_Max_Size is 32K; if you find logs missing under heavy load, you can try increasing these two parameters.
After going through the Fluent Bit code, I found that these two parameters take effect per monitored file. If their values are too small while the traffic is high, the file read rate cannot catch up with the write rate, and to a certain extent this causes logs to be lost.
So increase these two values according to the traffic rate, so that Fluent Bit can buffer more content per file. Since Fluent Bit sends data out in batches, the larger the buffer, the greater the sending capacity.
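As an illustration, raising both values on the tail input looks like this; the 256k figure is only an example and should be sized to the observed traffic rate:
[INPUT]
Name tail
Path /var/log/containers/*.log
Buffer_Chunk_Size 256k
Buffer_Max_Size 256k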
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Not stale
Doesn't anyone think this is basic functionality for fluentbit? Why has this big issue been open for 4 years?
@Aslan-Liu
I think everyone's actual environment is different, and the reasons for log loss may also differ. The loss problem occurred in my environment before as well; after I increased Buffer_Chunk_Size and Buffer_Max_Size, log loss almost never occurs any more. You can try changing these two values and then verify.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Not stale
I've verified with our partners (cloud providers) and they are not seeing this issue even at high load. Does anyone have a reproducible case?
We have an IIS server producing around 40MB of log every hour, which rotates once every hour. This makes fluent-bit 3.1.3 stop logging. The only thing we are seeing in the fluent-bit logs are the following messages:
[2024/08/30 08:57:39] [ warn] [engine] failed to flush chunk '9156-1724942194.113089100.flb', retry in 629 seconds: task_id=511, input=tail.0 > output=forward.0 (out_id=0)
[2024/08/30 08:57:39] [ warn] [engine] failed to flush chunk '9156-1724941916.543275700.flb', retry in 1604 seconds: task_id=337, input=tail.0 > output=forward.0 (out_id=0)
[2024/08/30 08:57:40] [error] [output:forward:forward.0] could not write forward header
[2024/08/30 08:57:40] [error] [output:forward:forward.0] could not write forward header
[2024/08/30 08:57:40] [ warn] [engine] failed to flush chunk '9156-1724943257.254868000.flb', retry in 1981 seconds: task_id=1545, input=tail.0 > output=forward.0 (out_id=0)
[2024/08/30 08:57:40] [ warn] [engine] failed to flush chunk '9156-1724941699.294542600.flb', retry in 557 seconds: task_id=123, input=tail.0 > output=forward.0 (out_id=0)
[2024/08/30 08:57:41] [error] [output:forward:forward.0] could not write forward header
[2024/08/30 08:57:41] [ warn] [engine] failed to flush chunk '9156-1724942745.232207800.flb', retry in 1296 seconds: task_id=1042, input=tail.0 > output=forward.0 (out_id=0)
[2024/08/30 08:57:42] [error] [output:forward:forward.0] could not write forward header
[2024/08/30 08:57:42] [ warn] [engine] failed to flush chunk '9156-1724943111.488111100.flb', retry in 1172 seconds: task_id=1400, input=tail.0 > output=forward.0 (out_id=0)
These are the only messages we have to go on, and we see nothing else. We also tried several tuning changes, but with no luck.
Fluent Bit 3.1.3 (on Windows) uses the forward protocol to send messages to a fluentd instance running on Kubernetes. Let me know if you need more info regarding this.
Hi @jarmd, can you share repro steps? Also, I'd expect to see some other errors previous to the ones you shared in that log excerpt.
Hi, sure thing. I have shared the fluentbit config and the fluentbit logs as attached files here. It's actually quite easy to reproduce.
Unfortunately I'm not allowed to share the log files generated by the DispatchAPI on a public site, due to company security rules. But the log files include the status codes of the DispatchAPI calls, stack traces of errors, etc., the kind of logs an IIS server typically generates.
In the currently running fluentbit setup we have changed the configuration to:
Mem_Buf_Limit 100MB
Ignore_Older 2h
Refresh_Interval 1
Buffer_Max_Size 40m
Buffer_Chunk_Size 40m
These values were previously configured to:
Mem_Buf_Limit 1000MB
Refresh_Interval 10
Ignore_Older 90d
Buffer_Chunk_Size 1Mb
Buffer_Max_Size 5Mb
For now it has been running for 24 hours with no interruptions. But let's see what happens; sometimes it takes a few days before it stops sending data.
Attached fluentbit conf file and log file: fluent-bit-win-service-config.err 1.log fluent-bit-updated.conf.zip
Best regards Jan
After 4-6 hours of running, the fluentbit logs start complaining about:
[2024/08/30 08:57:42] [error] [output:forward:forward.0] could not write forward header
And then it keeps spamming. Logs still flow, but the question is whether more tuning is needed on the fluentbit side or on the fluentd side. I see no error logs on the fluentd side. I have attached the latest log file from fluentbit, where we can see when the issue started and how it keeps spamming the fluentbit logs: fluent-bit-win-service-config_03-09.err.zip
We configured fluentbit to generate debug logs. We are seeing the following on the newest version, 3.1.6:
[2024/09/11 06:53:35] [debug] [upstream] KA connection #2596 to 10.76.5.60:24224 has been assigned (recycled)
[2024/09/11 06:53:35] [error] [output:forward:forward.0] could not write forward header
[2024/09/11 06:53:35] [debug] [upstream] KA connection #2596 to 10.76.5.60:24224 is now available
I added a snippet of the logs showing this. fluent-bit-win-service-config.err.log-debug.zip
Any idea how to fix this error message?
Bug Report
Describe the bug
The tail input plugin is not able to tail files when file rotation happens. I used a container that generates 1,000,000 lines and logs them to stdout, and configured fluent-bit to tail the log files and print them to standard output. The tail plugin is only able to get ~200,000 lines (only from the first file). The plugin is not able to fetch lines after rotation happens under heavy load.
To Reproduce
Expected behavior
Generate 1,000,000 log lines, write them to standard output, and expect fluent-bit to tail and fetch all 1,000,000 lines.
Your Environment
Additional context
fluent-bit configuration
Metrics from one of the runs
The plugin is not able to keep up if the fluentbit_input_bytes_total rate goes beyond 100 Kb per second. I used the query rate(fluentbit_input_bytes_total{name="tail.0"}[5m]).
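For reference, the same query can be turned into a simple threshold check against the rate at which problems were observed (assuming the ~100 Kb/s figure above refers to kilobytes):
rate(fluentbit_input_bytes_total{name="tail.0"}[5m]) > 100 * 1024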