any news?
Hi @panaji, could you try the latest version of Fluent Bit, 1.9.7?
@LatorreDev, do you have the commit/feature that may fix this issue? I think the error itself is legit; the concern is the spam that grows exponentially if we also tail Fluent Bit itself.
@LatorreDev yes, this behavior persists in version 1.9.7.
I believe I'm experiencing the same problem in 1.9.8:
```
$ head -n20 dump
Fluent Bit v1.9.8
[2022/11/03 04:06:55] [ info] [fluent bit] version=1.9.8, commit=, pid=418
[2022/11/03 04:06:55] [ info] [storage] version=1.2.0, type=memory+filesystem, sync=normal, checksum=enabled, max_chunks_up=128
[2022/11/03 04:06:55] [ info] [storage] backlog input plugin: storage_backlog.6
[2022/11/03 04:06:55] [ info] [cmetrics] version=0.3.6
[2022/11/03 04:06:55] [ warn] [input:thermal:thermal.3] thermal device file not found
[2022/11/03 04:06:55] [ info] [input:systemd:systemd.5] seek_cursor=s=eed386c868434a799bd4c3b033621892;i=51d... OK
[2022/11/03 04:06:55] [ info] [input:storage_backlog:storage_backlog.6] queue memory limit: 4.8M
[2022/11/03 04:06:55] [ info] [output:file:file.0] worker #0 started
[2022/11/03 04:06:55] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/11/03 04:06:55] [ info] [sp] stream processor started
[2022/11/03 04:06:55] [ warn] [input] emitter.7 paused (mem buf overlimit)
[2022/11/03 04:06:55] [error] [input:emitter:at_escaper] error registering chunk with tag: systemd.user_at_375336.service
[2022/11/03 04:06:55] [error] [input:emitter:at_escaper] error registering chunk with tag: systemd.user_at_375336.service
```
I'm launching the FLB process inside a container of its own. However, I'm not seeing much usage of the on-disk buffer:
$ ls -l /var/lib/fluent-bit-storage/journal_systemd_db
-rw-r--r-- 1 nobody nogroup 8192 Nov 2 23:57 /var/lib/fluent-bit-storage/journal_systemd_db
After debugging last night, I have not solved the issue, but am quite confident it's from a misbehaving rewrite_tag filter. It appears to either never match, OR match all records, in which case it produces an infinite loop. There is definitely a bug in (my understanding of rewrite_tag + its implementation).
FWIW I'm not even sure this error message (which continues to haunt me) is even an error. It gets spammed at me for 13 seconds at startup, then it goes silent and the system "recovers". I think FLB is maybe overallocating the number of workers and running out of chunk space? I don't really know, but now it's very hard for me to tell whether I'm losing data or not. Also, systemd isn't even producing much data because I have tail_from_logs on. I would have thought that's a slow rate, but the problem does not exist for just the DUMMY input.
This error took down a service today. I was running fluent-bit 2.0.4; the logs were full of "error registering chunk with tag" errors, and fluent-bit consumed nearly 100% CPU across all cores. We had to stop/restart fluent-bit as a temporary fix. I counted fluent-bit spamming its own logs about 50x per second.
I suspect it's entering an infinite loop: it has trouble sending all the logs to AWS CloudWatch, which causes it to back up and generate more error logging, which adds to the queue of logs bound for CloudWatch, which increases the size of the backlog, causing more self-spamming of the logs, completing a loop that starts a death spiral.
Regardless of the other aspects of this, sometimes Fluent Bit spams its own logs with the exact same log entry more than 10 times in the same second, and that's a problem on its own. Other logging systems will just switch to a summary, like "last log entry repeated 20 times".
Also, the log entry of "error registering chunk with tag" doesn't seem like the real root cause, because the filters work fine before fluent-bit enters a death spiral.
A couple of thoughts here:
@agup006, I couldn't repro this, but a few people seem to have the same problem. If we could have a feature where Fluent Bit does not generate the same error message more than n times per sec/min, we could probably investigate further. Right now, the logs are just filled with this message.
It took several days of normal functioning for the error to appear for me, so I'm not confident I can offer an isolated repro, as I'm not certain of the trigger. I'm only certain that log-spamming the same error repeatedly wasn't helping and might have been part of triggering a logging death spiral. One workaround I could try is to quit forwarding logs for the fluent-bit process itself to AWS CloudWatch, which would also prevent fluent-bit's own log volume from contributing to overloading the connection to a remote server. Now that I've experienced this, that seems like a safer design: it avoids creating a bad loop any time an overloaded remote service causes fluent-bit to generate even more logging destined for that same service.
I dug through my logs some more to see if I could find what Fluent Bit was logging just before the death spiral log spam sequence happened, in case the preceding logs were interesting. Here are the log entries I found before the repetitive log spam:
[ warn] [input] emitter.1 paused (mem buf overlimit)
[error] [tls] error: error:00000005:lib(0):func(0):DH lib"}
[error] [build/src/CMakeFiles/fluent-bit-static.dir/compiler_depend.ts:1199 errno=25] Inappropriate ioctl for device"}
(the order there isn't necessarily chronological)
Around the same time, an ill-configured test app was dumping a massive amount of debugging data to the logs, and Fluent Bit was trying to process that.
Does that help?
@vwbusguy recognized the death spiral description as being caused by Fluent Bit both writing to the systemd journal and also reading from it, which happens to be my configuration. In practice, that configuration works most of the time, but once Fluent Bit starts spamming its own logs, at some point a threshold is crossed and Fluent Bit can't keep up with processing as input the logs it's generating as output.
So I'll update my configuration to break that loop as a workaround, while this issue can stay focused on preventing Fluent Bit from generating the same log entries repeatedly.
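A minimal sketch of the shape of that change, assuming the input side is the systemd plugin (the unit names here are placeholders): restrict the input to an explicit allow-list of units so whatever Fluent Bit writes to the journal is never read back in.

```
[INPUT]
Name systemd
Tag host.*
# Read only explicitly listed units, so the journal entries Fluent Bit
# itself emits are never re-ingested as input
Systemd_Filter _SYSTEMD_UNIT=docker.service
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
```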
It would be good to have a metric exposed when memory limits like Emitter_Mem_Buf_Limit are reached, so people can alert or act on that.
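In the meantime, the built-in HTTP server (visible listening on port 2020 in the startup logs above) already exposes per-plugin counters and chunk/storage state that can serve as a rough proxy. A sketch of polling it; this assumes the HTTP server is enabled and, for the storage endpoint, that storage.metrics is on:

```
# Per-plugin counters in Prometheus format (records, bytes, errors, retries)
curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus

# Chunk and filesystem-buffer state, useful for spotting a growing backlog
curl -s http://127.0.0.1:2020/api/v1/storage
```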
I've been hitting this issue as well in k8s, where the fluent-bit pod frequently gets OOM-killed and crashes.
On a 30-node cluster, with fluent-bit deployed as a daemonset, there was only one fluent-bit pod repeatedly crashing. That pod was on the same node as a very log-spammy pod. To confirm it was the spammy pod, I excluded it and the fluent-bit pod from the Path in the tail config, roughly as sketched below. That stopped the crashing, but that's hardly a fix, as I want those logs as well.
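The exclusion looked like this (a sketch; the spammy pod's file pattern is a placeholder):

```
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
# Skip the log-spammy pod and fluent-bit itself to stop the feedback loop
Exclude_Path /var/log/containers/log-spammer*.log,/var/log/containers/fluent-bit*.log
```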
I managed to replicate the errors in a testing environment. Then I got the error logs right away, by the tens of thousands:
[2022/11/24 16:45:29] [error] [input:emitter:emitter_for_rewrite_tag.1] error registering chunk with tag: kubernetes.log-spammer
[2022/11/24 16:45:29] [error] [input:emitter:emitter_for_rewrite_tag.1] error registering chunk with tag: kubernetes.log-spammer
[2022/11/24 16:45:29] [error] [input:emitter:emitter_for_rewrite_tag.1] error registering chunk with tag: kubernetes.fluent-bit-forwarder
[2022/11/24 16:45:29] [error] [input:emitter:emitter_for_rewrite_tag.1] error registering chunk with tag: kubernetes.fluent-bit-forwarder
Although I didn't manage to crash it in the testing environment, I was able to reproduce the error message.
The problem I had on the large k8s cluster seems to have fixed itself once the components got redeployed and there wasn't a massive backlog of logs for the fluent-bit pods to process.
Hope this can help anyone else.
@agup006 Sorry for the ping; not sure if the team noticed that there is a potential way to reproduce the issue above.
I think tailing from a large enough file might reproduce the issue or at least the error messages, in case devs don't have a k8s cluster to test it on.
Let me know if there's any more info I can provide.
We ran into the same issue with systemd multiline parsing using 1.9.10:
[2023/01/17 15:07:45] [error] [input:emitter:emitter_for_systemd_linkerd_multiline] error registering chunk with tag: development.linkerd
Perhaps caused by a large batch of unsent logs due to an output misconfiguration. It is extremely concerning that Fluent Bit fails in this manner.
Hey folks, an update here: we have merged a "Log Suppression feature" (https://github.com/fluent/fluent-bit/pull/6435) which should be released soon. This should help with errors that continue to show up.
I built from master and deployed to a k8s cluster, and it did not seem to have any effect on the "error registering chunk with tag" error.
@stebbib the PR referenced is not for that specific log error; it is a more general fix to stop any spamming error message once configured. Were you able to add the configuration referenced in the PR?
Yes, added it to the output config I had and that seemed to work, at least without complaint from Fluent-Bit 😄
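For anyone else trying this, the relevant addition was a single line per output, along these lines (a sketch; the interval value is just an example):

```
[OUTPUT]
Name es
Match *
# Suppress identical repeated messages from this output within the interval
log_suppress_interval 10s
```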
We saw this when the es output host was taken down intentionally to test the disk-based buffering. The storage.total_limit_size is 1G, but once this spamming is triggered, disk usage just keeps growing past the limit. We are very concerned about the potential for this type of disruption in our production clusters. In the meantime, we will test the new Log_Suppress_Interval setting as a workaround.
# fluent-bit version
Fluent Bit v1.9.10
# du -h /var/log/
...
46G /var/log/flb-storage/tail.0
46G /var/log/flb-storage
46G /var/log/
# cat fluent-bit/etc/fluent-bit.conf
[SERVICE]
...
flush 1
storage.path /var/log/flb-storage/
storage.sync normal
storage.checksum off
storage.max_chunks_up 128
storage.backlog.mem_limit 12M
Parsers_File /fluent-bit/parsers/parsers.conf
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
DB /var/log/flb_kube.db
Parser docker
Docker_Mode On
Mem_Buf_Limit 12MB
Skip_Long_Lines On
Refresh_Interval 10
# https://docs.fluentbit.io/manual/pipeline/inputs/tail
storage.type filesystem
Rotate_Wait 5
Ignore_Older 1h
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc.cluster.local:443
Merge_Log On
Merge_Log_Key data
Keep_Log On
K8S-Logging.Parser On
K8S-Logging.Exclude On
Buffer_Size 32k
[FILTER]
Name modify
Match *
# add the cluster name to all records
Add cluster_name some-cluster
[FILTER]
Name rewrite_tag
Match kube.*
# prepend with the log type when value exists
Rule $kubernetes['labels']['logType'] ^([a-zA-Z0-9]+)$ $0.$TAG false
# prepend with "k8s" as the fallback rule so the tag always have the same number of parts
Rule $kubernetes['namespace_name'] .* k8s.$TAG false
[OUTPUT]
Name es
Match *
AWS_Region us-west-2
AWS_Auth On
Host some-host
Port 9200
Retry_Limit 6
Replace_Dots On
Trace_Error true
Logstash_Format on
# takes precedence when value present
Logstash_Prefix_Key $kubernetes['labels']['logType']
# fallback when "Logstash_Prefix_Key" value not present
Logstash_Prefix k8s
storage.total_limit_size 1G
Suppress_Type_Name On
Retry_Limit 2
[OUTPUT]
Name s3
Match *
bucket some-bucket
region us-west-2
upload_timeout 10m
retry_limit 2
compression gzip
store_dir_limit_size 1G
Hi all, I confirm what @stebbib stated; it seems like it doesn't have any effect on the "error registering chunk with tag" log.
I’m going to be evaluating Vector instead. This is a critical issue.
@markstos @terencebor @gz243 did you try the log suppression feature mentioned above?
@agup006 correct me if I'm wrong: the log suppression feature works only for output plugins. Unlike fluentd, where rewrite-tag is an output plugin, the rewrite-tag filter in fluent-bit is a filter plugin, and as such can't use the log suppression feature.
The 2.x upgrade also deleted the configuration file: https://github.com/fluent/fluent-bit/issues/6377
Finding two critical issues during my evaluation was all I needed to lose faith and move on. I need my logging solution to be rock solid, not delete its configuration file and crash the system.
@markstos Thanks for assisting in that issue, and understand we all have to keep our services up and running :). All the best and appreciate your contributions
> @agup006 correct me if I'm wrong: the log suppression feature works only for output plugins. Unlike fluentd, where rewrite-tag is an output plugin, the rewrite-tag filter in fluent-bit is a filter plugin, and as such can't use the log suppression feature.
This should be specifically for the log files that Fluent Bit generates itself, adding @lecaros @RicardoAAD who might have some sample config we can leverage to set it up
@agup006, as others have reported (1, 2), I continue to see the error spammed after adding log_suppress_interval to all my output plugins:
[OUTPUT]
Name es
log_suppress_interval 10s
...
[2023/03/15 16:09:58] [ info] [fluent bit] version=2.0.9, commit=4c0ca4fc5f, pid=1
[2023/03/15 16:20:34] [error] [input:emitter:emitter_for_rewrite_tag.2] error registering chunk with tag: k8s.kube.var.log.containers.fluent-bit-6jdm9_logging_fluent-bit-ce93fba6f71656419554b66865b71e38a194e3a70657e2c357a2a84b294b8de3.log
[2023/03/15 16:20:34] [error] [input:emitter:emitter_for_rewrite_tag.2] error registering chunk with tag: k8s.kube.var.log.containers.fluent-bit-6jdm9_logging_fluent-bit-ce93fba6f71656419554b66865b71e38a194e3a70657e2c357a2a84b294b8de3.log
[2023/03/15 16:20:34] [error] [input:emitter:emitter_for_rewrite_tag.2] error registering chunk with tag: k8s.kube.var.log.containers.fluent-bit-6jdm9_logging_fluent-bit-ce93fba6f71656419554b66865b71e38a194e3a70657e2c357a2a84b294b8de3.log
Additional details:
- Without the rewrite_tag filter, the error goes away.
- storage.total_limit_size is not respected.

I concur with @stebbib's comment that the log suppression, as it stands now, does not apply to filters (only outputs). This is further corroborated by this open issue as well as this comment.
It remains unclear to me why this error is triggered in the first place. The log suppression is a handy workaround when it works. But another serious concern for me is why storage.total_limit_size is not respected, irrespective of this error spam.
Hello @agup006, as @stebbib has mentioned, this option only acts on messages from output plugins that look similar within an interval of time.
We have created a public FR https://github.com/fluent/fluent-bit/issues/6873 to extend this functionality to input plugins, and other fluent-bit components such as storage, engine, etc.
@RicardoAAD we are seeing the same issue as @gz243. The log suppression at the output level doesn't seem to be respected in 2.0.9. In which version is this available?
Also, as mentioned here already, the issue is in the rewrite_tag emitter. Is there any update on why it keeps erroring?
@RicardoAAD, @agup006, @lecaros Just tested again in 2.1.2 and the suppressor doesn't seem to work.
Config:
[OUTPUT]
Name es
Match custom.*
Host ${ES_HOST}
Port 443
HTTP_User ${ES_USER}
HTTP_Passwd ${ES_PWD}
Index ${ES_TELEMETRY_INDEX}
Logstash_Format Off
Retry_Limit 10
Suppress_Type_Name On
Replace_Dots On
Trace_Error On
Trace_Output Off
Tls On
Id_Key TelemetryId
log_suppress_interval 10s
Output
What does this error, "error registering chunk with tag", mean in the first place? I'm seeing this error only when Fluent Bit is not scaling well. @panaji how did you manage to get rid of this error?
@PettitWesley would you know anything about this ?
Given that log_suppress_interval doesn't seem to work at the output level, as shown by multiple instances here, the only way I found so far to at least avoid flooding the backend is:
[FILTER]
Name grep
Match fluent.*
Exclude log input:emitter:emitter_for_rewrite_tag
I've seen this happen when the emitter_mem_buf_limit is exceeded
I had the same problem; it occurred when emitter_mem_buf_limit was exceeded. I've changed:
[FILTER]
Name rewrite_tag
buffer On
emitter_mem_buf_limit 128MB
Also, add workers to the output:
[OUTPUT]
Name es
Workers 2
@PettitWesley could you also help out with how we arrive at an optimal value for the emitter_mem_buf_limit parameter, depending on input and output byte rates? There is very little about this in the documentation. Should it be equal to my max chunk size?
@srikanth-burra I have an estimation calculation here: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention
Hey @PettitWesley sorry for the wrong context.
@srikanth-burra the doc I shared explains how to limit memory in general, and can be used as a reference outside of FireLens. Also emitter_mem_buf_limit and normal mem_buf_limit are the same thing.
What is the default mem_buf_limit for the tail input plugin?
@grv180290 there is no default for mem_buf_limit, it defaults to unlimited. I recommend this guide to understand memory settings: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention
@PettitWesley Then, if the Docker container memory is set to a maximum of 4G:
There are a total of 2 inputs, both with mem_buf_limit set to 512MB.
I also gave an emitter_mem_buf_limit of 3G to the rewrite_tag filter that references one of the inputs.
So should I set the maximum memory to 8G? Or can I just calculate the estimate with the inputs' mem_buf_limit and set the maximum to 2G?
Total Max Memory Usage <= 2 * SUM(Each input Mem_Buf_Limit)
I referred to this.
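Working the numbers both ways (treating the emitter buffer as just another input buffer is my assumption, since the emitter is internally an input plugin): counting only the two tail inputs gives 2 * (512MB + 512MB) = 2GB, while also counting the 3G emitter gives 2 * (512MB + 512MB + 3GB) = 8GB. So my question is really whether the emitter's buffer belongs in that SUM.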
Adjusting the emitter_mem_buf_limit worked in my case.
How does one go about estimating a precise value for emitter_mem_buf_limit? Unlike the emitter, mem_buf_limit and chunk size have corresponding Prometheus metrics that can help estimate their values. Is there something similar for the emitter? If not, is something like this possible?
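One observation that might partially answer this: the emitter registers as an input instance (it appears in the error logs as input:emitter:emitter_for_rewrite_tag.1), so presumably the generic per-input counters cover it too. A sketch of checking, assuming the built-in HTTP server is enabled on its default port 2020:

```
# Look for the emitter instance among the per-input metrics
curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus | grep emitter
```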
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
Bug Report
I see these errors in the aggregator a few seconds after it starts. Usually, I see this error after reaching Emitter_Mem_Buf_Limit.
Our forwarder also tails Fluent Bit's own logs, which exacerbates the problem: these errors get tailed and re-forwarded to the aggregator, which then generates the same error again, and the cycle continues. I don't see how to repro this, but is there any feature to suppress error messages, for example, don't generate this error more than x times per minute?
Expected behavior
Fluent Bit should not generate the same error message more than n times per sec/min.