Closed: staniondaniel closed this issue 6 months ago.
Same issue. I generate test logs at about 10 MB/s, and Fluent Bit logs a bunch of error messages.
Is there any way to work around this? @edsiper
Hi, I think this issue may have been introduced by PR https://github.com/fluent/fluent-bit/pull/4383. When I downgrade Fluent Bit to 1.8.11, it works pretty well.
We are experiencing exactly the same issue, with the main problem being that we actually need the functionality of the previously mentioned PR for the multiline logs.
We use the amazon/aws-for-fluent-bit image and run it next to our own containers using the built-in FireLens integration.
[2022/03/07 10:10:52] [error] [input:emitter:emitter_for_multiline.2] error registering chunk with tag: xxx-firelens-GUID
The container has 50 MB of allocated memory, which used to be more than enough, but now it uses up that memory within 3 days and starts just returning errors.
@PettitWesley could you have a look if there are any issues in your implementation or if there are any workarounds?
I have added this on my personal backlog, which unfortunately right now is very long.
My recommendation is to take a look at past issues involving the rewrite_tag filter; with the new code, the multiline filter with buffer is very similar to rewrite_tag. The same issue should have arisen with that filter as well.
So when I send multiline logs at a very high throughput, all I get is this:
[2022/04/04 03:38:14] [ warn] [input] emitter.1 paused (mem buf overlimit)
Which can be fixed by increasing the mem_buf_limit for the emitter:
[FILTER]
name multiline
match *
multiline.key_content log
multiline.parser go, multiline-regex-test
emitter_mem_buf_limit 1G
Please see: https://github.com/fluent/fluent-bit/issues/5235
If you have more than one multiline filter definition and they match the same records, it can cause all sorts of trouble.
So I want to direct everyone to this issue again: https://github.com/fluent/fluent-bit/issues/5235
It is not clear to me that there is a widespread use case that truly requires multiple filter definitions for the same logs. Please post in that issue if you do not understand or you think you have a use case that requires multiple filter definitions that match the same logs.
It is not clear to me what you mean by multiple filter definitions. Do you mean more of these here:
multiline.parser java,go,python
or multiple configs like the above in different sections:
[INPUT]
multiline.parser docker,cri
[FILTER]
multiline.parser java,go,python
I am using the above config because if I try to add the line to the INPUT section like this:
[INPUT]
multiline.parser java,go,python
it simply doesn't work and the log lines are not merged.
So this is why I am doing it in 2 stages:
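(A minimal sketch of that two-stage layout, combining the INPUT and FILTER snippets quoted above; the path and tag values are illustrative assumptions:)
[INPUT]
name tail
# illustrative path and tag; the first stage joins the container-runtime split lines
path /var/log/containers/*.log
tag kube.*
multiline.parser docker, cri
[FILTER]
name multiline
match kube.*
# the second stage joins language-level multiline events such as stack traces
multiline.key_content log
multiline.parser java, go, python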
@staniondaniel Multiple filters means what I show in the issue, like this:
[FILTER]
name multiline
match *
multiline.key_content log
multiline.parser go
[FILTER]
name multiline
match *
multiline.key_content log
multiline.parser multiline-regex-test
Instead, the CORRECT way is what you show: to have multiple parsers in a comma-separated list.
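(A minimal sketch of that correct form, reusing the two parsers from the snippets above in one filter definition with a comma-separated multiline.parser list:)
[FILTER]
name multiline
match *
multiline.key_content log
# a single filter definition, with the parsers listed together instead of two separate filters
multiline.parser go, multiline-regex-test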
It simply doesn't work and log is not merged
That's interesting. In this case, what you are doing is fine; you only have one filter definition. It's fine to use both the filter and tail input multiline.parser keys.
I found there is a new buffer config parameter; setting it to off seems to avoid this issue.
doc: https://docs.fluentbit.io/manual/pipeline/filters/multiline-stacktrace#configuration-parameters
[FILTER]
Name multiline
match kube.*
multiline.key_content log
multiline.parser java
buffer off
@erenming With buffer off, the filter will only work with a few inputs, mainly tail. As noted in the docs.
@PettitWesley Thanks a lot for the reminder. Actually, we only use tail, so I think buffer off will be okay. But I may have found an issue with buffer off these days: multiline does not seem to be working :( This is my sample config:
[SERVICE]
flush 0.5
daemon Off
grace 5
log_level info
parsers_file parsers.conf
http_server On
http_listen 0.0.0.0
http_port 2020
[INPUT]
name tail
path test.log
DB flb_k8s.db
Tag kube.*
read_from_head true
#ignore_older 5m
Buffer_Chunk_Size 1MB
Buffer_Max_Size 10MB
Skip_Long_Lines On
Skip_Empty_Lines On
Refresh_Interval 30
Rotate_Wait 60
DB.locking true
DB.journal_mode WAL
Mem_Buf_Limit 300MB
multiline.parser docker, cri
[FILTER]
Name multiline
match *
multiline.key_content log
multiline.parser java
buffer Off
[OUTPUT]
name stdout
match *
format json
and sample log:
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:25:28.826167874Z"}
{"log":"io.terminus.trantorframework.exception.BusinessException: Schedule.request.param.is.illegal\n","stream":"stdout","time":"2021-08-16T08:24:28.826172954Z"}
{"log":"\u0009at io.terminus.gaaia.schedule.util.ValidateUtil.expectedTrue(ValidateUtil.java:39)\n","stream":"stdout","time":"2021-08-16T08:24:28.82617719Z"}
{"log":"\u0009at io.terminus.gaaia.schedule.func.schedule.validate.ValidateScheduleParamForCalculationFuncImpl.execute(ValidateScheduleParamForCalculationFuncImpl.java:47)\n","stream":"stdout","time":"2021-08-16T08:24:28.826182475Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
Is there any mistake in my configuration?
"Is there any mistake in my configuration?" I believe he meant that you need to use the parser config directly in the INPUT section and not in the [FILTER] section, something like:
[INPUT]
name tail
......
Mem_Buf_Limit 300MB
multiline.parser docker, cri, java
@PettitWesley As an update to the original problem: after updating to the latest Fluent Bit version, I noticed that when using only "multiline.parser docker,cri" in the INPUT section, without any [FILTER] config, like below:
[INPUT]
Name tail
......
multiline.parser docker,cri
the pods no longer crash due to high memory usage.
As soon as I add the [FILTER] config as below:
[FILTER]
name multiline
Emitter_Mem_Buf_Limit 2.4GB
match kube.*
multiline.key_content log
multiline.parser java,go,python
the problem reoccurs and the memory slowly rises until it hits the buffer limit, which causes the logs to be flooded with:
[input:emitter:emitter_for_multiline.0] error registering chunk with tag:
or it gets killed by Kubernetes when the pod's memory usage hits the limit set for it.
From what I noticed, this is simply an issue of the multiline filter not being able to process the data fast enough, or not releasing the buffer once the data is processed, which causes high CPU usage and eventually leads to a crash.
@PettitWesley We have been facing the same issue with aws-for-fluent-bit:2.25.0. Our custom Fluent Bit configuration is below.
[SERVICE]
flush 1
storage.path /var/log/app-storage/
storage.sync normal
[INPUT]
Name forward
unix_path /var/run/fluent.sock
Mem_Buf_Limit 50MB
storage.type filesystem
[FILTER]
name multiline
match *
multiline.key_content log
multiline.parser docker,cri,java
Emitter_Mem_Buf_Limit 50M
emitter_storage.type filesystem
Firelens configuration
logConfiguration = {
logDriver = "awsfirelens"
options = {
"Name": "datadog",
"apikey": "
The logs do not show any error; it just fails with a segmentation fault.
| timestamp | message |
|---|---|
1656096324683 | [2022/06/24 18:45:24] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096325650 | [2022/06/24 18:45:25] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096326704 | [2022/06/24 18:45:26] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096329689 | [2022/06/24 18:45:29] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096331692 | [2022/06/24 18:45:31] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096332695 | [2022/06/24 18:45:32] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096333726 | [2022/06/24 18:45:33] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096334685 | [2022/06/24 18:45:34] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096336749 | [2022/06/24 18:45:36] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096338648 | [2022/06/24 18:45:38] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096344687 | [2022/06/24 18:45:44] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096346709 | [2022/06/24 18:45:46] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096347677 | [2022/06/24 18:45:47] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096348703 | [2022/06/24 18:45:48] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096349674 | [2022/06/24 18:45:49] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096350678 | [2022/06/24 18:45:50] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096351692 | [2022/06/24 18:45:51] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096352721 | [2022/06/24 18:45:52] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096353671 | [2022/06/24 18:45:53] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096354685 | [2022/06/24 18:45:54] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096355694 | [2022/06/24 18:45:55] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096361716 | [2022/06/24 18:46:01] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096363714 | [2022/06/24 18:46:03] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096364712 | [2022/06/24 18:46:04] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096365644 | [2022/06/24 18:46:05] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096368704 | [2022/06/24 18:46:08] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096369701 | [2022/06/24 18:46:09] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096372686 | [2022/06/24 18:46:12] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096375691 | [2022/06/24 18:46:15] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096376685 | [2022/06/24 18:46:16] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096378678 | [2022/06/24 18:46:18] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096380698 | [2022/06/24 18:46:20] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096383700 | [2022/06/24 18:46:23] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096384679 | [2022/06/24 18:46:24] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096386694 | [2022/06/24 18:46:26] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096388690 | [2022/06/24 18:46:28] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096390698 | [2022/06/24 18:46:30] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096391696 | [2022/06/24 18:46:31] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096392705 | [2022/06/24 18:46:32] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096393696 | [2022/06/24 18:46:33] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096395689 | [2022/06/24 18:46:35] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096396682 | [2022/06/24 18:46:36] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096399692 | [2022/06/24 18:46:39] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096401703 | [2022/06/24 18:46:41] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096403693 | [2022/06/24 18:46:43] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096404708 | [2022/06/24 18:46:44] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096405694 | [2022/06/24 18:46:45] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096406682 | [2022/06/24 18:46:46] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096410688 | [2022/06/24 18:46:50] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096412651 | [2022/06/24 18:46:52] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096413683 | [2022/06/24 18:46:53] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096414671 | [2022/06/24 18:46:54] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096416720 | [2022/06/24 18:46:56] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096418662 | [2022/06/24 18:46:58] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096419663 | [2022/06/24 18:46:59] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096420682 | [2022/06/24 18:47:00] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096421696 | [2022/06/24 18:47:01] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096422655 | [2022/06/24 18:47:02] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096425696 | [2022/06/24 18:47:05] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096428686 | [2022/06/24 18:47:08] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096430710 | [2022/06/24 18:47:10] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096431661 | [2022/06/24 18:47:11] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096434709 | [2022/06/24 18:47:14] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096435655 | [2022/06/24 18:47:15] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096437670 | [2022/06/24 18:47:17] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096438655 | [2022/06/24 18:47:18] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096439675 | [2022/06/24 18:47:19] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096440691 | [2022/06/24 18:47:20] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096441701 | [2022/06/24 18:47:21] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096443747 | [2022/06/24 18:47:23] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096445704 | [2022/06/24 18:47:25] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096449695 | [2022/06/24 18:47:29] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096450679 | [2022/06/24 18:47:30] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096451681 | [2022/06/24 18:47:31] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096452669 | [2022/06/24 18:47:32] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096453645 | [2022/06/24 18:47:33] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096454693 | [2022/06/24 18:47:34] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096455683 | [2022/06/24 18:47:35] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096456675 | [2022/06/24 18:47:36] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096457698 | [2022/06/24 18:47:37] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096459704 | [2022/06/24 18:47:39] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096460664 | [2022/06/24 18:47:40] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096462681 | [2022/06/24 18:47:42] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096463700 | [2022/06/24 18:47:43] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096464691 | [2022/06/24 18:47:44] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096465694 | [2022/06/24 18:47:45] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096466652 | [2022/06/24 18:47:46] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096467653 | [2022/06/24 18:47:47] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096468696 | [2022/06/24 18:47:48] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096469747 | [2022/06/24 18:47:49] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096472717 | [2022/06/24 18:47:52] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096473665 | [2022/06/24 18:47:53] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096474674 | [2022/06/24 18:47:54] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096475721 | [2022/06/24 18:47:55] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096476664 | [2022/06/24 18:47:56] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096477653 | [2022/06/24 18:47:57] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096480673 | [2022/06/24 18:48:00] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096481704 | [2022/06/24 18:48:01] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={} | |
1656096482634 | [2022/06/24 18:48:02] [engine] caught signal (SIGSEGV) | |
1656096483998 | AWS for Fluent Bit Container Image Version 2.25.0 |
Sorry folks, I may not be able to take a look at this immediately.
If someone is willing to try these techniques to get a stack trace for me that would help speed up the fix: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#segfaults-and-crashes-sigsegv
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
Not stale.
unstale
Having the same issue here. The container uses its entire resource request, causing high CPU load.
service: |
[SERVICE]
Daemon Off
Flush 1
Log_Level info
Parsers_File parsers.conf
Parsers_File custom_parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
Health_Check On
inputs: |
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
DB /var/log/flb_kube.db
Mem_Buf_Limit 128MB
Skip_Long_Lines On
Refresh_Interval 10
Buffer_Chunk_Size 256k
Buffer_Max_Size 100MB
multiline.parser docker, cri
filters: |
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log Off
Merge_Log_Key log_processado
Merge_Log_Trim On
Keep_Log Off
K8S-Logging.Parser Off
K8S-Logging.Exclude Off
Labels On
Annotations Off
[FILTER]
Name multiline
Match kube.*
multiline.key_content log
multiline.parser go, java, python
buffer On
emitter_mem_buf_limit 64M
fluentbit 1.9.9
I'm facing the same issue when working with the multiline parser together with the rewrite_tag filter. I'm using version 0.21.6 of the Fluent Bit Helm chart, which uses Fluent Bit 2.0.6.
I'm using the tail input:
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/my-app*.log
multiline.parser docker, cri
Mem_Buf_Limit 10MB
Refresh_Interval 10
And using the rewrite_tag filter like this:
[FILTER]
Name rewrite_tag
Match kube.*
Rule $log .*\"log_audit\":true.*$ audit-logs.$kubernetes_namespace_name false
I can't really use the exclude filter to drop these logs, since I need them; I just want them to go to a different output than the other logs. Any idea how I can solve this issue? Is there any workaround for this?
This is my complete fluentbit configuration:
config:
service: |
[SERVICE]
Daemon Off
Flush {{ .Values.flush }}
Log_Level {{ .Values.logLevel }}
Parsers_File parsers.conf
Parsers_File custom_parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port {{ .Values.metricsPort }}
Health_Check On
inputs: |
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/my-app-server*.log
multiline.parser docker, cri
Mem_Buf_Limit 10MB
Refresh_Interval 10
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/my-other-app*.log
multiline.parser docker, cri
Mem_Buf_Limit 10MB
Refresh_Interval 10
filters: |
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Merge_Log On
Merge_Log_Key data
K8S-Logging.Parser On
K8S-Logging.Exclude On
[FILTER]
Name nest
Match kube.*
Operation lift
Nested_under kubernetes
Add_prefix kubernetes_
[FILTER]
Name nest
Match kube.*
Operation lift
Nested_under kubernetes_labels
Add_prefix kubernetes_labels_
[FILTER]
Name nest
Match kube.*
Operation lift
Nested_under kubernetes_annotations
Add_prefix kubernetes_annotations_
[FILTER]
Name rewrite_tag
Match kube.*
Rule $log .*\"log_audit\":true.*$ audit-logs.$kubernetes_namespace_name false
outputs: |
[OUTPUT]
Name s3
Match audit-logs.*
bucket my-bucket-name
region {{ .Values.region }}
total_file_size 10M
upload_timeout 1m
use_put_object on
compression gzip
auto_retry_requests true
s3_key_format /namespace=$TAG[1]/year=%Y/month=%m/day=%d/hour=%H/%H:%M:%S-$UUID.gz
role_arn {{ .Values.fluentbit_role_arn }}
[OUTPUT]
Name s3
Match kube.*
bucket lakefs-audit-logs-{{ .Values.region }}-{{ .Values.environment }}
region {{ .Values.region }}
total_file_size 10M
upload_timeout 1m
use_put_object on
compression gzip
auto_retry_requests true
s3_key_format_tag_delimiters ._
s3_key_format /namespace=$TAG[5]/year=%Y/month=%m/day=%d/hour=%H/%H:%M:%S-$UUID.gz
role_arn {{ .Values.fluentbit_role_arn }}
[OUTPUT]
Name loki
Match kube.*
Host {{ .Values.loki.endpoint }}
port 443
tls on
tls.verify on
http_user {{ .Values.loki.username }}
http_passwd {{ .Values.loki.password }}
auto_kubernetes_labels true
Labels collector=fluent-bit, namespace=$kubernetes_namespace_name, pod=$kubernetes_pod_name, container=$kubernetes_container_name, Region={{ .Values.region }}
I know it's ugly, but since I really need this working, I set up a cronjob to rollout-restart the fluent-bit pods every day at 03:00 AM.
@PettitWesley Any idea if a similar solution is supported for AWS ECS Fargate, as suggested by @jujubetsz?
I'm currently making some unrelated improvements that involve modifying the multiline core machinery, so if anyone is able to put together a somewhat reliable reproduction (it would be awesome if it was minimized, but understandable if that's not possible) with sample input data, I could look into this.
If you need help putting together the reproduction case, just tag me here or talk to me in Slack.
@rajeev-netomi I don't recommend restarting FLB at a certain time if it's crashing... I'm not sure of a good solution here, but I think you want to restart after it crashes. Or see if you can find an older version which doesn't crash. Sorry!
Yeah @PettitWesley. Basically, marking Fluent Bit as a non-essential container, so that if it crashes we do not stop the entire task and instead just restart the non-essential container. This will be very useful for scenarios where we can tolerate the loss of logging for a short duration.
On the contrary, if you need to stick with the version that crashes, I would recommend preventively restarting the service. Think about it: when the binary crashes it means there probably was memory corruption, and realistically speaking, the invalid access that caused the crash was probably not the first, so the longer you let it go, the better the odds of whatever memory corruption issue there is affecting the integrity of the chunks (the data files that hold the ingested data while in transit).
So, ideally, I would recommend trying to downgrade to a version that doesn't crash, but if you are forced to stick with this one, then I think you are better off actively recycling the pod and retaining control, rather than waiting for it to "take care of itself".
I take it that you don't have a reliable reproduction, right?
@leonardo-albertovich Currently there does not exist a stable build that does not have this issue. We have a very simple requirement: we want to concatenate multiline logs into a single log event, and for that we are using the built-in filters. Below is the configuration file.
[FILTER]
name multiline
match *
multiline.key_content log
multiline.parser java
Dockerfile
FROM public.ecr.aws/aws-observability/aws-for-fluent-bit:2.29.1
ADD java-multi-line.conf /fluent-bit/configs/java-multi-line.conf
This container crashes for applications with a high volume of logs.
For anyone facing similar issues: we tried a lot of different configurations, including increasing memory limits, but it didn't work. In the end, we had to replace Fluent Bit with Fluentd to make our logging work with the multiline parser.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
I have just opened a related issue. I found that we somehow have a regression after upgrading to 2.1.2 from 1.9.7: https://github.com/fluent/fluent-bit/issues/7782
I am wondering if others have noticed a similar issue, and if downgrading to 1.9.x fixes it for you.
Same issue: upgrading to 2.1.2 from 2.0.9 immediately caused high CPU, pod restarts, and "error registering chunk with tag" messages. The issue is gone once I downgrade back to 2.0.9. I have just one multiline [FILTER] with a custom [MULTILINE_PARSER] in the config. "buffer off" eliminates the errors on version 2.1.2, but the multiline filter stops working with "buffer off" configured in [FILTER], despite being used with the tail input.
Hi terence. Thanks for the update. Indeed you are right; I just checked 2.0.9 with my application and CPU is OK. So it seems the regression started after 2.0.9. I'm really surprised no one is looking at this yet. I would say this is a critical bug. https://github.com/fluent/fluent-bit/issues/7782
Tried Fluent Bit v2.1.9; I don't see the "error registering chunk" errors anymore, but I still have the high CPU problem.
Is there any update on the status of this issue? I would say this is pretty critical.
Hello everyone. I still get an error on version 2.1.9. I request assistance in resolving this issue. I've tried various configurations, and the problem occurred after upgrading from 1.8.6 to 2.1.x.
I must say it is strange that no one seems to be looking into this. Anyway, I suggest you either return to your original version for now, or upgrade to 2.0.6. I think this is the last version that doesn't have this issue.
I installed version 2.1.10 but I still have the problem of high CPU consumption. I downgraded to 1.9.10
Hi Luciano. From what I tested, the problem started after 2.0.9, so if you like you can try that version. I wish someone would look into this already.
Hi Ryan, I confirm that the 2.0.9 version doesn't have the CPU issue.
Same here: 2.0.9 works. I tried 2.1.x without success and haven't tried 2.2.x yet.
2.2.2 is still not working
I'm absolutely shocked that no one is handling this issue. Quite worrying; we simply can't upgrade.
Any workaround or update on the above issue?
I took a look at this issue today.
The root problem: the internal in_emitter plugin that the multiline filter uses to emit the joined logs does not respect the FLB_INPUT_PAUSED state. This state is set and checked automatically in flb_input_chunk.c, and those routines will defensively refuse to append to chunks if the plugin is paused. This is not a problem for most input plugins, as they will simply pause and resume their collectors as the overall input instance is paused. That is not the case for the emitter, where any plugin can just append records to it at any time. So the multiline plugin indiscriminately slams the emitter with new logs, causing the emitter plugin to fail to append most of them and drop the logs. Worth noting that this is different from the documented behaviour of the emitter_mem_buf_limit field.
I actually have no idea how to implement this behaviour; I couldn't find a way for a filter to wait for an input plugin to resume without blocking the entire program (thus essentially deadlocking it, since the emitter wouldn't be able to process the logs it holds in order to make room, so it could never resume).
This problem exists for both multiline and rewrite_tag, as their usage of emitter (just adding records directly to it) is the same.
I opened a PR, shown above, that cuts out the middleman and checks whether the emitter is paused before even trying to write to it. This will at least stop the error log spam. Technically this issue is about high CPU, and this PR does not solve that. The multiline processor is very CPU intensive in general, so when it's run as a filter on the main thread it's no wonder it takes a whole CPU. I'm not sure what the solution to that is.
As a user, with the design of this plugin it is basically necessary to take steps to make it resilient to higher throughputs. This is actually much the same as what you'd need to do for any input plugin; it's just a bit odd because you wouldn't think of needing those kinds of strategies on a filter plugin. However, to use this plugin effectively, it is essentially necessary to understand the under-the-hood detail of the emitter also being an input plugin, which can suffer the same problems as any other input plugin.
Here are the solutions I would recommend, which are not dissimilar to what you would do for any input plugin: switch the emitter to filesystem buffering and tune the storage.max_chunks_up amount. These numbers all depend on what you set for the global storage settings in [SERVICE] (side note: it would be great to be able to set these for the emitter). Read more at https://docs.fluentbit.io/manual/administration/buffering-and-storage
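(A rough sketch of that recommended shape, not an authoritative configuration; the emitter_storage.type, emitter_mem_buf_limit, and storage.* keys come from the configs and docs referenced earlier in this thread, and the values are only placeholders:)
[SERVICE]
flush 1
# global filesystem storage settings, which also back the emitter's filesystem buffer
storage.path /var/log/flb-storage/
storage.sync normal
storage.max_chunks_up 128
[FILTER]
name multiline
match *
multiline.key_content log
multiline.parser java
buffer on
# keep the emitter's backlog on disk instead of only in memory
emitter_storage.type filesystem
emitter_mem_buf_limit 50M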
@braydonk I switched the emitter to filesystem buffering but ran into another issue where the Kubernetes Fluent Bit does not recover after a Fluentd restart; chunks were stuck in storage. My setup is fluentbit (2.0.6) -> fluentd (1.16.3) -> ES/Kibana. After restarting Fluent Bit, all the chunks were processed from storage. I enabled debug mode and it looks like flushing is very slow. I tried the scenario below:
I repeated the same steps with one more addition, restarting the Fluent Bit service, and saw that the chunks were processed in less than 10 minutes.
@saurabhgcect This sounds unrelated to this issue. I'd recommend opening a separate issue with config and logs included.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
I believe this should be fixed as of v3.0.3 which has #8473. I'll re-open and close this properly as the issue should be fixed, but if it is not then feel free to re-open with details.
Bug Report
Describe the bug
Hello
The multiline filter is crashing on pods that generate a large amount of logs, after reaching Emitter_Mem_Buf_Limit. On pods with a normal/low volume of logs it works without problems.
To Reproduce
This is my configuration (I left only the relevant parts):
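(The configuration itself was not captured in this extract; judging from the INPUT and FILTER snippets posted earlier in the thread, the relevant parts would have looked roughly like the sketch below, reconstructed rather than copied verbatim:)
[INPUT]
name tail
# illustrative path; container-runtime parsing happens in the input
path /var/log/containers/*.log
tag kube.*
multiline.parser docker, cri
[FILTER]
name multiline
match kube.*
multiline.key_content log
multiline.parser java, go, python
# emitter limit mentioned earlier in the thread
Emitter_Mem_Buf_Limit 2.4GB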
Your Environment
Additional context
The Fluent Bit container keeps crashing after it reaches the memory limit configured for that container. Also, a lot of logs like
[error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag:
are flooding the Fluent Bit logs.