fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Multiline filter crashes after buffer limit is reached causing high cpu and memory usage #4940

Closed staniondaniel closed 6 months ago

staniondaniel commented 2 years ago

Bug Report

Describe the bug

Hello

The multiline filter crashes on pods that generate a large amount of logs once Emitter_Mem_Buf_Limit is reached. On pods with a normal/low volume of logs it works without problems.

To Reproduce

[2022/02/25 15:40:06] [ info] [input:tail:tail.0] inode=4540032 handle rotation(): /var/log/containers/xxxxx.log => /var/lib/docker/containers/xxxxx/xxxxx-json.log.5
[2022/02/25 15:40:06] [ info] [input:tail:tail.0] inotify_fs_add(): inode=4540032 watch_fd=113 name=/var/lib/docker/containers/xxxxx/xxxxx-json.log.5
[2022/02/25 15:40:07] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4592513 watch_fd=111
[2022/02/25 15:40:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=3353782 watch_fd=112
[2022/02/25 15:40:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=4540032 watch_fd=113
[2022/02/25 15:40:27] [ info] [input:tail:tail.0] inode=3353789 handle rotation(): /var/log/containers/xxxxx.log => /var/lib/docker/containers/xxxxx/xxxxx-json.log.7
[2022/02/25 15:40:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=3353789 watch_fd=114 name=/var/lib/docker/containers/xxxxx/xxxxx-json.log.7
[2022/02/25 15:40:27] [ warn] [input] emitter.3 paused (mem buf overlimit)
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log
[2022/02/25 15:40:27] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.xxxxx.log

This is my configuration (I left only the relevant parts):

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        DB /var/log/containers/fluentbit_db.db
        Parser docker
        Tag kube.*
        Mem_Buf_Limit 10MB
        Buffer_Chunk_Size 256k
        Buffer_Max_Size 256k
        Skip_Long_Lines On
        Refresh_Interval 1
        multiline.parser docker,cri
......

    [FILTER]
        name multiline
        Emitter_Mem_Buf_Limit 2.4GB
        match kube.*
        multiline.key_content log
        multiline.parser java,go,python
    [FILTER]
        Name kubernetes
        Buffer_Size 512k
        Match kube.*
        Merge_Log On
        Merge_Log_Key log_json
        Merge_Log_Trim On
        Keep_Log On
        K8S-Logging.Parser On
        K8S-Logging.Exclude On

....

Your Environment

Additional context

The Fluent Bit container keeps crashing after it reaches the memory limit configured for that container. In addition, a lot of logs like

[error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag:

are flooding the Fluent Bit logs.

erenming commented 2 years ago

Same issue. I generate test logs at about 10 MB/s, and fluent-bit logs a bunch of these errors.

Is there any way to work around this? @edsiper
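
A simplified sketch of such a load generator (the file path and log line here are just placeholders, not the exact setup) could be:

    # hypothetical generator: append Docker-style JSON log lines to a tailed file
    # in bursts of ~1 MB, ten times per second, to approximate 10 MB/s
    LOG=/var/log/containers/test.log
    LINE='{"log":"java.lang.RuntimeException: boom\n","stream":"stdout","time":"2022-02-25T15:40:00.000000000Z"}'
    while true; do
      for i in $(seq 1 10000); do printf '%s\n' "$LINE"; done >> "$LOG"
      sleep 0.1
    done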

erenming commented 2 years ago

Hi, I think this issue may have been introduced by PR https://github.com/fluent/fluent-bit/pull/4383. When I downgrade fluent-bit to 1.8.11, it works pretty well.

madmarcinadamski commented 2 years ago

We are experiencing exactly the same issue, and the main problem is that we actually need the functionality of the previously mentioned PR for multiline logs. We use the amazon/aws-for-fluent-bit image and run it next to our own containers using the built-in FireLens integration:

[2022/03/07 10:10:52] [error] [input:emitter:emitter_for_multiline.2] error registering chunk with tag: xxx-firelens-GUID

The container has 50MB of allocated memory, which used to be more than enough, but now it uses up the memory in 3 days and starts just returning errors.

@PettitWesley could you have a look to see if there are any issues in your implementation, or if there are any workarounds?

PettitWesley commented 2 years ago

I have added this on my personal backlog, which unfortunately right now is very long.

My recommendation is to take a look at past issues involving the rewrite_tag filter; with the new code, the multiline filter with buffering is very similar to rewrite_tag. The same issue should have arisen with that filter as well.

PettitWesley commented 2 years ago

So when I send multiline logs at a very high throughput, all I get is this:

[2022/04/04 03:38:14] [ warn] [input] emitter.1 paused (mem buf overlimit)

Which can be fixed by increasing the mem_buf_limit for the emitter:

[FILTER]
     name                  multiline
     match                 *
     multiline.key_content log
     multiline.parser      go, multiline-regex-test

     emitter_mem_buf_limit 1G

PettitWesley commented 2 years ago

Please see: https://github.com/fluent/fluent-bit/issues/5235

If you have more than one multiline filter definition and they match the same records, it can cause all sorts of trouble.

PettitWesley commented 2 years ago

So I want to direct everyone to this issue again: https://github.com/fluent/fluent-bit/issues/5235

It is not clear to me that there is a widespread use case that truly requires multiple filter definitions for the same logs. Please post in that issue if you do not understand or you think you have a use case that requires multiple filter definitions that match the same logs.

staniondaniel commented 2 years ago

> #5235

It is not clear to me what you mean by multiple filter definitions. Do you mean more of these here: multiline.parser java,go,python

or multiple configs like the above in different sections:

[INPUT]

    multiline.parser docker,cri

[FILTER]

    multiline.parser java,go,python

I am using the above config because if I try to add the line to the INPUT section like this:

[INPUT]

    multiline.parser java,go,python

it simply doesn't work and the log is not merged.

So this is why I am doing it in 2 stages.

PettitWesley commented 2 years ago

@staniondaniel Multiple filters means what I show in the issue, like this:

[FILTER]
     name                  multiline
     match                 *
     multiline.key_content log
     multiline.parser      go

[FILTER]
     name                  multiline
     match                 *
     multiline.key_content log
     multiline.parser      multiline-regex-test

Instead, the CORRECT way is what you show, to have multiple parsers in a comma list.

> It simply doesn't work and log is not merged

That's interesting. In this case what you are doing is fine; you only have one filter definition. It's fine to use both the filter and tail input multiline.parser keys.

erenming commented 2 years ago

I found there is a new buffer config parameter; setting it to off seems to avoid this issue. Doc: https://docs.fluentbit.io/manual/pipeline/filters/multiline-stacktrace#configuration-parameters

[FILTER]
    Name                  multiline
    match                 kube.*
    multiline.key_content log
    multiline.parser      java
    buffer                off

PettitWesley commented 2 years ago

@erenming With buffer off, the filter will only work with a few inputs, mainly tail. As noted in the docs.

erenming commented 2 years ago

@PettitWesley Thanks a lot for the reminder. Actually, we only use tail, so I think buffer off will be okay. But over the last few days I may have found an issue with buffer off: multiline does not seem to work :( This is my sample config:

[SERVICE]
    flush        0.5
    daemon       Off
    grace        5
    log_level    info
    parsers_file parsers.conf
    http_server  On
    http_listen  0.0.0.0
    http_port    2020

[INPUT]
    name                 tail
    path                 test.log
    DB                   flb_k8s.db
    Tag                  kube.*
    read_from_head       true
    #ignore_older         5m
    Buffer_Chunk_Size    1MB
    Buffer_Max_Size      10MB
    Skip_Long_Lines      On
    Skip_Empty_Lines     On
    Refresh_Interval     30
    Rotate_Wait          60
    DB.locking           true
    DB.journal_mode      WAL

    Mem_Buf_Limit        300MB
    multiline.parser     docker, cri

[FILTER]
    Name                   multiline
    match                  *
    multiline.key_content  log
    multiline.parser       java
    buffer                 Off

[OUTPUT]
    name stdout
    match *
    format json

and sample log:

{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice        : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:25:28.826167874Z"}
{"log":"io.terminus.trantorframework.exception.BusinessException: Schedule.request.param.is.illegal\n","stream":"stdout","time":"2021-08-16T08:24:28.826172954Z"}
{"log":"\u0009at io.terminus.gaaia.schedule.util.ValidateUtil.expectedTrue(ValidateUtil.java:39)\n","stream":"stdout","time":"2021-08-16T08:24:28.82617719Z"}
{"log":"\u0009at io.terminus.gaaia.schedule.func.schedule.validate.ValidateScheduleParamForCalculationFuncImpl.execute(ValidateScheduleParamForCalculationFuncImpl.java:47)\n","stream":"stdout","time":"2021-08-16T08:24:28.826182475Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice        : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a406-4575af70a449,d5c28df2-32cc-4d7f-ab36-1c41334579c2] - [http-nio-8080-exec-377] i.t.t.exception.TrantorWebAdvice        : handleException error with request info: RequestInfo(path: [], referer: [null])\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"2021-08-16 16:25:28.818 ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}
{"log":"ERROR [gaia-requisition-runtime,ec316a26-0889-47d6-a\n","stream":"stdout","time":"2021-08-16T08:26:28.826167874Z"}

Is there any mistake in my configuration?

staniondaniel commented 2 years ago

> Is there any mistake in my configuration?

I believe he meant that you need to use the parser config directly in the INPUT section and not in the [FILTER] section, something like:

[INPUT]
name                 tail
......
Mem_Buf_Limit        300MB
multiline.parser     docker, cri, java

@PettitWesley as an update to the original problem: after updating to the latest Fluent Bit version I noticed that when using only "multiline.parser docker,cri" in the INPUT section, without any [FILTER] config, like below:

    [INPUT]
        Name tail
......
        multiline.parser docker,cri

the pods don't crash anymore due to high memory usage.

As soon as I add the [FILTER] config as below:

   [FILTER]
        name multiline
        Emitter_Mem_Buf_Limit 2.4GB
        match kube.*
        multiline.key_content log
        multiline.parser java,go,python

the problem reoccurs and the memory slowly rises until it hits the buffer limit, which causes the logs to be flooded with "[input:emitter:emitter_for_multiline.0] error registering chunk with tag:", or the pod gets killed by Kubernetes when its memory usage hits the limit set for it.

From what I noticed, this is simply an issue of the multiline filter not being able to process the data fast enough, or just not releasing the buffer once the data is processed, which causes high CPU usage and eventually leads to a crash.

rajeev-netomi commented 2 years ago

@PettitWesley We have been facing the same issue with aws-for-fluent-bit:2.25.0. Our custom fluentbit configuration is below.

[SERVICE]
    flush        1
    storage.path /var/log/app-storage/
    storage.sync normal

[INPUT]
    Name          forward
    unix_path     /var/run/fluent.sock
    Mem_Buf_Limit 50MB
    storage.type  filesystem

[FILTER]
    name                  multiline
    match                 *
    multiline.key_content log
    multiline.parser      docker,cri,java
    Emitter_Mem_Buf_Limit 50M
    emitter_storage.type  filesystem

Firelens configuration

logConfiguration = {
  logDriver = "awsfirelens"
  options = {
    "Name": "datadog",
    "apikey": "",
    "Host": "http-intake.logs.datadoghq.com",
    "net.keepalive": "false",
    "storage.total_limit_size": "5G",
    "dd_service": "",
    "dd_source": "ecs-fargate-dev",
    "dd_message_key": "",
    "dd_tags": "env:dev",
    "TLS": "on",
    "provider": "ecs"
  }
}

Logs do not show any error. It just fails with a segmentation fault.

timestamp message
1656096324683 [2022/06/24 18:45:24] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096325650 [2022/06/24 18:45:25] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096326704 [2022/06/24 18:45:26] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096329689 [2022/06/24 18:45:29] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096331692 [2022/06/24 18:45:31] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096332695 [2022/06/24 18:45:32] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096333726 [2022/06/24 18:45:33] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096334685 [2022/06/24 18:45:34] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096336749 [2022/06/24 18:45:36] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096338648 [2022/06/24 18:45:38] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096344687 [2022/06/24 18:45:44] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096346709 [2022/06/24 18:45:46] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096347677 [2022/06/24 18:45:47] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096348703 [2022/06/24 18:45:48] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096349674 [2022/06/24 18:45:49] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096350678 [2022/06/24 18:45:50] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096351692 [2022/06/24 18:45:51] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096352721 [2022/06/24 18:45:52] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096353671 [2022/06/24 18:45:53] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096354685 [2022/06/24 18:45:54] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096355694 [2022/06/24 18:45:55] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096361716 [2022/06/24 18:46:01] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096363714 [2022/06/24 18:46:03] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096364712 [2022/06/24 18:46:04] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096365644 [2022/06/24 18:46:05] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096368704 [2022/06/24 18:46:08] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096369701 [2022/06/24 18:46:09] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096372686 [2022/06/24 18:46:12] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096375691 [2022/06/24 18:46:15] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096376685 [2022/06/24 18:46:16] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096378678 [2022/06/24 18:46:18] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096380698 [2022/06/24 18:46:20] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096383700 [2022/06/24 18:46:23] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096384679 [2022/06/24 18:46:24] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096386694 [2022/06/24 18:46:26] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096388690 [2022/06/24 18:46:28] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096390698 [2022/06/24 18:46:30] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096391696 [2022/06/24 18:46:31] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096392705 [2022/06/24 18:46:32] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096393696 [2022/06/24 18:46:33] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096395689 [2022/06/24 18:46:35] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096396682 [2022/06/24 18:46:36] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096399692 [2022/06/24 18:46:39] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096401703 [2022/06/24 18:46:41] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096403693 [2022/06/24 18:46:43] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096404708 [2022/06/24 18:46:44] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096405694 [2022/06/24 18:46:45] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096406682 [2022/06/24 18:46:46] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096410688 [2022/06/24 18:46:50] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096412651 [2022/06/24 18:46:52] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096413683 [2022/06/24 18:46:53] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096414671 [2022/06/24 18:46:54] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096416720 [2022/06/24 18:46:56] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096418662 [2022/06/24 18:46:58] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096419663 [2022/06/24 18:46:59] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096420682 [2022/06/24 18:47:00] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096421696 [2022/06/24 18:47:01] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096422655 [2022/06/24 18:47:02] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096425696 [2022/06/24 18:47:05] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096428686 [2022/06/24 18:47:08] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096430710 [2022/06/24 18:47:10] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096431661 [2022/06/24 18:47:11] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096434709 [2022/06/24 18:47:14] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096435655 [2022/06/24 18:47:15] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096437670 [2022/06/24 18:47:17] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096438655 [2022/06/24 18:47:18] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096439675 [2022/06/24 18:47:19] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096440691 [2022/06/24 18:47:20] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096441701 [2022/06/24 18:47:21] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096443747 [2022/06/24 18:47:23] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096445704 [2022/06/24 18:47:25] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096449695 [2022/06/24 18:47:29] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096450679 [2022/06/24 18:47:30] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096451681 [2022/06/24 18:47:31] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096452669 [2022/06/24 18:47:32] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096453645 [2022/06/24 18:47:33] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096454693 [2022/06/24 18:47:34] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096455683 [2022/06/24 18:47:35] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096456675 [2022/06/24 18:47:36] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096457698 [2022/06/24 18:47:37] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096459704 [2022/06/24 18:47:39] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096460664 [2022/06/24 18:47:40] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096462681 [2022/06/24 18:47:42] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096463700 [2022/06/24 18:47:43] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096464691 [2022/06/24 18:47:44] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096465694 [2022/06/24 18:47:45] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096466652 [2022/06/24 18:47:46] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096467653 [2022/06/24 18:47:47] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096468696 [2022/06/24 18:47:48] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096469747 [2022/06/24 18:47:49] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096472717 [2022/06/24 18:47:52] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096473665 [2022/06/24 18:47:53] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096474674 [2022/06/24 18:47:54] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096475721 [2022/06/24 18:47:55] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096476664 [2022/06/24 18:47:56] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096477653 [2022/06/24 18:47:57] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096480673 [2022/06/24 18:48:00] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096481704 [2022/06/24 18:48:01] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
1656096482634 [2022/06/24 18:48:02] [engine] caught signal (SIGSEGV)
1656096483998 AWS for Fluent Bit Container Image Version 2.25.0


PettitWesley commented 2 years ago

Sorry folks, I may not be able to take a look at this immediately.

If someone is willing to try these techniques to get a stack trace for me, that would help speed up the fix: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#segfaults-and-crashes-sigsegv
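
Roughly, the steps look like this (a generic sketch; the binary and config paths are assumptions, and the linked guide has the full details):

    # allow core dumps, then run fluent-bit with the crashing config
    ulimit -c unlimited
    /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf

    # after the SIGSEGV, open the core file and capture a backtrace
    gdb /fluent-bit/bin/fluent-bit core
    # at the (gdb) prompt run: bt full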

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

rajeev-netomi commented 2 years ago

Not stale.

PRIHLOP commented 2 years ago

unstale

jujubetsz commented 2 years ago

Having the same issue here. The container uses its entire CPU request, causing high CPU load.

    service: |
      [SERVICE]
          Daemon Off
          Flush 1
          Log_Level info
          Parsers_File parsers.conf
          Parsers_File custom_parsers.conf
          HTTP_Server On
          HTTP_Listen 0.0.0.0
          HTTP_Port 2020
          Health_Check On

    inputs: |
      [INPUT]
          Name tail 
          Tag kube.*    
          Path /var/log/containers/*.log    
          DB /var/log/flb_kube.db   
          Mem_Buf_Limit 128MB   
          Skip_Long_Lines On    
          Refresh_Interval 10   
          Buffer_Chunk_Size 256k
          Buffer_Max_Size 100MB
          multiline.parser docker, cri

    filters: |
      [FILTER]
          Name kubernetes   
          Match kube.*  
          Kube_URL https://kubernetes.default.svc:443   
          Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt 
          Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token   
          Kube_Tag_Prefix kube.var.log.containers.  
          Merge_Log Off 
          Merge_Log_Key log_processado  
          Merge_Log_Trim On 
          Keep_Log Off  
          K8S-Logging.Parser Off    
          K8S-Logging.Exclude Off   
          Labels On 
          Annotations Off
      [FILTER]
          Name                                 multiline
          Match                                kube.*
          multiline.key_content                log
          multiline.parser                     go, java, python
          buffer                               On
          emitter_mem_buf_limit                64M

fluentbit 1.9.9

idanovo commented 1 year ago

I'm facing the same issue when using the multiline parser together with the rewrite_tag filter. I'm using version 0.21.6 of the fluent-bit Helm chart, which uses fluent-bit 2.0.6.

I'm using the tail input:

    [INPUT]
      Name              tail
      Tag               kube.*
      Path              /var/log/containers/my-app*.log
      multiline.parser  docker, cri
      Mem_Buf_Limit     10MB
      Refresh_Interval  10

And using the rewrite_tag filter like this:

   [FILTER]
      Name    rewrite_tag
      Match   kube.*
      Rule    $log .*\"log_audit\":true.*$ audit-logs.$kubernetes_namespace_name false

I can't really use the exclude filter to drop these logs, as I need them, but I want them to go to a different output than the other logs. Any idea how I can solve this issue? Is there any workaround for this?

This is my complete fluentbit configuration:

config:
  service: |
    [SERVICE]
      Daemon       Off
      Flush        {{ .Values.flush }}
      Log_Level    {{ .Values.logLevel }}
      Parsers_File parsers.conf
      Parsers_File custom_parsers.conf
      HTTP_Server  On
      HTTP_Listen 0.0.0.0
      HTTP_Port   {{ .Values.metricsPort }}
      Health_Check On

  inputs: |
    [INPUT]
      Name              tail
      Tag               kube.*
      Path              /var/log/containers/my-app-server*.log
      multiline.parser  docker, cri
      Mem_Buf_Limit     10MB
      Refresh_Interval  10

    [INPUT]
      Name              tail
      Tag               kube.*
      Path              /var/log/containers/my-other-app*.log
      multiline.parser  docker, cri
      Mem_Buf_Limit     10MB
      Refresh_Interval  10

  filters: |
    [FILTER]
      Name                kubernetes
      Match               kube.*
      Kube_URL            https://kubernetes.default.svc:443
      Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
      Merge_Log           On
      Merge_Log_Key       data
      K8S-Logging.Parser  On
      K8S-Logging.Exclude On

    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes
      Add_prefix    kubernetes_

    [FILTER]
        Name          nest
        Match         kube.*
        Operation     lift
        Nested_under  kubernetes_labels
        Add_prefix    kubernetes_labels_

    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes_annotations
      Add_prefix    kubernetes_annotations_

    [FILTER]
      Name    rewrite_tag
      Match   kube.*
      Rule    $log .*\"log_audit\":true.*$ audit-logs.$kubernetes_namespace_name false

  outputs: |
    [OUTPUT]
      Name                s3
      Match               audit-logs.*
      bucket              my-bucket-name
      region              {{ .Values.region }}
      total_file_size     10M
      upload_timeout      1m
      use_put_object      on
      compression         gzip
      auto_retry_requests true
      s3_key_format       /namespace=$TAG[1]/year=%Y/month=%m/day=%d/hour=%H/%H:%M:%S-$UUID.gz
      role_arn            {{ .Values.fluentbit_role_arn }}

    [OUTPUT]
      Name                          s3
      Match                         kube.*
      bucket                        lakefs-audit-logs-{{ .Values.region }}-{{ .Values.environment }}
      region                        {{ .Values.region }}
      total_file_size               10M
      upload_timeout                1m
      use_put_object                on
      compression                   gzip
      auto_retry_requests           true
      s3_key_format_tag_delimiters ._
      s3_key_format                 /namespace=$TAG[5]/year=%Y/month=%m/day=%d/hour=%H/%H:%M:%S-$UUID.gz
      role_arn                      {{ .Values.fluentbit_role_arn }}

    [OUTPUT]
      Name                   loki
      Match                  kube.*
      Host                   {{ .Values.loki.endpoint }}
      port                   443
      tls                    on
      tls.verify             on
      http_user              {{ .Values.loki.username }}
      http_passwd            {{ .Values.loki.password }}
      auto_kubernetes_labels true
      Labels                 collector=fluent-bit, namespace=$kubernetes_namespace_name, pod=$kubernetes_pod_name, container=$kubernetes_container_name, Region={{ .Values.region }}

jujubetsz commented 1 year ago

I know it's ugly, but since I really need this working, I set up a cronjob to rollout-restart the fluent-bit pods every day at 03:00.
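
A minimal sketch of such a CronJob (namespace, service account and image are placeholders; the service account needs permission to restart the fluent-bit DaemonSet):

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: fluent-bit-daily-restart
      namespace: logging                # placeholder namespace
    spec:
      schedule: "0 3 * * *"             # every day at 03:00
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: fluent-bit-restarter   # placeholder SA with patch rights on the DaemonSet
              restartPolicy: OnFailure
              containers:
                - name: kubectl
                  image: bitnami/kubectl:1.27
                  command: ["kubectl", "-n", "logging", "rollout", "restart", "daemonset/fluent-bit"]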

rajeev-netomi commented 1 year ago

@PettitWesley Any idea if a similar solution is supported for AWS ECS Fargate, as suggested by @jujubetsz?

leonardo-albertovich commented 1 year ago

I'm currently making some unrelated improvements that involve modifying the multiline core machinery, so if anyone is able to put together a somewhat reliable reproduction (it would be awesome if it was minimized, but understandable if that's not possible) with sample input data, I could look into this.

If you need help putting together the reproduction case, just tag me here or talk to me in Slack.

PettitWesley commented 1 year ago

@rajeev-netomi I don't recommend restarting FLB at a certain time if it's crashing... I'm not sure of a good solution here, but I think you want to restart after it crashes. Or see if you can find an older version which doesn't crash. Sorry!

rajeev-netomi commented 1 year ago

Yeah @PettitWesley. Basically we are marking Fluent Bit as a non-essential container, so that if it crashes we do not stop the entire task and only the non-essential container is restarted. This will be very useful for scenarios where we can tolerate loss of logging for a short duration.

leonardo-albertovich commented 1 year ago

On the contrary, if you need to stick with the version that crashes, I would recommend preventively restarting the service. Think about it: when the binary crashes it means there probably was memory corruption, and realistically speaking, the invalid access that caused the crash was probably not the first, so the longer you let it run, the higher the odds that whatever memory corruption issue there is affects the integrity of the chunks (the data files that hold the ingested data while in transit).

So, ideally, I would recommend trying to downgrade to a version that doesn't crash, but if you are forced to stick with this one, then I think you are better off actively recycling the pod and retaining control rather than waiting for it to "take care of itself".

I take it that you don't have a reliable reproduction, right?

rajeev-netomi commented 1 year ago

@leonardo-albertovich Currently there is no stable build which does not have this issue. We have a very simple requirement: we want to concatenate multiline logs into a single log event, and for that we are using the built-in filters. Below is the configuration file.

[FILTER]
    name                  multiline
    match                 *
    multiline.key_content log
    multiline.parser      java

Dockerfile

FROM public.ecr.aws/aws-observability/aws-for-fluent-bit:2.29.1
ADD java-multi-line.conf /fluent-bit/configs/java-multi-line.conf

This container is crashing for applications with a high volume of logs.

ahhda commented 1 year ago

For anyone facing similar issues: we tried a lot of different configurations, including increasing memory limits, but it didn't work. In the end, we had to replace fluent-bit with fluentd to make our logging work with the multiline parser.
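
For anyone taking the same route, a minimal fluentd sketch of the equivalent concatenation (assuming the fluent-plugin-concat plugin is installed and that new events start with a timestamp, as in the Java sample earlier in this thread):

    <filter kube.**>
      @type concat
      key log
      # lines starting with a date begin a new event; everything else is
      # appended to the previous record
      multiline_start_regexp /^\d{4}-\d{2}-\d{2}/
      separator ""
      flush_interval 5
    </filter>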

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

ryan65 commented 1 year ago

I have just opened a related issue. I found that somehow we have a regression after upgrading to 2.1.2 from 1.9.7: https://github.com/fluent/fluent-bit/issues/7782

I am wondering if others have noticed a similar issue, and if downgrading to 1.9.x fixes it for you.

terencebor commented 1 year ago

> I have just opened a related issue. I found that somehow we have a regression after upgrading to 2.1.2 from 1.9.7 #7782
>
> I am wondering if others have noticed a similar issue, and if downgrading to 1.9.x fixes it for you.

Same issue: after upgrading from 2.0.9 to 2.1.2 I immediately see high CPU, pod restarts and "error registering chunk with tag" messages. The issue is gone once I downgrade back to 2.0.9. I have just one multiline [FILTER] with a custom [MULTILINE_PARSER] in the config. "buffer off" eliminates the errors on version 2.1.2, but the multiline filter stops working with "buffer off" configured in the [FILTER], even though it is used with the tail input.

ryan65 commented 1 year ago

> I have just opened a related issue. I found that somehow we have a regression after upgrading to 2.1.2 from 1.9.7 #7782 I am wondering if others have noticed a similar issue, and if downgrading to 1.9.x fixes it for you.
>
> Same issue: after upgrading from 2.0.9 to 2.1.2 I immediately see high CPU, pod restarts and "error registering chunk with tag" messages. The issue is gone once I downgrade back to 2.0.9. I have just one multiline [FILTER] with a custom [MULTILINE_PARSER] in the config. "buffer off" eliminates the errors on version 2.1.2, but the multiline filter stops working with "buffer off" configured in the [FILTER], even though it is used with the tail input.

Hi terence. Thanks for the update. Indeed you are right; I just checked 2.0.9 with my application and CPU is OK. So it seems the regression started after 2.0.9. I'm really surprised no one is looking at this yet. I would say this is a critical bug. https://github.com/fluent/fluent-bit/issues/7782

terencebor commented 1 year ago

Tried fluent-bit v2.1.9 - I don't see the "error registering chunk" errors anymore, but I still have the high CPU problem.

ryan65 commented 1 year ago

Is there any update on the status of this issue? I would say this is pretty critical.

adl3n commented 1 year ago

Hello everyone. I still get an error on version 2.1.9. I request assistance in resolving this issue. I've tried various configurations, and the problem occurred after upgrading from 1.8.6 to 2.1.x.

ryan65 commented 1 year ago

> Hello everyone. I still get an error on version 2.1.9. I request assistance in resolving this issue. I've tried various configurations, and the problem occurred after upgrading from 1.8.6 to 2.1.x.

I must say it is strange that no one seems to be looking into this. Anyway, I suggest you either return to your original version for now, or upgrade to 2.0.6. I think this is the last version that doesn't have this issue.

lucianosarra commented 1 year ago

I installed version 2.1.10 but I still have the problem of high CPU consumption. I downgraded to 1.9.10

ryan65 commented 1 year ago

> I installed version 2.1.10 but I still have the problem of high CPU consumption. I downgraded to 1.9.10

Hi Luciano. From what I tested, the problem started after 2.0.9, so if you like you can try that version. I wish someone would look into this already.

lucianosarra commented 1 year ago

> I installed version 2.1.10 but I still have the problem of high CPU consumption. I downgraded to 1.9.10
>
> Hi Luciano. From what I tested, the problem started after 2.0.9, so if you like you can try that version. I wish someone would look into this already.

Hi Ryan, I confirm that 2.0.9 version doesn't have the CPU issue.

damien-jacinto commented 11 months ago

Same here: 2.0.9 works. I tried 2.1.x without success and haven't tried 2.2.x yet.

alexku7 commented 10 months ago

2.2.2 is still not working

ryan65 commented 10 months ago

> 2.2.2 is still not working

I'm absolutely shocked that no one is handling this issue. Quite worrying; we simply can't upgrade.

saurabhgcect commented 10 months ago

Any workaround or update on the above issue?

braydonk commented 9 months ago

I took a look at this issue today.

The root problem: the internal in_emitter plugin that the multiline filter uses to emit the joined logs does not respect the FLB_INPUT_PAUSED state. This state is set and checked automatically in flb_input_chunk.c, and those routines will defensively refuse to append to chunks if the plugin is paused. This is not a problem for most input plugins, as they will simply pause and resume their collectors as the overall input instance is paused. That is not the case for the emitter, where any plugin can just append records to it at any time. So the multiline filter indiscriminately slams the emitter with new logs, causing the emitter plugin to fail to append most of them and drop the logs.

Worth noting that this is different from the behaviour documented for the emitter_mem_buf_limit field. I actually have no idea how to implement the documented behaviour; I couldn't find a way for a filter to wait for an input plugin to resume without blocking the entire program (essentially deadlocking it, since the emitter would not be able to process the logs it has in order to make room, so it could never resume).

This problem exists for both multiline and rewrite_tag, as their usage of emitter (just adding records directly to it) is the same.

I opened a PR (shown above) that cuts out the middleman and checks whether the emitter is paused before even trying to write to it. This will at least stop the error-log spam. Technically this issue is about high CPU, and the PR does not solve that. The multiline processor is very CPU intensive in general, so when it's run as a filter on the main thread it's no wonder it takes a whole CPU. I'm not sure what the solution to that is.


As a user, with the design of this plugin it is basically necessary to take steps to make it resilient to higher throughputs. This is roughly the same as what you'd need to do for any input plugin; it's just a bit odd because you wouldn't think of needing those kinds of strategies on a filter plugin. However, to use this plugin effectively, it is essentially necessary to understand the under-the-hood detail that the emitter is also an input plugin, which can suffer the same problems as any other input plugin.

Here are the solutions I would recommend, which are not dissimilar to those for any input plugin:
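
For illustration, a filter definition tuned along the lines discussed earlier in this thread (a larger emitter buffer and/or filesystem buffering for the emitter) might look roughly like this; the specific values are just examples, not a prescription:

    [FILTER]
        name                  multiline
        match                 *
        multiline.key_content log
        multiline.parser      java
        buffer                on
        # raise the in-memory limit for the emitter, or move it to filesystem
        # buffering entirely (requires storage.path in the [SERVICE] section)
        emitter_mem_buf_limit 100M
        emitter_storage.type  filesystem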

saurabhgcect commented 9 months ago

@braydonk: I switched the emitter to filesystem buffering but ran into another issue where Fluent Bit on Kubernetes does not recover after a Fluentd restart; chunks were stuck in storage. My setup is fluentbit (2.0.6) -> fluentd (1.16.3) -> ES/Kibana. After restarting fluentbit, all the chunks were processed from storage. I enabled debug mode and it looks like flushing is very slow. I tried the below scenario:

  1. Ran a log generator pod
  2. Configured Fluent Bit to capture the logs and forward them to fluentd
  3. Stopped the fluentd service
  4. Checked the logs with debug mode enabled -- Fluent Bit kept trying to connect to fluentd and, in parallel, chunks were being written to the filesystem.
  5. When the count of chunks reached 3k+ (around 2.5 GB), I started the fluentd service.
  6. The chunks were flushed to fluentd, but it took around 5-6 hours.

I repeated the same steps with one addition, restarting the fluentbit service, and saw the chunks were processed in less than 10 minutes.

braydonk commented 9 months ago

@saurabhgcect This sounds unrelated to this issue. I'd recommend opening a separate issue with config and logs included.

braydonk commented 9 months ago

#8473 is the solution that should be used for this. I've closed my PR in favour of it.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 6 months ago

This issue was closed because it has been stalled for 5 days with no activity.

braydonk commented 6 months ago

I believe this should be fixed as of v3.0.3, which has #8473. I'll re-open and close this properly since the issue should be fixed, but if it is not, feel free to re-open with details.