fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Fluent-bit stops processing logs under high load, locks writing in ch_manager pipe #2661

Open grep4error opened 3 years ago

grep4error commented 3 years ago

Bug Report

My fluent-bit 1.5.7 is running in a container in a Kubernetes (AKS) environment. It's configured to collect Docker logs (33 tail inputs) and send them to Elasticsearch (33 outputs), with a few filters in between. Recently, as the amount of logs per node increased, fluent-bit started sporadically freezing up. The process would keep running, consuming 0% CPU, and stop processing new logs and the filesystem storage backlog. It would, however, still respond to monitoring queries on its HTTP port. After some debugging with strace and gdb, I found that it locks up attempting to write to the ch_manager pipe. Here's the stack trace:

0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fac5aed74a7 in write () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055a127ecad01 in flb_output_return (ret=1, th=0x7fac548cd240) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:545
#2  0x000055a127ecade9 in flb_output_return_do (x=1) at /tmp/fluent-bit/include/fluent-bit/flb_output.h:576
#3  0x000055a127eccca1 in cb_es_flush (data=0x7fac5991e0b2, bytes=3842,
    tag=0x7fac5437fd20 "kube.gws.gws-platform-datacollector-blue-6c474d7c84-wsdcw.gws-platform-datacollector-blue.2db22adbd3daaaf836a3f6311f4b3e5ad9ec7727280458ac68868419fb758ab9", tag_len=154,
    ins=0x7fac5949e480, out_context=0x7fac56c6bc00, config=0x7fac59432c80) at /tmp/fluent-bit/plugins/out_es/es.c:748
#4  0x000055a127e72649 in output_pre_cb_flush () at /tmp/fluent-bit/include/fluent-bit/flb_output.h:449
#5  0x000055a1282a6907 in co_init () at /tmp/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6  0x3039663238613165 in ?? ()
#

strace output (filtered to reads from fd 13 and writes to fd 14, which are the ch_manager pipe FDs):

...
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
write(14, "\0\200\25\20\2\0\0\0", 8)    = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
write(14, "\0\200\17\20\2\0\0\0", 8)    = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\0\33\20\2\0\0\0", 8)       = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = 8
write(14, "\0\300\24\20\2\0\0\0", 8)    = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
write(14, "\0\0\33\20\2\0\0\0", 8)      = 8
read(13, "\0\200\25\20\2\0\0\0", 8)     = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
write(14, "\0\200\17\20\2\0\0\0", 8)    = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
write(14, "\0@\16\20\2\0\0\0", 8)       = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
read(13, "\0\0\33\20\2\0\0\0", 8)       = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = 8
read(13, "\0\200\17\20\2\0\0\0", 8)     = 8
read(13, "\0@\16\20\2\0\0\0", 8)        = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
write(14, "\0\300\24\20\2\0\0\0", 8)    = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
write(14, "\0@\16\20\2\0\0\0", 8)       = 8
read(13, "\0\0\27\20\2\0\0\0", 8)       = 8
write(14, "\0\0\33\20\2\0\0\0", 8)      = 8
read(13, "\0\0\36\20\2\0\0\0", 8)       = 8
read(13, "\0\300\37\20\2\0\0\0", 8)     = 8
write(14, "\0\0\27\20\2\0\0\0", 8)      = 8
write(14, "\0\0\36\20\2\0\0\0", 8)      = 8
read(13, "\0\300\r\20\2\0\0\0", 8)      = 8
write(14, "\0\300\37\20\2\0\0\0", 8)    = 8
read(13, "\0\300\24\20\2\0\0\0", 8)     = 8
read(13, "\0\200\1\20\2\0\0\0", 8)      = 8
read(13, "\0@\16\20\2\0\0\0", 8)        = 8
write(14, "\0\200\1\20\2\0\0\0", 8)     = 8
write(14, "\0\300\r\20\2\0\0\0", 8)     = ? ERESTARTSYS (To be restarted if SA_RESTART is set)

It looks like the Elasticsearch outputs may send so many responses to the inputs at the same time that the pipe fills up and blocks in write(). But the inputs run in the same thread, so they can't read the responses from the pipe, and fluent-bit locks up.
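
To illustrate the capacity involved, here is a minimal standalone sketch (plain POSIX C, not fluent-bit code) that counts how many of these 8-byte notifications fit into a pipe before a write would block. With a blocking pipe, the next write() hangs forever, because the only thread that could drain the pipe is the one stuck in write():

    /* Demo only: how many 8-byte notifications fit before the pipe is full? */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        uint64_t token = 0;   /* arbitrary 8-byte payload, like the writes in the strace above */
        long count = 0;

        if (pipe(fds) != 0) {
            perror("pipe");
            return 1;
        }

        /* Non-blocking write end only so the demo terminates at the point
         * where a single-threaded blocking writer would deadlock itself. */
        fcntl(fds[1], F_SETFL, O_NONBLOCK);

        while (write(fds[1], &token, sizeof(token)) == (ssize_t) sizeof(token)) {
            count++;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            printf("%ld notifications (%ld bytes) fit before write() would block\n",
                   count, count * 8L);
        }
        return 0;
    }

On a default 64 KiB Linux pipe this should report 8192 notifications; a 4 KiB pipe only has room for 512.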

I produced a dirty fix by making the ch_manager pipe non-blocking. I also tried extending the size of the pipe (or at least querying it), but ioctl fails to get or set the pipe size. See the snippet below; the last line is the one I added.

flb_engine.c:

    /*
     * Create a communication channel: this routine creates a channel to
     * signal the Engine event loop. It's useful to stop the event loop
     * or to instruct anything else without break.
     */
    ret = mk_event_channel_create(config->evl,
                                  &config->ch_manager[0],
                                  &config->ch_manager[1],
                                  config);
    if (ret != 0) {
        flb_error("[engine] could not create manager channels");
        return -1;
    }

    flb_pipe_set_nonblocking(&config->ch_manager[1]); /* <----- I made it non-blocking ------- */

There's probably a cleaner way to fix it, but this one-liner worked for me. Now I get an occasional "resource not available" error in the log, but fluent-bit survives and keeps crunching logs.
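
For reference, on Linux the non-blocking switch boils down to an fcntl call along these lines (an illustrative helper only; I am assuming flb_pipe_set_nonblocking is essentially a wrapper like this on Linux, and the real one may differ, e.g. on Windows):

    #include <fcntl.h>

    /* Illustrative only: make a pipe fd non-blocking, so write() returns
     * -1/EAGAIN instead of hanging when the pipe buffer is full -- which is
     * what turns the freeze into the occasional "resource not available"
     * message mentioned above. */
    static int pipe_fd_set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags == -1) {
            return -1;
        }
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }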

Environment

Version used: 1.5.7 (container fluent/fluent-bit:1.5.7)
Kubernetes: 1.16.13
Docker: 3.0.10+azure
OS: Ubuntu 16.04.1

avdhoot commented 3 years ago

Awesome, thanks @grep4error for all the help. @edsiper this is affecting more people: https://github.com/fluent/fluent-bit/issues/2621 and https://github.com/fluent/fluent-bit/issues/2577. Please do the needful; we can help validate a fix. cc @tomerleib

tomerleib commented 3 years ago

I've done two more tests in this area:

  1. Downgraded the image to 1.3.6 - there was a slight improvement: only one pod encountered the issue and froze, while 2/3 pods of the DaemonSet continued to work.
  2. Removed all the tails and left a single input and a single stdout output - no issues; Fluent Bit has been running for almost 3 days and nothing has stopped.
tomerleib commented 3 years ago

@grep4error do you have an image that you can share with me and I will test it in my environment as well?

tomerleib commented 3 years ago

No changes for me with 1.5.7 and this line; it still stopped processing after 20-30 minutes.

However, I created an image for 1.7.0 (yes, I know it's not released, but I forgot to change the branch) and everything worked for more than 14 hours.

mtparet commented 3 years ago

We have the same issue; for now we will revert to 1.3.x, which does not seem to have this problem.

tomerleib commented 3 years ago

@mtparet which 1.3.x version have you tested? I tested with 1.3.7 and saw the same issue, though not everywhere (2/3 pods kept running and 1 froze), but still.

mtparet commented 3 years ago

It was 1.3.9. I cannot guarantee we had no freezes, but I did not observe any.

tomerleib commented 3 years ago

Tested 1.3.9, after 28 hours one of the fluent-bit pods froze again...

mtparet commented 3 years ago

Hello @edsiper, do you acknowledge the issue, meaning a fix needs to be found? Or should this not happen at all, suggesting something else outside of fluent-bit is broken?

edsiper commented 3 years ago

FYI: I am taking a look at this (WIP)

edsiper commented 3 years ago

I am thinking about a solution. Making the pipe non-blocking is right, but the EAGAIN returned when writing to a full pipe will need extra care, since the notification from the output saying "I am done" or "I need a retry" would otherwise be missed.

Work in progress
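
Roughly, the concern looks like this (a generic sketch, not the final implementation): with a non-blocking pipe, the 8-byte token has to be kept and re-delivered when write() returns EAGAIN, otherwise the engine never learns that the flush finished or needs a retry.

    #include <errno.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Generic illustration, not fluent-bit code: deliver an 8-byte completion
     * token over a non-blocking pipe. Returns 0 when delivered, 1 when the
     * pipe is full and the caller must keep the token and retry later (for
     * example once the event loop has drained the pipe), -1 on other errors. */
    static int notify_flush_done(int pipe_wr_fd, uint64_t token)
    {
        ssize_t n;

        do {
            n = write(pipe_wr_fd, &token, sizeof(token));
        } while (n == -1 && errno == EINTR);

        if (n == (ssize_t) sizeof(token)) {
            return 0;
        }
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return 1;   /* do not drop the token here, or the result is lost */
        }
        return -1;
    }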

edsiper commented 3 years ago

Indeed, at high load ch_manager gets saturated; the obvious solution seems to be a pool of pipes/channels for the output plugins. We were abusing the internal event loop's channel manager, and it was quite OK up to a certain load. I will work on a POC with independent notification channels.

Work in progress

mtparet commented 3 years ago

Thanks a lot for clarifying and for your work on that, @edsiper!

mtparet commented 3 years ago

Indeed, at high load ch_manager gets saturated

Could increasing the allocated CPU be a workaround?

edsiper commented 3 years ago

Not necessarily; it's better to have separate channels for that specific kind of notification.

edsiper commented 3 years ago

@grep4error @avdhoot @tomerleib @mtparet @shirolimit @clebs

I've pushed a possible solution in the ch_manager branch; would you please build and test it? If you confirm the fix is good to go, I can merge it into the 1.6 series this week, but I need your help with proper feedback.

About the solution: every output plugin instance now has its own channels to notify events through the event loop, so it does not saturate the main engine ch_manager.
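
Conceptually it looks like this (a generic POSIX sketch with poll() standing in for the event loop; the names are illustrative and this is not the actual branch code): each output instance gets its own pipe pair and the engine watches all the read ends, so one busy output can no longer fill a shared channel.

    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define N_OUTPUTS 4

    int main(void)
    {
        int ch[N_OUTPUTS][2];
        struct pollfd pfd[N_OUTPUTS];
        uint64_t token = 42;

        /* One pipe per output instance instead of a single shared ch_manager. */
        for (int i = 0; i < N_OUTPUTS; i++) {
            if (pipe(ch[i]) != 0) {
                perror("pipe");
                return 1;
            }
            pfd[i].fd = ch[i][0];        /* the engine watches every read end */
            pfd[i].events = POLLIN;
        }

        /* Pretend output 2 finished a flush and signals the engine. */
        if (write(ch[2][1], &token, sizeof(token)) != (ssize_t) sizeof(token)) {
            perror("write");
        }

        if (poll(pfd, N_OUTPUTS, 1000) > 0) {
            for (int i = 0; i < N_OUTPUTS; i++) {
                if ((pfd[i].revents & POLLIN) &&
                    read(ch[i][0], &token, sizeof(token)) == (ssize_t) sizeof(token)) {
                    printf("notification from output %d: token=%llu\n",
                           i, (unsigned long long) token);
                }
            }
        }
        return 0;
    }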

edsiper commented 3 years ago

ping, any feedback ?

mtparet commented 3 years ago

I haven't taken the time to build/push/test it yet; I will.

tpetrov-lp commented 3 years ago

@edsiper do you have any recommendations for a workaround when running on Kubernetes? Basically, detecting a freeze and restarting is some sort of temporary solution. One option seems to be creating an additional file output and configuring the k8s liveness check to verify that this file is regularly updated, and restarting if it is not. Does this sound like a viable option to you? Or maybe something like the following, relying on the Prometheus metrics: https://github.com/fluent/fluent-bit/issues/2591

        livenessProbe:
          exec:
            command:
            - /bin/bash
            - -exc
            - >
              function getOutTotalF0() { curl -s --connect-timeout 5 --retry 3 -f http://localhost:65012/api/v1/metrics/prometheus |grep '^fluentbit_output_proc_records_total{name="forward.0"}' | awk '{print $2}' ;};
              T=$(getOutTotalF0);test -e ./m_oprtf0 && { test $(cat ./m_oprtf0) -ge $T && { echo "OutputRecordsTotal Forward0 freeze: $T"; exit 1;} ;}; test $T -ge -10 && { echo "$T" >| ./m_oprtf0; } || true;
          initialDelaySeconds: 120
          periodSeconds: 60
          failureThreshold: 8
          timeoutSeconds: 5
avdhoot commented 3 years ago

Right now I do not have the environment (high load meaning pods coming and going every minute), so I am not able to validate the fix. Sorry. cc @edsiper

tomerleib commented 3 years ago

@edsiper sorry for the delay, I've started testing it. I will let it run for a couple of days; once I have some insight I will update you.

mnaser commented 3 years ago

@edsiper We're running into that same issue. I'm building the Docker image and deploying it here too and will try to report back. If anyone wants to quickly consume it, we've published it to our registry:

registry.public.mtl1.vexxhost.net/openstack/fluent-bit:1.6.6-1

If you're using the Helm charts, this should do:

image:
  repository: registry.public.mtl1.vexxhost.net/openstack/fluent-bit
  tag: 1.6.6-1
tomerleib commented 3 years ago

@edsiper here's a quick update: high load, 3-node cluster. The DaemonSet has been running for 10 hours; currently 2/3 of the pods are still processing logs. One pod stopped processing any logs after 8 hours, with nothing in its log and nothing in the K8s events for this pod. Another pod that is still running reported that it was terminated some time ago with error code 135.

mtparet commented 3 years ago

Version 1.6.6 has been deployed in our production (13 big nodes) since yesterday; we will know in 2-3 days whether the fix is working for us.

tomerleib commented 3 years ago

I will start testing with v1.6.6 as well and see whether it's working or not; I believe that by tomorrow I will know for sure if at least one pod has stopped working.

mtparet commented 3 years ago

So far, there are no more freezes on our side.

tomerleib commented 3 years ago

I will start testing it simultaneously, first with @mnaser's image. @mtparet can you share the image that you're using? In my environment, when I built the image myself, it still stopped working.

mtparet commented 3 years ago

@tomerleib I am using version 1.6.6 with this configuration:

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_Tag_Prefix     kube.var.log.containers.
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               kube.*
    Refresh_Interval  5
    Mem_Buf_Limit     6MB
    Skip_Long_Lines   On
    read_from_head    on
    DB                /tail-db/tail-containers-state.db
    DB.Sync           Off
    DB.locking        true
[OUTPUT]
    Name  es
    Match *
    Host  XXXX
    Port  443
    Logstash_Format On
    Retry_Limit 5
    Type  _doc
    Trace_Error true
    Time_Key @timestamp-flb
    Replace_Dots On

    HTTP_User XXX
    HTTP_Passwd XXXX

    tls on
    tls.verify on
mnaser commented 3 years ago

Small clarification: when folks here say they're using 1.6.6, is that the upstream official 1.6.6 or our patched version? I think that would help the maintainers. My decision to tag it 1.6.6-1 comes from the assumption that it is 1.6.6 + 1 patch.

Also, 1.6.7 was released officially a few hours ago too.

cabrinha commented 3 years ago

Small clarification: when folks here say they're using 1.6.6, is that the upstream official 1.6.6 or our patched version? I think that would help the maintainers. My decision to tag it 1.6.6-1 comes from the assumption that it is 1.6.6 + 1 patch.

Also, 1.6.7 was released officially a few hours ago too.

Now, 1.6.8 is out :D

tomerleib commented 3 years ago

@edsiper So far the image that @mnaser created is working. Each one of the three pods is still running:

NAME                       READY   STATUS    RESTARTS   AGE
fluent-bit-sisense-b48mw   1/1     Running   0          4d3h
fluent-bit-sisense-dxjk4   1/1     Running   0          4d3h
fluent-bit-sisense-z84fj   1/1     Running   0          4d3h

An average of 100kb/s for incoming/outgoing bytes per pod.

jwitrick commented 3 years ago

I am still facing this issue. My setup is:

FluentBit v1.6.8 (installed via Helm chart version: 0.7.13).

My fluent-bit pods ramp up to using about 100 MiB+ (currently I have one at almost 500), and the pods just keep outputting lines like:

  [2020/12/23 12:19:03] [ warn] [engine] failed to flush chunk '1-1608725833.206221193.flb', retry in 503 seconds: task_id=32, input=tail.0 > output=es.0
  [2020/12/23 12:19:05] [ warn] [engine] failed to flush chunk '1-1608725859.437551097.flb', retry in 16 seconds: task_id=69, input=tail.0 > output=es.0

My setup: EKS 1.16 (old, I know...)

My FluentBit Service:

         [SERVICE]
            Flush 1
            Daemon Off
            Log_level warn
            Parsers_File parsers.conf
            Parsers_File custom_parsers.conf
            HTTP_Server On
            HTTP_Listen 0.0.0.0
            HTTP_Port 2020

inputs:

        [INPUT]
            Name tail
            Path /var/log/containers/*.log
            Parser docker_decode_json
            Tag kube.*
            Mem_Buf_Limit 30MB
            Skip_Long_Lines On
        [INPUT]
            Name tcp
            Tag tcp.*
            Chunk_size 32
            Buffer_size 64
            Body_size_limit 32m
            Keepalive_timeout 10s
        [INPUT]
            Name systemd
            Tag host.*
            Systemd_Filter _SYSTEMD_UNIT=kubelet.service
            Read_From_Tail On

filters:

        [FILTER]
            Name kubernetes
            Match kube.*
            Merge_Log Off
            Keep_Log Off
            K8S-Logging.Parser On
            K8S-Logging.Exclude On

outputs:

        [OUTPUT]
            Name es
            Log_level warn
            Match kube.*
            Host  <AWS.ES.CLUSTER>
            Port 443
            Buffer_Size False
            Replace_Dots On
            Logstash_Format On
            Retry_Limit False
            TLS On
        [OUTPUT]
            Name es
            Match host.*
            Host <AWS.ES.CLUSTER>
            Port 443
            Buffer_Size False
            Replace_Dots On
            Logstash_Format On
            Logstash_Prefix node
            Retry_Limit False
            TLS On
        [OUTPUT]
            Name es
            Log_level warn
            Match tcp.*
            Host  <AWS.ES.CLUSTER>
            Port 443
            Buffer_Size False
            Replace_Dots On
            Logstash_Format On
            Logstash_Prefix fulcrum
            Retry_Limit False
            TLS On

Custom Parsers:

        [PARSER]
            Name docker_no_time
            Format json
            Time_Keep Off
            Time_Key time
            Time_Format %Y-%m-%dT%H:%M:%S.%L
        [PARSER]
            Name docker_decode_json
            Format json
            Time_Key time
            Time_Format %Y-%m-%dT%H:%M:%S.%L
            Time_Keep On
            Decode_Field json log

Every day it seems that 1 (or more) pods just begin to throw the above errors. I have verified that the input queue on the ES side is not full/rejecting, so at the moment I am enabling error logging in my Elasticsearch cluster, but the same configuration will work for a while and then just start having issues.

agup006 commented 3 years ago

Hey folks, one update: our next release is focused on higher performance. We do have some preview versions available; if you are able to easily test them out and want to try, let me know. I'm @(Anurag Gupta) in the Fluentd Slack.

servo1x commented 3 years ago

Any updates?

Issues that could be related: #2830 #2950 #3006 #3039 #3014

agup006 commented 3 years ago

Hey folks, an update: 1.7.0 is out with higher performance. We will also look at some of the issues that @servo1x mentioned.

elruwen commented 3 years ago

Hey folks, an update: 1.7.0 is out with higher performance. We will also look at some of the issues that @servo1x mentioned.

Problem persists with 1.7.0 :(

shirolimit commented 3 years ago

I can confirm that the problem still persists in 1.7.2.

The flb-pipeline thread is hung with a stack trace similar to:

#0  0x00007fa9c3add4a7 in write () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000561fc44468e1 in flb_output_return (ret=1, co=0x7fa93aadc320) at /tmp/src/include/fluent-bit/flb_output.h:594
#2  0x0000561fc444693d in flb_output_return_do (x=1) at /tmp/src/include/fluent-bit/flb_output.h:630
#3  0x0000561fc4446a20 in cb_null_flush (data=0x7fa92b4b64c0, bytes=30805, tag=0x7fa93ac36e20 "kube.var.log.containers.97e67b65-2f98-436b-b0c4-6816c02bf8f1-jb7jp_default_donkey-43dd588ae731dee61eea1b74495b1293edac39c6d35cb446aef0b833282c831b.log", tag_len=150, i_ins=0x7fa9c208b300, out_context=0x7fa9c2075500, config=0x7fa9c2032900) at /tmp/src/plugins/out_null/null.c:50
#4  0x0000561fc43b757b in output_pre_cb_flush () at /tmp/src/include/fluent-bit/flb_output.h:466

The output may vary; in this case it is the simple null output.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

elruwen commented 3 years ago

This is still an issue. Happy to help out with debug builds etc.

pierluigilenoci commented 3 years ago

@elruwen also with 1.7.3?

Caroga commented 3 years ago

I have limited experience with fluentbit but stumbled across this thread as I am experiencing the same issues. Anything I could do for debugging/insights?

edsiper commented 3 years ago

please supply your config files

elruwen commented 3 years ago

@elruwen also with 1.7.3?

Sorry, I mixed up the issues. I've got an issue very similar to this one, see https://github.com/fluent/fluent-bit/issues/3014. That one is still an issue with 1.7.3, and I wouldn't be surprised if the two issues are related. That issue contains everything needed to reproduce it.

frankreno commented 2 years ago

Seeing this issue as well. Any updates? Any information we can capture and provide to help get this issue moving towards a fix? It looks like it has been around for several releases and makes fluent-bit unreliable under high load, and unpredictably so. We are seeing this in multiple Kubernetes environments, and there is no viable workaround for detecting it that doesn't risk missing logs.

danlenar commented 2 years ago

Currently running into this issue with file buffering turned on, and I might have a theory about why some folks are hitting it.

So, this issue might manifest depending on what the pipe buffer size is.

You could end up with one of three possible pipe buffer sizes, depending on which Linux kernel you are running and whether you are hitting the kernel's pipe buffer soft limits.

The default pipe buffer size on Linux is 65536 bytes. Since the maximum number of tasks is 2048, this pipe size should easily handle all the tasks without the pipe blocking.

If you are hitting the kernel pipe buffer soft limits, your pipe size might be either 4096 bytes (https://github.com/torvalds/linux/blob/v4.18/fs/pipe.c#L642) or 8192 bytes (https://github.com/torvalds/linux/blob/v5.15/fs/pipe.c#L797). On CentOS/RHEL 7/8 your pipe size would be 4096 bytes, while newer kernels give 8192 bytes. So you can run into a scenario where too many tasks/coroutines are added at once and the pipe blocks on writing. In practice this means the pipe only has room for 512 or 1024 task notifications, not the current maximum of 2048 tasks.

As a temporary workaround, you could add the CAP_SYS_RESOURCE capability to the fluent-bit Docker container, as that ensures the pipe buffer size is at least 65536 bytes.

A proper solution would be to limit the number of tasks that can run based on the pipe's buffer size.
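
If you want to check which size you are actually getting, a small standalone program (Linux-only fcntls, not part of fluent-bit) can query the pipe buffer and try to grow it:

    #define _GNU_SOURCE            /* for F_GETPIPE_SZ / F_SETPIPE_SZ */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];

        if (pipe(fds) != 0) {
            perror("pipe");
            return 1;
        }

        int size = fcntl(fds[1], F_GETPIPE_SZ);
        printf("pipe buffer: %d bytes (%d 8-byte notifications)\n", size, size / 8);

        /* 2048 tasks * 8 bytes = 16384 bytes would cover the current task
         * limit; growing the buffer is typically denied once the per-user
         * soft limit is exhausted, unless CAP_SYS_RESOURCE is granted. */
        if (fcntl(fds[1], F_SETPIPE_SZ, 16384) == -1) {
            perror("F_SETPIPE_SZ");
        }
        else {
            printf("resized to: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));
        }
        return 0;
    }

Below the soft limit (or with CAP_SYS_RESOURCE) the first line should show 65536; once the soft limit is exhausted it drops to the 4096 or 8192 bytes mentioned above.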

danlenar commented 2 years ago

I tried lowering the maximum task count from 2048 to 512 to see how a 4k pipe buffer would perform when hitting the pipe buffer limits. I ran into an issue that was fixed in Linux kernel 5.14 (https://github.com/torvalds/linux/commit/46c4c9d1beb7f5b4cec4dd90e7728720583ee348), where a pipe write may still block even if the pipe is not completely full.

With a maximum of 2048 tasks and each notification needing 8 bytes, you would need a buffer size of 16384 bytes. If the CAP_SYS_RESOURCE capability is not an option, perhaps writing to an array of 4k pipe buffers would do the trick.

zhanghe9702 commented 2 years ago

I tried lowering the maximum task count from 2048 to 512 to see how a 4k pipe buffer would perform when hitting the pipe buffer limits. I ran into an issue that was fixed in Linux kernel 5.14 (torvalds/linux@46c4c9d), where a pipe write may still block even if the pipe is not completely full.

With a maximum of 2048 tasks and each notification needing 8 bytes, you would need a buffer size of 16384 bytes. If the CAP_SYS_RESOURCE capability is not an option, perhaps writing to an array of 4k pipe buffers would do the trick.

I have tried 500 tasks, and the fluent-bit blocking still occurs (though it works normally after a restart). Maybe the coroutine scheduler algorithm has some bugs? :)

annettejanewilson commented 2 years ago

I believe we have been hitting this too. On a deployment of roughly 700 nodes across around 40 clusters, we were seeing around 8% of pods getting stuck. Attaching a debugger to the stuck instances, we saw a similar stack trace:

Thread 8 (Thread 0x7f25b61fd700 (LWP 13)):
#0  0x00007f25c231c459 in write () from /lib64/libpthread.so.0
#1  0x00000000004ff3e7 in flb_output_return (ret=1, co=0x7f25b5217308) at /tmp/fluent-bit-1.8.9/include/fluent-bit/flb_output.h:649
#2  0x00000000004ff443 in flb_output_return_do (x=1) at /tmp/fluent-bit-1.8.9/include/fluent-bit/flb_output.h:685
#3  0x00000000004ff4eb in flb_proxy_cb_flush (data=0x7f257439cd80, bytes=790, tag=0x7f25b41cab80 "kube.var.log.containers.off-boarding-manager-6aapa-74c94c54b5-k2nz5_apollo_off-boarding-manager-3477301c92e8e9da2d9c6769983a2c06107c2ebf139548995e7aca30777f639d.log", tag_len=164, i_ins=0x7f25bda3a680, out_context=0x7f25bda22330, config=0x7f25bda19980) at /tmp/fluent-bit-1.8.9/src/flb_plugin_proxy.c:65
#4  0x00000000004d6973 in output_pre_cb_flush () at /tmp/fluent-bit-1.8.9/include/fluent-bit/flb_output.h:514
#5  0x00000000009d3667 in co_init () at /tmp/fluent-bit-1.8.9/lib/monkey/deps/flb_libco/amd64.c:117

If we detach, wait and reattach it is still stuck there.

We think this is maybe only affecting nodes with either large numbers of files tracked, or large log volumes, or both.

When we terminate the stuck pod and it gets replaced we see it is very likely to get stuck again within the next ten to fifteen minutes. We assume that this is because it's on a node that is hosting particularly busy pods or particularly many pods but we're not sure.

We have tested the CAP_SYS_RESOURCE workaround on a subset of affected clusters and it does appear to remove the problem.

This is our current configuration:

    [SERVICE]
        Flush         1
        Log_Level     ${LOG_LEVEL}
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              ${PATH}
        Parser            ${LOG_PARSER}
        DB                ${FB_DB}
        Mem_Buf_Limit     200Mb
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Use_Kubelet         On
        Buffer_Size         2Mb
        K8S-Logging.Exclude On
        K8S-Logging.Parser  On
        Merge_Log           On
        Keep_Log            Off
        tls.verify          On

    [FILTER]
        Name          nest
        Match         *
        Operation     lift
        Nested_under  kubernetes

    [FILTER]
        Name          nest
        Match         *
        Operation     lift
        Nested_under  labels
        Add_prefix    labels.

    [FILTER]
        Name             modify
        Match            *
        Rename           labels.app                       service.name
        Rename           labels.app.kubernetes.io/name    service.name
        Rename           labels.app.kubernetes.io/version service.version
        Rename           trace_id                         trace.id
        Remove_wildcard  labels.
        Remove_wildcard  annotations
        Remove           container_hash
        Remove           docker_id
        Remove           pod_id

    [FILTER]
        Name           record_modifier
        Match          *
        Record         cluster_name ${CLUSTER_NAME}

    [OUTPUT]
        Name           newrelic
        Match          *
        licenseKey     ${LICENSE_KEY}
        endpoint       ${ENDPOINT}
        lowDataMode    true

I'll note that the Mem_Buf_Limit is set very high, but that was an earlier attempt to fix the problem before we understood what was going wrong. The newrelic output plugin is this: https://github.com/newrelic/newrelic-fluent-bit-output

Some of the relevant environment variables:

        - name: LOG_LEVEL
          value: info
        - name: LOG_PARSER
          value: docker
        - name: FB_DB
          value: /var/log/flb_kube.db
        - name: PATH
          value: /var/log/containers/*.log

The stack trace above is from 1.8.9 with Amazon's patches, which we had been trying to see if they fixed anything, but we have been having this problem with 1.8.12 too.

annettejanewilson commented 2 years ago

The soft pipe limit is per-user, right? As I understand it, Kubernetes doesn't namespace users, so if Fluent Bit is running inside a container as uid 0, and lots of other processes are all running inside containers also as uid 0, then are they all sharing that same limit? That could explain why it is hard to reproduce outside of production - it needs something else running as the same uid to consume all of the soft pipe limit, and it needs heavy load to fill up the queue.

annettejanewilson commented 2 years ago

I spent a while trying to make an isolated reproduction, but I haven't been successful yet. I am using one process running as the same user to allocate pipes and using fcntl with F_SETPIPE_SZ to resize them until hitting the limit, at which point all new pipes allocated can be observed (via fcntl with F_GETPIPE_SZ) to be 4096 bytes (one page). At this point I think what is required to reproduce is to generate a sustained backlog of tasks, but I haven't figured out how to do this in an isolated setup that doesn't depend on Kubernetes or New Relic.

I note @danlenar's observation earlier that it's possible to block on a write to a pipe even if it's not full. I would make this even more explicit: it's possible to block on a write to a pipe if the unused capacity is at least one byte and less than 4096 bytes. Once the pipe writer reaches a page boundary, it needs a completely free page that does not still contain any data waiting to be read from the pipe. You can observe this behaviour by creating a pipe, resizing it to 4096 bytes, writing 4096 bytes, reading 4095 bytes, then attempting to write 1 byte. Even though the pipe has only a single unread byte remaining, writing even a single byte into the pipe will block until that byte has been read.
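
Here is that sequence as a standalone sketch (it uses a non-blocking probe write so it prints instead of hanging):

    /* Demo of the page-boundary behaviour: a 4096-byte pipe with one unread
     * byte left in it still refuses even a 1-byte write, because the writer
     * needs a completely free page. Linux-only fcntls. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char buf[4096];

        if (pipe(fds) != 0 || fcntl(fds[1], F_SETPIPE_SZ, 4096) == -1) {
            perror("setup");
            return 1;
        }

        memset(buf, 'x', sizeof(buf));
        if (write(fds[1], buf, 4096) != 4096 ||   /* fill the single page */
            read(fds[0], buf, 4095) != 4095) {    /* drain all but one byte */
            perror("fill/drain");
            return 1;
        }

        /* A blocking 1-byte write would hang here even though 4095 bytes of
         * capacity are nominally free; probe it non-blockingly instead. */
        fcntl(fds[1], F_SETFL, O_NONBLOCK);
        if (write(fds[1], "y", 1) == -1 && errno == EAGAIN) {
            printf("1-byte write would block with 1 unread byte in the pipe\n");
        }

        if (read(fds[0], buf, 1) == 1 &&          /* consume the final byte... */
            write(fds[1], "y", 1) == 1) {
            printf("...and now the same write succeeds\n");
        }
        return 0;
    }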

Based on this, I think it should be possible to reproduce by creating the soft pipe limit scenario on a machine with a kernel older than 5.14, and finding some configuration that spends a meaningful proportion of time with a task queue of two or more tasks. Sooner or later two or more tasks should attempt to write to the (single-page-capacity) pipe as it is crossing the page boundary.