grep4error opened this issue 3 years ago
Awesome, thanks @grep4error for all the help. @edsiper this is affecting more people: https://github.com/fluent/fluent-bit/issues/2621 & https://github.com/fluent/fluent-bit/issues/2577. Please take a look; we can help validate a fix. cc @tomerleib
I've done two more tests in this area:
@grep4error do you have an image that you can share with me and I will test it in my environment as well?
No changes for me with 1.5.7 and this line. It still stopped processing after 20-30 minutes.
However, I created an image for 1.7.0 (yes, I know it's not released, but I forgot to change the branch) and everything worked for more than 14 hours.
We have the same issue; for now we will revert to 1.3.x, which does not seem to have this problem.
@mtparet which 1.3.x version have you tested? I tested with 1.3.7 and saw the same issue, though not everywhere (2/3 pods kept running and 1 froze), but still.
It was 1.3.9. I cannot guarantee we had no freezes, but I did not observe any.
Tested 1.3.9, after 28 hours one of the fluent-bit pods froze again...
Hello @edsiper, do you acknowledge the issue, meaning we need to find a fix for it? Or should this not happen, and something else outside of fluent-bit is broken?
FYI: I am taking a look at this (WIP)
I am thinking about a solution. Making the socket async is right, but when EAGAIN is returned while trying to write to a full pipe, extra care will be needed, since the notification from the output saying "I am done" or "I need a retry" would be missed.
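To illustrate the concern (a sketch, not fluent-bit code; the helper name is made up): with a non-blocking notification pipe, a plain write can fail with EAGAIN when the pipe is full, and silently dropping that write loses the task's "done / retry" status, so the caller has to keep it and retry later.

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: push a flush result into a non-blocking notification pipe. */
static int notify_engine(int pipe_wr_fd, uint64_t status)
{
    ssize_t n = write(pipe_wr_fd, &status, sizeof(status));
    if (n == (ssize_t) sizeof(status)) {
        return 0;    /* notification delivered                               */
    }
    if (n == -1 && errno == EAGAIN) {
        return -2;   /* pipe full: caller must queue the status and retry,   */
    }                /* otherwise the task result is lost                    */
    return -1;       /* real error                                           */
}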
Work in progress
Indeed, at high load the ch_manager gets saturated; the obvious solution seems to be implementing a pool of pipes/channels for the output plugins. Until now we were overloading the internal event loop channel manager, and that was fine up to a certain load. I will work on a POC with independent notification channels.
Work in progress
Thanks a lot for clarifying and for your work on this, @edsiper!
Indeed at high load the ch_manager gets saturated
Could increasing the allocated CPU be a workaround?
Not necessarily; the idea is to have separate channels for that specific kind of notification.
@grep4error @avdhoot @tomerleib @mtparet @shirolimit @clebs
I've pushed a possible solution in branch ch_manager, would you please build it and test it? If you confirm the fix is good to go, I can merge it in the 1.6 series this week, but I need your help with proper feedback.
About the solution: now every output plugin instance has its own channels to notify events through the event loop, so it does not saturate the main engine ch_manager.
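Conceptually (this is a sketch of the idea, not the actual patch; the struct and helpers below are made up for illustration), each output instance gets its own pipe whose read end is registered in the event loop, so flush coroutines no longer all funnel through the shared ch_manager:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical per-output-instance notification channel */
struct out_instance {
    int ch_events[2];   /* [0] read end (engine/event loop), [1] write end (flush side) */
};

static int out_instance_channel_init(struct out_instance *ins)
{
    if (pipe(ins->ch_events) == -1) {
        return -1;
    }
    /* The read end would be registered in the engine event loop; the write
     * end is what a flush coroutine uses to report its OK/retry status. */
    return fcntl(ins->ch_events[1], F_SETFL, O_NONBLOCK);
}

/* Called from a flush coroutine instead of writing to the shared ch_manager */
static ssize_t out_instance_notify(struct out_instance *ins, uint64_t status)
{
    return write(ins->ch_events[1], &status, sizeof(status));
}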
Ping, any feedback?
I haven't taken the time to build/push/test it yet, but I will.
@edsiper do you have any recommendations for a workaround when running on Kubernetes? Basically, detecting a freeze and restarting is a sort of temporary solution. One option seems to be adding an additional file output and configuring the K8s liveness probe to verify that this file is regularly updated; if not, restart. Does this sound like a viable option to you? Or maybe something relying on the Prometheus metrics: https://github.com/fluent/fluent-bit/issues/2591
livenessProbe:
  exec:
    command:
      - /bin/bash
      - -exc
      - >
        function getOutTotalF0() { curl -s --connect-timeout 5 --retry 3 -f http://localhost:65012/api/v1/metrics/prometheus |grep '^fluentbit_output_proc_records_total{name="forward.0"}' | awk '{print $2}' ;};
        T=$(getOutTotalF0);test -e ./m_oprtf0 && { test $(cat ./m_oprtf0) -ge $T && { echo "OutputRecordsTotal Forward0 freeze: $T"; exit 1;} ;}; test $T -ge -10 && { echo "$T" >| ./m_oprtf0; } || true;
  initialDelaySeconds: 120
  periodSeconds: 60
  failureThreshold: 8
  timeoutSeconds: 5
Right now I do not have a high-load environment (pods coming & going every minute), so I am not able to validate the fix. Sorry. cc @edsiper
@edsiper Sorry for the delay, I've started to test it. I will let it run for a couple of days; once I have some insight I will update you.
@edsiper We're running into that same issue. I'm building the Docker image and deploying it here too and will try to report back. If anyone wants to quickly consume it, we've published it to our registry:
registry.public.mtl1.vexxhost.net/openstack/fluent-bit:1.6.6-1
If you're using the Helm charts, this should do:
image:
  repository: registry.public.mtl1.vexxhost.net/openstack/fluent-bit
  tag: 1.6.6-1
@edsiper here's a quick update: high load, 3-node cluster. The DaemonSet has been running for 10 hours; currently 2/3 of the pods are still processing logs. 1 pod stopped processing any logs after 8 hours, with nothing in its log and nothing in the K8s events for the pod. Another pod that is still running reported that it was terminated some time ago with error code 135.
Version 1.6.6 has been deployed in our production (13 big nodes) since yesterday; we will know in 2-3 days whether the fix is working for us.
I will start testing with v1.6.6 as well and see whether it's working or not; I believe that by tomorrow I will know for sure if at least one pod has stopped working.
So far, there are no more freezes on our side.
I will start testing it simultaneously, first with @mnaser's image. @mtparet can you share the image that you're using? In my environment, when I built the image myself, it stopped working.
@tomerleib I am using version 1.6.6 with this configuration:
[FILTER]
    Name kubernetes
    Match kube.*
    Kube_Tag_Prefix kube.var.log.containers.
    Kube_URL https://kubernetes.default.svc:443
    Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log On
    K8S-Logging.Parser On
    K8S-Logging.Exclude On
[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Parser docker
    Tag kube.*
    Refresh_Interval 5
    Mem_Buf_Limit 6MB
    Skip_Long_Lines On
    read_from_head on
    DB /tail-db/tail-containers-state.db
    DB.Sync Off
    DB.locking true
[OUTPUT]
    Name es
    Match *
    Host XXXX
    Port 443
    Logstash_Format On
    Retry_Limit 5
    Type _doc
    Trace_Error true
    Time_Key @timestamp-flb
    Replace_Dots On
    HTTP_User XXX
    HTTP_Passwd XXXX
    tls on
    tls.verify on
Small clarification: when folks here say they're using 1.6.6 -- do they mean the upstream official 1.6.6 or our patched version? I think that would help the maintainers. My reasoning for the 1.6.6-1 tag is the assumption that it is 1.6.6 + 1 patch.
Also, 1.6.7 was released officially a few hours ago too.
Now, 1.6.8 is out :D
@edsiper So far the image that @mnaser created is working. Each one of the three pods is still running:
NAME READY STATUS RESTARTS AGE
fluent-bit-sisense-b48mw 1/1 Running 0 4d3h
fluent-bit-sisense-dxjk4 1/1 Running 0 4d3h
fluent-bit-sisense-z84fj 1/1 Running 0 4d3h
An average of 100kb/s for incoming/outgoing bytes per pod.
I am still facing this issue. My setup is:
FluentBit v1.6.8 (installed via Helm chart version: 0.7.13).
My fluent-bit pods ramp up to using about 100MiB+ of memory (currently I have one at almost 500MiB), and the pods just keep outputting lines like:
[2020/12/23 12:19:03] [ warn] [engine] failed to flush chunk '1-1608725833.206221193.flb', retry in 503 seconds: task_id=32, input=tail.0 > output=es.0
[2020/12/23 12:19:05] [ warn] [engine] failed to flush chunk '1-1608725859.437551097.flb', retry in 16 seconds: task_id=69, input=tail.0 > output=es.0
My setup: EKS 1.16 (old, I know...)
My FluentBit Service:
[SERVICE]
    Flush 1
    Daemon Off
    Log_level warn
    Parsers_File parsers.conf
    Parsers_File custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
inputs:
[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Parser docker_decode_json
    Tag kube.*
    Mem_Buf_Limit 30MB
    Skip_Long_Lines On
[INPUT]
    Name tcp
    Tag tcp.*
    Chunk_size 32
    Buffer_size 64
    Body_size_limit 32m
    Keepalive_timeout 10s
[INPUT]
    Name systemd
    Tag host.*
    Systemd_Filter _SYSTEMD_UNIT=kubelet.service
    Read_From_Tail On
filters:
[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log Off
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On
outputs:
[OUTPUT]
    Name es
    Log_level warn
    Match kube.*
    Host <AWS.ES.CLUSTER>
    Port 443
    Buffer_Size False
    Replace_Dots On
    Logstash_Format On
    Retry_Limit False
    TLS On
[OUTPUT]
    Name es
    Match host.*
    Host <AWS.ES.CLUSTER>
    Port 443
    Buffer_Size False
    Replace_Dots On
    Logstash_Format On
    Logstash_Prefix node
    Retry_Limit False
    TLS On
[OUTPUT]
    Name es
    Log_level warn
    Match tcp.*
    Host <AWS.ES.CLUSTER>
    Port 443
    Buffer_Size False
    Replace_Dots On
    Logstash_Format On
    Logstash_Prefix fulcrum
    Retry_Limit False
    TLS On
Custom Parsers:
[PARSER]
    Name docker_no_time
    Format json
    Time_Keep Off
    Time_Key time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
[PARSER]
    Name docker_decode_json
    Format json
    Time_Key time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep On
    Decode_Field json log
Every day it seems that one (or more) pods will just begin to throw the above errors. I have verified that the input queue on the ES side is not full/rejecting, so at the moment I am enabling error logging in my Elasticsearch cluster, but the same setup will work for a while and then just start having issues.
Hey folks, one update: our next release is focused on higher performance. We do have some preview versions available; if you are able to easily test them out and want to try, let me know. I'm @(Anurag Gupta) in the Fluentd Slack.
Any updates?
Issues that could be related: #2830 #2950 #3006 #3039 #3014
Hey folks, an update: 1.7.0 is out with higher performance. We will also look at some of the issues that @servo1x has mentioned.
Problem persists with 1.7.0 :(
I can confirm that the problem still persists in 1.7.2. The flb-pipeline thread is hung with a stack trace similar to:
#0 0x00007fa9c3add4a7 in write () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x0000561fc44468e1 in flb_output_return (ret=1, co=0x7fa93aadc320) at /tmp/src/include/fluent-bit/flb_output.h:594
#2 0x0000561fc444693d in flb_output_return_do (x=1) at /tmp/src/include/fluent-bit/flb_output.h:630
#3 0x0000561fc4446a20 in cb_null_flush (data=0x7fa92b4b64c0, bytes=30805, tag=0x7fa93ac36e20 "kube.var.log.containers.97e67b65-2f98-436b-b0c4-6816c02bf8f1-jb7jp_default_donkey-43dd588ae731dee61eea1b74495b1293edac39c6d35cb446aef0b833282c831b.log", tag_len=150, i_ins=0x7fa9c208b300, out_context=0x7fa9c2075500, config=0x7fa9c2032900) at /tmp/src/plugins/out_null/null.c:50
#4 0x0000561fc43b757b in output_pre_cb_flush () at /tmp/src/include/fluent-bit/flb_output.h:466
The output plugin may vary; in this case it is the simple null output.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This is still an issue. Happy to help out with debug builds etc.
@elruwen also with 1.7.3?
I have limited experience with fluent-bit but stumbled across this thread as I am experiencing the same issues. Is there anything I could do to help with debugging/insights?
please supply your config files
@elruwen also with 1.7.3?
Sorry, I mixed up the issues. I have an issue very similar to this one, see https://github.com/fluent/fluent-bit/issues/3014. That one is still an issue with 1.7.3, and I wouldn't be surprised if the two are related. The issue contains everything needed to reproduce it.
Seeing this issue as well. Any updates on this? Any information we can capture and provide to help get this issue moving towards a fix? It looks like it has been around for several releases and makes fluent-bit unreliable under high load, and unpredictably so. We are seeing this in multiple Kubernetes environments, and there is no viable workaround for detecting the freeze that doesn't risk missing logs.
Currently running into this issue with file buffering turned on, and I might have a theory as to why some folks are hitting it.
This issue might manifest depending on what the pipe buffer size is.
You could end up with one of three possible pipe buffer sizes, depending on which Linux kernel you are running and whether you are hitting the kernel's pipe buffer soft limits.
The default pipe buffer size on Linux is 65536 bytes. Since the max number of tasks you can run is 2048, this pipe size should handle all the tasks easily without the pipe blocking.
If you are hitting the kernel pipe buffer soft limits, your pipe size might be either 4096 bytes (https://github.com/torvalds/linux/blob/v4.18/fs/pipe.c#L642) or 8192 bytes (https://github.com/torvalds/linux/blob/v5.15/fs/pipe.c#L797). On CentOS/RHEL 7/8 your pipe size would be 4096 bytes, while newer kernels would give 8192 bytes. So you can run into a scenario where you add too many tasks/coroutines at once and the pipe blocks on writing. This really means the pipe can only handle notifications for 512 or 1024 tasks and doesn't have room for the current max task count of 2048.
As a temporary workaround, you could add the CAP_SYS_RESOURCE capability to the fluent-bit docker container, as that ensures the pipe buffer size is at least 65536 bytes.
A proper solution would be to adjust the number of tasks that can be run based on the pipe's buffer size.
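A quick way to check which case you are hitting (a minimal sketch, assuming Linux with F_GETPIPE_SZ available): create a pipe and ask the kernel for its buffer size.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    if (pipe(fds) == -1) {
        perror("pipe");
        return 1;
    }

    /* 65536 = default; 4096 or 8192 usually means the per-user soft limit kicked in. */
    int size = fcntl(fds[0], F_GETPIPE_SZ);
    printf("pipe buffer size: %d bytes\n", size);

    close(fds[0]);
    close(fds[1]);
    return 0;
}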
Tried lowering the max task count from 2048 to 512 to see how a 4k pipe buffer would perform when hitting the pipe buffer limits. I ran into this issue, fixed in Linux kernel 5.14 (https://github.com/torvalds/linux/commit/46c4c9d1beb7f5b4cec4dd90e7728720583ee348), where you may still block on a pipe write even if the pipe is not completely full.
With 2048 tasks max and each notification needing 8 bytes, you would need a buffer size of 16384 bytes. If the CAP_SYS_RESOURCE capability is not an option, perhaps adding an array of 4k pipe buffers to write to would do the trick.
I have tried 500 tasks; the fluent-bit freeze still happens (but it works normally after restarting it). Maybe the coroutine scheduler algorithm has some bugs? :)
I believe we have been hitting this too, on a deployment of roughly 700 nodes across around 40 clusters, we were seeing around 8% of pods getting stuck. Attaching a debugger to the stuck instances we saw a similar stack trace:
Thread 8 (Thread 0x7f25b61fd700 (LWP 13)):
#0 0x00007f25c231c459 in write () from /lib64/libpthread.so.0
#1 0x00000000004ff3e7 in flb_output_return (ret=1, co=0x7f25b5217308) at /tmp/fluent-bit-1.8.9/include/fluent-bit/flb_output.h:649
#2 0x00000000004ff443 in flb_output_return_do (x=1) at /tmp/fluent-bit-1.8.9/include/fluent-bit/flb_output.h:685
#3 0x00000000004ff4eb in flb_proxy_cb_flush (data=0x7f257439cd80, bytes=790, tag=0x7f25b41cab80 "kube.var.log.containers.off-boarding-manager-6aapa-74c94c54b5-k2nz5_apollo_off-boarding-manager-3477301c92e8e9da2d9c6769983a2c06107c2ebf139548995e7aca30777f639d.log", tag_len=164, i_ins=0x7f25bda3a680, out_context=0x7f25bda22330, config=0x7f25bda19980) at /tmp/fluent-bit-1.8.9/src/flb_plugin_proxy.c:65
#4 0x00000000004d6973 in output_pre_cb_flush () at /tmp/fluent-bit-1.8.9/include/fluent-bit/flb_output.h:514
#5 0x00000000009d3667 in co_init () at /tmp/fluent-bit-1.8.9/lib/monkey/deps/flb_libco/amd64.c:117
If we detach, wait and reattach, it is still stuck there.
We think this may only be affecting nodes with large numbers of tracked files, large log volumes, or both.
When we terminate the stuck pod and it gets replaced, we see it is very likely to get stuck again within the next ten to fifteen minutes. We assume that this is because it is on a node that is hosting particularly busy pods or particularly many pods, but we're not sure.
We have tested the CAP_SYS_RESOURCE workaround on a subset of affected clusters and it does appear to remove the problem.
This is our current configuration:
[SERVICE]
    Flush 1
    Log_Level ${LOG_LEVEL}
    Daemon off
    Parsers_File parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
[INPUT]
    Name tail
    Tag kube.*
    Path ${PATH}
    Parser ${LOG_PARSER}
    DB ${FB_DB}
    Mem_Buf_Limit 200Mb
    Skip_Long_Lines On
    Refresh_Interval 10
[FILTER]
    Name kubernetes
    Match kube.*
    Use_Kubelet On
    Buffer_Size 2Mb
    K8S-Logging.Exclude On
    K8S-Logging.Parser On
    Merge_Log On
    Keep_Log Off
    tls.verify On
[FILTER]
    Name nest
    Match *
    Operation lift
    Nested_under kubernetes
[FILTER]
    Name nest
    Match *
    Operation lift
    Nested_under labels
    Add_prefix labels.
[FILTER]
    Name modify
    Match *
    Rename labels.app service.name
    Rename labels.app.kubernetes.io/name service.name
    Rename labels.app.kubernetes.io/version service.version
    Rename trace_id trace.id
    Remove_wildcard labels.
    Remove_wildcard annotations
    Remove container_hash
    Remove docker_id
    Remove pod_id
[FILTER]
    Name record_modifier
    Match *
    Record cluster_name ${CLUSTER_NAME}
[OUTPUT]
    Name newrelic
    Match *
    licenseKey ${LICENSE_KEY}
    endpoint ${ENDPOINT}
    lowDataMode true
I'll note that the Mem_Buf_Limit is set very high, but that was an earlier attempt to fix the problem before we understood what was going wrong. The newrelic output plugin is this: https://github.com/newrelic/newrelic-fluent-bit-output
Some of the relevant environment variables:
- name: LOG_LEVEL
  value: info
- name: LOG_PARSER
  value: docker
- name: FB_DB
  value: /var/log/flb_kube.db
- name: PATH
  value: /var/log/containers/*.log
The stack trace above is from 1.8.9 with Amazon's patches, which we had been trying in case they fixed anything, but we have been seeing this problem with 1.8.12 too.
The soft pipe limit is per-user, right? As I understand it, Kubernetes doesn't namespace users, so if Fluent Bit is running inside a container as uid 0, and lots of other processes are all running inside containers also as uid 0, then are they all sharing that same limit? That could explain why it is hard to reproduce outside of production - it needs something else running as the same uid to consume all of the soft pipe limit, and it needs heavy load to fill up the queue.
I spent a while trying to make an isolated reproduction, but I haven't been successful yet. I am using one process running as the same user to allocate pipes and resize them with fcntl and F_SETPIPE_SZ until hitting the limit, at which point all new pipes allocated can be observed (via fcntl with F_GETPIPE_SZ) to be 4096 bytes (one page). At this point I think what is required to reproduce is to generate a sustained backlog of tasks, but I haven't figured out how to do this in an isolated setup that doesn't depend on Kubernetes or New Relic.
I note @danlenar's observation earlier, that it's possible to block on a write to a pipe even if it's not full. I would make this even more explicit: it's possible to block on a write to a pipe if the unused capacity is at least one byte and less than 4096 bytes. Once the pipe writer reaches a page boundary, it needs a completely free page that does not still contain any data waiting to be read from the pipe. You can observe this behaviour by creating a pipe, resizing it to 4096 bytes, writing 4096 bytes, reading 4095 bytes, then attempting to write 1 byte. Even though the pipe has only a single unread byte remaining, writing even a single byte into the pipe will block until that byte has been read.
Based on this, I think it should be possible to reproduce by creating the soft pipe limit scenario on a machine with a kernel older than 5.14, and finding some configuration that spends a meaningful proportion of time with a task queue of two or more tasks. Sooner or later two or more tasks should attempt to write to the (single page capacity) pipe as it crosses the page boundary.
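A minimal sketch of the experiment described above (assuming Linux with F_SETPIPE_SZ; on kernels without the 5.14 fix, the final write blocks even though 4095 bytes of the buffer are free):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char page[4096] = {0};
    char scratch[4095];
    char one = 'x';
    ssize_t n;

    if (pipe(fds) == -1) {
        perror("pipe");
        return 1;
    }

    /* Shrink the pipe to a single page, as happens under the soft limit. */
    if (fcntl(fds[1], F_SETPIPE_SZ, 4096) == -1) {
        perror("F_SETPIPE_SZ");
        return 1;
    }

    n = write(fds[1], page, sizeof(page));      /* fill the single page    */
    n = read(fds[0], scratch, sizeof(scratch)); /* drain all but one byte  */

    printf("writing 1 byte into a pipe with 4095 free bytes...\n");
    n = write(fds[1], &one, 1);                 /* blocks on affected kernels */
    printf("write returned %zd\n", n);

    close(fds[0]);
    close(fds[1]);
    return 0;
}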
Bug Report
My fluent-bit 1.5.7 is running in a container in a k8s (AKS) environment. It's configured to collect docker logs (33 tail inputs configured) and send them to Elasticsearch (33 outputs), with a few filters. Recently, as the amount of logs per node increased, fluent-bit started sporadically freezing up. The process would keep running, consuming 0% CPU, and not process any new logs or the filesystem storage backlog. It would, however, respond to monitoring queries on its HTTP port. After some debugging using strace and gdb, I found that it locks up attempting to write to the ch_manager pipe. Here's the stack trace.
strace output (filtered to reads from fd 13 and writes to fd 14, which are the ch_manager pipe fds):
It looks like the Elasticsearch outputs may send so many responses to the inputs at the same time that the pipe fills up and blocks in write(). But the inputs are running in the same thread, so they can't read responses from the pipe, and fluent-bit locks up.
I produced a dirty fix for it by making the ch_manager pipe non-blocking. I also tried extending the size of the pipe (or at least getting it), but ioctl fails to get or set the new pipe size. See the snippet below; I added the last line.
flb_engine.c:
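(The surrounding code here is approximate and may differ between versions; roughly, the change is where the ch_manager event channel is created in flb_engine_start(), and the added last line just flips the write end to non-blocking so a full pipe returns EAGAIN instead of blocking the engine thread.)

    ret = mk_event_channel_create(config->evl,
                                  &config->ch_manager[0],
                                  &config->ch_manager[1],
                                  config);
    if (ret != 0) {
        flb_error("[engine] could not create manager channels");
        return -1;
    }
    /* added: never block when notifying through ch_manager */
    fcntl(config->ch_manager[1], F_SETFL, O_NONBLOCK);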
There's probably a cleaner way to fix it, but this one-liner worked for me. Now I am getting occasional "resource not available" errors in the log, but fluent-bit survives and continues crunching logs.
Environment
Version used: 1.5.7 (container fluent/fluent-bit:1.5.7), Kubernetes 1.16.13, Docker 3.0.10+azure, Ubuntu 16.04.1