juan-domenech opened 3 months ago
Is there any update on this? I can confirm that I'm also facing the same issue
@amanbisht Can you do me the favour of validating my "To Reproduce" steps? It would give me peace of mind to know that others can reproduce this issue in their environments. Thank you!
In my case, we migrated from 2.1.3 to 3.1.6 and started seeing a lot of OOM errors. I can confirm it's happening because of very long log lines. We identified the very long log lines and either suppressed or shortened them; once we did that, the OOM-related errors stopped showing up.
Bug Report
Fluent-bit gets killed by Kubernetes due to Out Of Memory (OOM) while trying to assemble (very) long log lines, no matter the value of Buffer_Max_Size.

Currently in Kubernetes, log lines longer than 16 KBytes get truncated and appended independently to the log files using partial flags (CRI log format). The Fluent-bit Tail input can put them back together using multiline.parser cri, but this operation does not take the configured memory limits into account, i.e. the assembling loop can grow out of bounds if the final result is big enough. This is a known issue and is mentioned in the official Multiline documentation, but when this growth exceeds the memory assigned to the container, the container gets killed, creating disturbances in production environments.
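For context, this is roughly what a CRI runtime writes when it splits a long line: each 16 KByte chunk is flagged P (partial) and only the last chunk is flagged F (full). The timestamps and payload below are made up for illustration:

```
2024-05-01T10:00:00.000000001Z stdout P AAAAAAAA...   <- first 16 KByte chunk
2024-05-01T10:00:00.000000002Z stdout P AAAAAAAA...   <- next chunk of the same line
2024-05-01T10:00:00.000000003Z stdout F AAAAAAAA...   <- final chunk, line is complete
```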
To Reproduce
Create a container for Fluent-bit with a set amount of resources:
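The resource snippet from the original report is not reproduced here; a minimal sketch of what the container spec might contain, with assumed values:

```yaml
# Illustrative resources for the Fluent-bit container (values are assumptions);
# the key point is a hard memory limit that the multiline assembly can exceed
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    memory: 128Mi
```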
Fluent-bit configuration (intending to drop anything above 64K):
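The original configuration is not reproduced here; a minimal sketch of a Tail input using the CRI multiline parser with a 64K cap might look like this (the path and DB location are assumptions):

```ini
# Illustrative Tail input: reassemble CRI partial lines, cap the buffer at 64K
# and skip anything longer (path and DB location are assumptions)
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    multiline.parser  cri
    Buffer_Chunk_Size 32k
    Buffer_Max_Size   64k
    Skip_Long_Lines   On
    DB                /var/log/flb_kube.db
```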
Open a shell in one of your pods (not the Fluent-bit pod)
Generate a big log line
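The exact command from the original report is not included here; one way to emit a single ~4 MByte line (an illustrative assumption, any equivalent one-liner works) is:

```sh
# Print a single 4 MByte line (4 * 1024 * 1024 characters plus a trailing newline) to stdout
head -c 4194304 /dev/zero | tr '\0' 'A'; echo
```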
This will create a 4 MByte log line inside the container, which the container manager running on the node will split into 16 KByte chunks and append to the container log file.
Fluent-bit will try to put these lines back together, and Kubernetes will OOM-kill the Fluent-bit container during the process (if it doesn't, feel free to increase the number in the command above).
There are no Fluent-bit logs showing any issue. It dies before having a chance to say anything. This is the dmesg output from the Kubernetes node:

Expected behavior
Fluent-bit should skip partial log lines in the multiline.parser cri Tail loop when the memory limit is reached.

Your Environment
Workaround
No workaround necessary. Fortunately, Fluent-bit recovers gracefully after this and does not get into a crash loop (i.e. trying to assemble the same log lines again). By the looks of it, during the assembly loop it manages to update the file offset (stored in the DB), and the new instance picks up where the previous one left off. In some cases this could trigger another OOM if the long lines persist, but eventually the problematic data will be "processed", producing truncated log events (not ideal, but not critical).

References
There are other issues describing similar OOM problems, like #5711 and #5685, but I think they relate to back pressure coming from Outputs. I think this is a different matter.
Thanks!