Open PettitWesley opened 3 years ago
I am seeing a similar issue running Fluent Bit on Kubernetes. If container logs are rotated frequently by kubelet, the fluent-bit instances on those hosts grow in memory usage until they OOM. It seems likely there is a memory leak in how log rotation is handled.
+1, observing a similar issue
I have a similar issue. :(
It seems there is a memory leak. The settings I am using are as follows:
INPUT: http
FILTER: expect, rewrite_tag, record_modifier
OUTPUT: http, kafka
@PettitWesley We have identified a possible cause for this issue: https://github.com/fluent/fluent-bit/blob/master/plugins/out_cloudwatch_logs/cloudwatch_api.c#L1017. Every time get_or_create_log_streams is invoked, it checks whether any existing streams have expired and frees only those; streams that have not yet expired stay in memory, and a new stream is created for each new tag/file. The default expiration time for a stream is 4 hours (https://github.com/fluent/fluent-bit/blob/master/plugins/out_cloudwatch_logs/cloudwatch_api.c#L1066). If the log rotation interval is much shorter than that, the created log streams are not freed until the 4-hour mark is reached, so memory effectively accumulates for a long-running job. As a quick fix, lowering that expiration window should release the memory for rotated streams sooner.
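To make the accumulation concrete: with the 1-minute rotation from the issue description and the 4-hour default expiration, if each rotation ends up creating a new stream, roughly 240 entries can be resident before the first one is freed. Below is a minimal C sketch of the expiry behavior described above; it is a simplified illustration under those assumptions, not the actual cloudwatch_logs code, and the struct and function names are made up.

```c
#include <stdlib.h>
#include <time.h>

/* Assumed 4-hour default expiration, matching the value referenced above. */
#define STREAM_EXPIRATION_SECS (4 * 60 * 60)

/* Hypothetical, simplified stand-in for the plugin's per-stream state. */
struct log_stream {
    char *name;
    time_t last_used;
    struct log_stream *next;
};

/* On every lookup, free only the streams whose last use is older than the
 * expiration window; everything newer stays resident. With fast rotation,
 * many entries pile up in this list before any become old enough to free. */
static void sweep_expired_streams(struct log_stream **head, time_t now)
{
    struct log_stream **pp = head;

    while (*pp != NULL) {
        struct log_stream *s = *pp;

        if (now - s->last_used > STREAM_EXPIRATION_SECS) {
            *pp = s->next;   /* unlink the expired entry */
            free(s->name);
            free(s);
        }
        else {
            pp = &s->next;
        }
    }
}
```

Lowering STREAM_EXPIRATION_SECS (or making it configurable) would bound how long entries for rotated-away streams stay resident.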
@nithin-kumar good thought. I will fix this in cloudwatch_logs. The issue description says rotation happens every 1 minute, which is fast. I bet there are many other places in the code with similar behavior, where memory accumulates in proportion to the number of tags processed over time, that we (or anyone, @JeffLuoo) could find and profile if we had time...
I got a report from one of my customers which is interesting:
Currently, we use Fluent Bit to tail log files that are rotated by our app and then shipped to S3 and CloudWatch. The rotation works as follows: every 60 seconds, the driver renames the active file to something like "rotated-file.1" and creates a new file named "rotated-file-to-watch", which Fluent Bit is set to watch (real file names redacted).
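For reference, here is a rough sketch of that rotation step (my assumption of what the driver does based on the description above, not the customer's actual code):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Rotate once: move the watched file aside, then recreate it empty so both
 * the writer and Fluent Bit's tail input see a fresh "rotated-file-to-watch". */
static void rotate_once(void)
{
    /* Rename the file Fluent Bit is tailing; data already written stays with
     * the renamed inode. */
    rename("rotated-file-to-watch", "rotated-file.1");

    /* Recreate the watched path so subsequent writes land in a new file. */
    int fd = open("rotated-file-to-watch", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd >= 0) {
        close(fd);
    }
}
```

Every such cycle forces the tail input to pick up a brand-new file, which appears to be the pattern the memory growth correlates with.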
When running a long test job, Prometheus shows maximum memory usage of ~400 MB with this rotation turned on, compared to ~175 MB with rotation turned off and nothing else changed. This increase in memory usage is surprising.
Is this expected? Is it normal, and does it match what other users see? Is there any way to tune the config for this case?
The specific FluentBit configuration (slightly redacted) is below: