yuvalhnoga opened 2 weeks ago
Seems like reproducing this issue is not as easy as I thought, but I have some more insight:
I managed to find all the pods where this issue happens by querying this in prometheus:
increase(fluentbit_uptime[5m]) == 0
I noticed that the pods that stop sending logs also have their fluentbit_uptime metric stop increasing.
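For reference, a rough sketch of running that same expression against the Prometheus HTTP API to list the affected pods (the prometheus:9090 address, the pod label name, and the use of jq are assumptions about my setup):
# Hypothetical helper: list pods whose fluentbit_uptime counter stopped increasing.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=increase(fluentbit_uptime[5m]) == 0' \
  | jq -r '.data.result[].metric.pod'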
I looked at the resource consumption of the fluent-bit container: in more than half of the affected pods, memory usage is around 20-30MB, while my request is 64MB and my limit is 256MB. As for CPU, consumption looked much the same: nowhere near fully used.
I checked how many back-to-back hot reloads it took to fail: usually it was 2 reloads, spaced 30s apart, though in some cases the gap was as long as 2 minutes. In one pod it took just 1 reload to make it stop working, and both CPU and memory were low in the container.
I tried reproducing it by spamming the hot reload curl in multiple different kinds of pods with different scenarios:
None of these criteria seemed to affect whether the hot reload fails, even when spamming the hot reload curl nonstop without any wait.
Not really sure what else it could be but it’s a tough one :stuck_out_tongue:
Bug Report
Executing a hot reload from a sidecar container in the same pod as fluent-bit with
curl -X POST http://localhost:2020/api/v2/reload -d '{}'
gets stuck, and if you try to reload again you get {"reload":"in progress","status":-2}
In the fluent-bit log output we can see this, and it's stuck like this forever:
Normally, when the reload works, the lines that come after this should be:
To Reproduce
Create fluent-bit.yaml:
curl -X POST http://localhost:2020/api/v2/reload -d '{}'
multiple times until the reload hangs like I mentioned above (a simple loop like the sketch below is enough).
Expected behavior
The reload should fail or complete, but it should not hang.
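For clarity, this is roughly what I mean by spamming the reload (a sketch only; the sleep is arbitrary and can be dropped to fire the reloads with no wait at all):
# Hypothetical reproduction loop: keep triggering hot reloads back to back
# until the endpoint starts answering {"reload":"in progress","status":-2}.
while true; do
  curl -s -X POST http://localhost:2020/api/v2/reload -d '{}'
  echo
  sleep 1
done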
Your Environment
Additional context
I am running a sidecar next to my fluent-bit container that dynamically creates a filter with per-container throttling configuration for the containers on the same node as fluent-bit (fluent-bit runs as a DaemonSet in my Kubernetes cluster). Every time there's a change to the filter configuration, the sidecar performs a hot reload with
curl -X POST http://localhost:2020/api/v2/reload -d '{}'
Is there maybe an option to verify when the reload fails and to retry it? If I run the reload again it just says it's in progress forever. Also, is there a way to monitor or know that this has happened? The health check shows as ok, so I only noticed because I stopped receiving logs. It would be nice if there was a way to know about it and maybe restart the fluent-bit container or perform some sort of forced reload.
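As a rough sketch of the kind of verification I have in mind (this assumes the GET method on /api/v2/reload reports a hot_reload_count, as recent Fluent Bit releases do, and that jq is available; the polling interval and timeout are arbitrary):
# Hypothetical reload-with-verification wrapper: trigger a reload, then poll the
# hot reload counter; if it never increases, treat the reload as stuck.
before=$(curl -s http://localhost:2020/api/v2/reload | jq -r '.hot_reload_count')
curl -s -X POST http://localhost:2020/api/v2/reload -d '{}'
for i in $(seq 1 30); do
  sleep 2
  after=$(curl -s http://localhost:2020/api/v2/reload | jq -r '.hot_reload_count')
  if [ "$after" -gt "$before" ]; then
    echo "reload completed"
    exit 0
  fi
done
echo "reload appears stuck" >&2
exit 1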