marcozov opened this issue 3 weeks ago (status: Open)
It seems we are running into the same issue. It happens randomly, on random nodes. For me, fluent-bit just seems to hang sometimes when a reload (or multiple reloads in a short period) happens. @marcozov, have you found any workaround?
NOTE: when this happens, fluent-bit does not respond to requests. For example, if I port-forward port 2020 and call:
curl -X POST -d '{}' localhost:2020/api/v2/reload
it times out and nothing is written to the logs of that pod.
@markusthoemmes, would you have any hints on how we could debug this further? For us it happens around once per day.
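For the next occurrence, one thing that might help narrow it down is probing fluent-bit's other monitoring endpoints, to see whether the whole built-in HTTP server is stuck or only the reload endpoint. A rough sketch (namespace and pod name are placeholders):

```sh
# Placeholders: adjust to the namespace and name of the hanging pod.
NS=fluent
POD=fluent-bit-xxxxx

kubectl -n "$NS" port-forward "pod/$POD" 2020:2020 &

# Probe the standard fluent-bit monitoring endpoints with a short timeout.
# If all of them time out, the whole HTTP server is unresponsive, not just /api/v2/reload.
curl --max-time 5 localhost:2020/api/v1/uptime
curl --max-time 5 localhost:2020/api/v1/health
curl --max-time 5 localhost:2020/api/v1/metrics
```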
Describe the issue
I noticed this behaviour: sometimes fluent-bit pods (usually one at a time) just stop sending logs to the configured outputs (Elasticsearch in this case). There is no error message or any other unusual log line, apart from the one that reports that the config file changed:
level=info time=2024-09-03T09:57:40Z msg="Config file changed, reloading..."
I configured fluent-bit via namespace-scoped CRDs (FluentBitConfig and Output; I also have the cluster-wide ClusterFilter, ClusterInput and ClusterFluentBitConfig CRDs), and I noticed that the fluent-bit-config ConfigMap changes the order of the sections generated from the CRDs (e.g. the order of the configured outputs changes from time to time, although the overall content stays the same). The underlying idea is to create an Elasticsearch index per Kubernetes namespace and forward all of that namespace's traffic there. The generated configuration (produced entirely from the operator CRDs) looks like this:
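To illustrate the setup (a minimal, made-up sketch rather than the actual rendered config): a namespace-scoped Output plus the FluentBitConfig that selects it, assuming the fluentbit.fluent.io/v1alpha2 API. The Elasticsearch host, index name, match regex, names and labels are illustrative assumptions, and the exact field names should be checked against the fluent-operator CRD reference.

```sh
# Illustrative sketch only, not the actual generated configuration.
# Names, labels, host, index and the match regex are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: fluentbit.fluent.io/v1alpha2
kind: Output
metadata:
  name: es-my-namespace
  namespace: my-namespace
  labels:
    fluentbit.fluent.io/enabled: "true"
spec:
  # Assumes the default tail input tag layout:
  # kube.var.log.containers.<pod>_<namespace>_<container>-<id>.log
  matchRegex: kube\.var\.log\.containers\..*_my-namespace_.*
  es:
    host: elasticsearch-master.logging.svc
    port: 9200
    index: my-namespace
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: FluentBitConfig
metadata:
  name: fluentbit-config
  namespace: my-namespace
spec:
  # Selects the namespace-scoped Output above by label.
  outputSelector:
    matchLabels:
      fluentbit.fluent.io/enabled: "true"
EOF
```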
I also noticed that, when I check the Prometheus targets, that pod shows up as DOWN.
I also tried to manually curl the pod (from another pod) and, indeed, it is not reachable, while all the other pods belonging to the same DaemonSet are reachable and keep sending logs to the intended outputs. Any idea what is going on? I also checked whether there could be resource (CPU/RAM) issues (via kubectl describe and kubectl get pods -o yaml), but nothing obvious showed up. The pod is not failing and is not being restarted. Once the pod is restarted, everything works again as intended.
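For reference, the kind of checks described above can be run roughly like this (a sketch; namespace and pod name are placeholders, and the curl image is just an example):

```sh
# Placeholders: adjust to the namespace and name of the affected pod.
NS=fluent
POD=fluent-bit-xxxxx

# Reachability check from another pod, against the hanging pod's IP.
POD_IP=$(kubectl -n "$NS" get pod "$POD" -o jsonpath='{.status.podIP}')
kubectl -n "$NS" run curl-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl --max-time 5 "http://$POD_IP:2020/api/v1/health"

# Resource usage, restart count and recent events for the pod.
kubectl -n "$NS" top pod "$POD"
kubectl -n "$NS" get pod "$POD" -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'
kubectl -n "$NS" get events --field-selector involvedObject.name="$POD"
```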
To Reproduce
I can't really reproduce the issue, since it happens randomly and it is "solved" once the pod is killed (and a new one is created).
Expected behavior
The logs are sent to the intended output, or at least an informative error message highlights the cause of this behaviour.
Your Environment
How did you install fluent operator?
Via the Helm chart (with Argo CD, from the https://fluent.github.io/helm-charts/ repository), with the following parameters:
I disabled the built-in inputs and filters because I configured them myself, in order to use the namespace-scoped CRDs.
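Roughly along these lines (shown as a plain Helm install rather than through Argo CD; this is a sketch, and the fluentbit.* value keys used to disable the built-in input/filter sections are assumptions that should be verified against the chart's values.yaml):

```sh
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

# Value keys below are assumptions; verify them against the fluent-operator chart's values.yaml.
helm upgrade --install fluent-operator fluent/fluent-operator \
  --namespace fluent --create-namespace \
  --set fluentbit.input.tail.enable=false \
  --set fluentbit.filter.kubernetes.enable=false
```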
Additional context
No response