marcozov opened this issue 3 weeks ago (status: Open)
It seems we are running into the same issue. It happens randomly, on random nodes. For me, fluent-bit just seems to hang sometimes when a reload (or multiple reloads in a short period) happens. @marcozov, have you found any workaround?
NOTE: when this happens, fluent-bit does not respond to requests. For example, if I port-forward port 2020 and call:
curl -X POST -d '{}' localhost:2020/api/v2/reload
it times out and nothing is written to the logs of that pod.
@markusthoemmes, would you have any hints on how we could debug this further? For us it happens around once per day.
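For the next occurrence, one thing that might help narrow it down is probing fluent-bit's other monitoring endpoints, to see whether the whole built-in HTTP server is stuck or only the reload endpoint. A rough sketch (namespace and pod name are placeholders):

```sh
# Placeholders: adjust to the namespace and name of the hanging pod.
NS=fluent
POD=fluent-bit-xxxxx

kubectl -n "$NS" port-forward "pod/$POD" 2020:2020 &

# Probe the standard fluent-bit monitoring endpoints with a short timeout.
# If all of them time out, the whole HTTP server is unresponsive, not just /api/v2/reload.
curl --max-time 5 localhost:2020/api/v1/uptime
curl --max-time 5 localhost:2020/api/v1/health
curl --max-time 5 localhost:2020/api/v1/metrics
```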
Describe the issue
I noticed this behaviour: sometimes fluent-bit pods (usually one at a time) just stop sending logs to the configured outputs (Elasticsearch in this case). There is no error message or any other unusual log line, apart from the one that reports that the config file changed:
level=info time=2024-09-03T09:57:40Z msg="Config file changed, reloading..."
I configured fluent-bit via namespace-scoped CRDs (FluentBitConfig and Output; I also have the cluster-wide ClusterFilter, ClusterInput and ClusterFluentBitConfig CRDs), and I noticed that the fluent-bit-config ConfigMap changes the order of the sections generated from the CRDs (e.g. the order of the configured outputs changes from time to time, although the overall content stays the same). The underlying idea is to create an Elasticsearch index per Kubernetes namespace and forward all of that namespace's traffic there. The generated configuration (produced entirely from the operator CRDs) looks like this:
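To illustrate the setup (a minimal, made-up sketch rather than the actual rendered config): a namespace-scoped Output plus the FluentBitConfig that selects it, assuming the fluentbit.fluent.io/v1alpha2 API. The Elasticsearch host, index name, match regex, names and labels are illustrative assumptions, and the exact field names should be checked against the fluent-operator CRD reference.

```sh
# Illustrative sketch only, not the actual generated configuration.
# Names, labels, host, index and the match regex are assumptions.
kubectl apply -f - <<'EOF'
apiVersion: fluentbit.fluent.io/v1alpha2
kind: Output
metadata:
  name: es-my-namespace
  namespace: my-namespace
  labels:
    fluentbit.fluent.io/enabled: "true"
spec:
  # Assumes the default tail input tag layout:
  # kube.var.log.containers.<pod>_<namespace>_<container>-<id>.log
  matchRegex: kube\.var\.log\.containers\..*_my-namespace_.*
  es:
    host: elasticsearch-master.logging.svc
    port: 9200
    index: my-namespace
---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: FluentBitConfig
metadata:
  name: fluentbit-config
  namespace: my-namespace
spec:
  # Selects the namespace-scoped Output above by label.
  outputSelector:
    matchLabels:
      fluentbit.fluent.io/enabled: "true"
EOF
```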
I also noticed that, when I check the Prometheus targets, that pod shows up as DOWN.
I also tried to manually curl the pod (from another pod) and, indeed, it is not reachable, while all the other pods belonging to the same DaemonSet are reachable and keep sending logs to the intended outputs. Any idea what is going on? I also checked whether there could be resource (CPU/RAM) issues (via kubectl describe and kubectl get pods -o yaml), but nothing obvious showed up. The pod is not failing and is not being restarted. Once the pod is restarted, everything works again as intended.
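For reference, the kind of checks described above can be run roughly like this (a sketch; namespace and pod name are placeholders, and the curl image is just an example):

```sh
# Placeholders: adjust to the namespace and name of the affected pod.
NS=fluent
POD=fluent-bit-xxxxx

# Reachability check from another pod, against the hanging pod's IP.
POD_IP=$(kubectl -n "$NS" get pod "$POD" -o jsonpath='{.status.podIP}')
kubectl -n "$NS" run curl-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl --max-time 5 "http://$POD_IP:2020/api/v1/health"

# Resource usage, restart count and recent events for the pod.
kubectl -n "$NS" top pod "$POD"
kubectl -n "$NS" get pod "$POD" -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'
kubectl -n "$NS" get events --field-selector involvedObject.name="$POD"
```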
To Reproduce
I can't really reproduce the issue, since it happens randomly and it is "solved" once the pod is killed (and a new one is created).
Expected behavior
The logs are sent to the intended output, or at least an informative error message highlights the cause of this behaviour.
Your Environment
How did you install fluent operator?
Via the Helm chart (with Argo CD, from the https://fluent.github.io/helm-charts/ repository), with the following parameters:
I disabled the built-in inputs and filters because I configured them myself, in order to use the namespace-scoped CRDs.
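Roughly along these lines (shown as a plain Helm install rather than through Argo CD; this is a sketch, and the fluentbit.* value keys used to disable the built-in input/filter sections are assumptions that should be verified against the chart's values.yaml):

```sh
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

# Value keys below are assumptions; verify them against the fluent-operator chart's values.yaml.
helm upgrade --install fluent-operator fluent/fluent-operator \
  --namespace fluent --create-namespace \
  --set fluentbit.input.tail.enable=false \
  --set fluentbit.filter.kubernetes.enable=false
```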
Additional context
No response