fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Hot reload stuck in progress after pausing inputs #9354

Open yuvalhnoga opened 2 weeks ago

yuvalhnoga commented 2 weeks ago

Bug Report

Executing a hot reload from a sidecar container in the same pod as fluent-bit with curl -X POST http://localhost:2020/api/v2/reload -d '{}' gets stuck, and if you try to reload again you get {"reload":"in progress","status":-2}
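
For reference, the v2 reload API also has a read counterpart that reports how many hot reloads have completed, which is one way to tell whether a reload ever finished (the response shape is the documented one; the count value below is just an example):

# Read how many hot reloads have completed so far
curl -s http://localhost:2020/api/v2/reload
# Example response: {"hot_reload_count":3}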

In the fluent-bit logs we can see the following, and it stays stuck like this forever:

[engine] caught signal (SIGHUP)
[ info] reloading instance pid=1 tid=0x7f20c800ee40
[ info] [reload] stop everything of the old context
[ warn] [engine] service will shutdown when all remaining tasks are flushed
[ info] [input] pausing input-kube-pods-logs
[ info] [input] pausing input-kube-node-logs
[ info] [input] pausing cpu.2
[ info] [engine] service has stopped (0 pending tasks)
[ info] [input] pausing input-kube-pods-logs
[ info] [input] pausing input-kube-node-logs
[ info] [input] pausing cpu.2

Normally, when the reload works, the lines that come after this look like:

[input] pausing input-kube-node-logs
[input] pausing cpu.2
[output:forward:forward-to-fluentd] thread worker #0 stopping...
[output:forward:forward-to-fluentd] thread worker #0 stopped
[output:forward:forward-to-fluentd] thread worker #1 stopping...
[output:forward:forward-to-fluentd] thread worker #1 stopped
[input:tail:input-kube-node-logs] inotify_fs_remove(): inode=3146021 watch_fd=1
[input:tail:input-kube-node-logs] inotify_fs_remove(): inode=3145852 watch_fd=2
[input:tail:input-kube-node-logs] inotify_fs_remove(): inode=3145851 watch_fd=3
[input:tail:input-kube-node-logs] inotify_fs_remove(): inode=3146020 watch_fd=4
[reload] start everything

To Reproduce

Create fluent-bit.yaml:

service:
  flush: 5
  daemon: off
  http_server: on
  http_listen: 0.0.0.0
  http_port: 2020
  health_check: on
  hot_reload: on
  hc_errors_count: 5
  hc_retry_failure_count: 5
  hc_period: 5
  log_level: info
  parsers_file: /fluent-bit/etc/parsers.conf

pipeline:
  inputs:
    - name: cpu
      tag: temp_cpu
  filters:
    - name: throttle
      match: kubernetes.var.log.containers.example1-*
      alias: example1
      rate: 999999
      window: 999999
      interval: 999999s
  outputs:
    - name: forward
      alias: forward-to-fluentd
      match: kubernetes.*
      upstream: /fluent-bit/etc/upstream.conf
      port: 24224
      retry_limit: false

includes:
  - /fluent-bit/filters/*.yaml
  - /fluent-bit/config/filters.yaml
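
The forward output above points at an upstream file that is not included in the report; a minimal sketch of what such a file can look like, using Fluent Bit's upstream servers format (node names, hosts, and ports here are placeholders, not taken from the original report):

[UPSTREAM]
    name forward-to-fluentd

[NODE]
    name fluentd-node-1
    host fluentd-0.fluentd.logging.svc.cluster.local
    port 24224

[NODE]
    name fluentd-node-2
    host fluentd-1.fluentd.logging.svc.cluster.local
    port 24224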

Expected behavior

The reload should either fail or complete, but it should not hang.

Your Environment

Additional context

I am running a sidecar next to my fluent-bit container that dynamically creates a per-container throttling filter for the containers on the same node as fluent-bit (fluent-bit runs as a DaemonSet in my Kubernetes cluster). Every time the filter configuration changes, the sidecar performs a hot reload with curl -X POST http://localhost:2020/api/v2/reload -d '{}'.

Is there maybe an option to verify when the reload fails so I can retry it? If I try to run the reload again it just says it is in progress forever. Also, is there a way to monitor or detect that this has happened? Since the health check still reports OK, I only noticed it because I stopped receiving logs. It would be nice if there were a way to know about it and maybe restart the fluent-bit container or perform some sort of forced reload.
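
One possible workaround, sketched below rather than something fluent-bit provides out of the box: have the sidecar compare the hot_reload_count reported by GET /api/v2/reload before and after the POST, and treat the reload as stuck if the count does not increase within some timeout. The endpoint and field name come from the v2 reload API; the 60s timeout, the digits-only parsing, and the exit codes are assumptions for illustration.

#!/bin/sh
# Sketch: trigger a hot reload and verify it completed by watching hot_reload_count.
# The endpoint and field are from the v2 reload API; timeout and exit codes are arbitrary.
API=http://localhost:2020/api/v2/reload

# Counter before the reload (digits-only extraction to avoid a jq dependency).
before=$(curl -s "$API" | tr -cd '0-9')

# Trigger the reload.
curl -s -X POST "$API" -d '{}'

# Poll for up to ~60s; a successful reload should increment the counter.
for i in $(seq 1 12); do
  sleep 5
  after=$(curl -s "$API" | tr -cd '0-9')
  if [ "${after:-0}" -gt "${before:-0}" ]; then
    echo "reload completed (hot_reload_count ${before:-0} -> $after)"
    exit 0
  fi
done

echo "reload did not complete; the fluent-bit container may need a restart" >&2
exit 1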

yuvalhnoga commented 2 weeks ago

It seems reproducing this issue is not as easy as I thought, but I have some more insight:

I managed to find all the pods where this issue happens by running this query in Prometheus:

increase(fluentbit_uptime[5m]) == 0
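
A minimal Prometheus alerting rule built on that query could look like the following; the rule name, for duration, and labels are placeholders, not from the original report:

groups:
  - name: fluent-bit
    rules:
      - alert: FluentBitUptimeStalled
        expr: increase(fluentbit_uptime[5m]) == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "fluent-bit uptime stopped increasing (possibly a stuck hot reload)"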

I noticed that the pods that stop sending logs are also the ones where the fluentbit_uptime metric stops increasing. I also looked at the resource consumption of the fluent-bit container: in more than half of them, memory usage is around 20-30 MB, while my request is 64 MB and my limit is 256 MB. CPU consumption was much the same (not fully used).

I checked how many back-to-back hot reloads it took to fail: in most cases it was 2 reloads about 30s apart, though in some it was as much as 2 minutes apart. In one pod it took just 1 reload to make it stop working, and both CPU and memory usage in that container were low.

I tried reproducing it by spamming the hot reload curl in multiple different kinds of pods under different scenarios.

None of those variations seemed to affect whether the hot reload fails, even when spamming the hot reload curl nonstop without any wait.

Not really sure what else it could be, but it's a tough one :stuck_out_tongue: