fluent / fluent-operator

Operate Fluent Bit and Fluentd in the Kubernetes way - Previously known as FluentBit Operator
Apache License 2.0

Hot reload issues #1249

Open chrono2002 opened 2 months ago

chrono2002 commented 2 months ago

Describe the issue

We have a CI pipeline that deploys filters, parsers and outputs into several namespaces. Before each deployment, it deletes everything in the target namespace.

Starting with version 2.7.0, we see the following errors:

[pod/fluent-bit-v4rzh/fluent-bit] level=info time=2024-07-19T14:05:03Z msg="Config file changed, reloading..."
[pod/fluent-bit-v4rzh/fluent-bit] level=info time=2024-07-19T14:05:03Z msg="Config file changed, reloading..."
[pod/fluent-bit-v4rzh/fluent-bit] level=info time=2024-07-19T14:05:03Z msg="Config file changed, reloading..."
[pod/fluent-bit-ctq8k/fluent-bit] level=info time=2024-07-19T14:05:06Z msg="Config file changed, reloading..."
[pod/fluent-bit-ctq8k/fluent-bit] level=info time=2024-07-19T14:05:06Z msg="Config file changed, reloading..."
[pod/fluent-bit-ctq8k/fluent-bit] level=info time=2024-07-19T14:05:06Z msg="Config file changed, reloading..."
[pod/fluent-bit-p9ftx/fluent-bit] level=info time=2024-07-19T14:05:17Z msg="Config file changed, reloading..."
[pod/fluent-bit-p9ftx/fluent-bit] level=info time=2024-07-19T14:05:17Z msg="Config file changed, reloading..."
[pod/fluent-bit-p9ftx/fluent-bit] level=info time=2024-07-19T14:05:17Z msg="Config file changed, reloading..."

It looks like Fluent Bit reloads on every object deletion. When parsers are deleted before the filters that reference them, the reload gets stuck and Fluent Bit crashes, then restarts normally.

[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-playtest-ppp-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-playtest-ppp-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-dev04-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-dev04-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa16-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa16-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa18-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa18-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa10-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa10-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-dev03-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-dev03-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa19-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa19-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa20-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa20-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa17-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa17-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa11-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qa11-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qc-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-qc-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-mainline-qa-rewrite] initializing
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [ info] [input:emitter:prod-mainline-qa-rewrite] storage_strategy='filesystem' (memory + filesystem)
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [error] [filter:parser:parser.323] requested parser 'cw-meta-meta-server-json-message-time-field-60d52537bbd89f341cbf30ffd3c7677d' not found
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [error] [filter:parser:parser.323] Invalid 'parser'
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [error] Failed initialize filter parser.323
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:39] [error] [engine] filter initialization failed
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing tail.0
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing storage_backlog.1
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-playtest-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa09-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-xxx01-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-gd01-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa03-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-playtest-ppp-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev04-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa16-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa18-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa10-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev03-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa19-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa20-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa17-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa11-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qc-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-mainline-qa-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa15-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa04-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa02-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-mainline-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-xxx02-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa01-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-ld01-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev02-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa07-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa12-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-consoles-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev01-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa14-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa05-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-xxx03-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa06-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa13-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa08-re_emitter
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-playtest-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa09-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-xxx01-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-gd01-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa03-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-playtest-ppp-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev04-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa16-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa18-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa10-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-dev03-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa19-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa20-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa17-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qa11-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-qc-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [ info] [input] pausing prod-mainline-qa-rewrite
[pod/fluent-bit-xp5lp/fluent-bit] [2024/07/19 13:57:40] [error] [reload] loaded configuration contains error(s). Reloading is aborted
[pod/fluent-bit-xp5lp/fluent-bit] reloading is aborted and exit
[pod/fluent-bit-xp5lp/fluent-bit] level=error time=2024-07-19T13:57:40Z msg="Failure during the run time of fluent-bit" error="failed to run fluent-bit: exit status 255"

To Reproduce

Expected behavior

Your Environment

- Fluent Operator version: >2.7.0
- Container Runtime: containerd
- Operating system: ubuntu
- Kernel version:

How did you install fluent operator?

helm

Additional context

No response

cw-Guo commented 2 months ago

reload only once when all operations in ns finishes

How does fluent-operator know when your operations are done?

In my opinion, you can control the create/delete order in your CI system, and that will resolve this problem.

chrono2002 commented 2 months ago

reload only once when all operations in ns finishes

How does fluent-operator know when your operations are done?

In my opinion, you can control the create/delete order in your CI system, and that will resolve this problem.

How exactly are you suggesting we control it? We have a Helm chart that simply installs parsers, filters and outputs. We've tried placing the parsers section before the filters, and the filters section before the parsers, with no luck.

Cajga commented 2 weeks ago

@cw-Guo we use GitOps to deploy, and when we deploy a bigger application, many fluent-operator CRs get created, which seems to trigger many reloads on the fluent-bit pods.

This causes trouble for us, as fluent-bit starts hanging from time to time (https://github.com/fluent/fluent-operator/issues/1332).

It seems fluent-bit has some issues with hot reload: https://github.com/fluent/fluent-bit/issues/9354

While these are most probably fluent-bit bugs, being a bit more "kind" with the reload requests might help.

How about a solution where, instead of reloading immediately on every CR change, fluent-operator "collects" the changes for some configurable period (like 1 minute) and triggers a single reload if any change happened during that period?
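To make the idea concrete, here is a minimal Python sketch of collecting changes over a fixed window and reloading once. It is only an illustration of the technique, not fluent-operator code; the class name ReloadCoalescer, its methods and the injected reload callback are all hypothetical. The window starts at the first change after the last reload, so a burst of changes results in at most one reload per window.

```python
import time
from typing import Callable, Optional


class ReloadCoalescer:
    """Collects change notifications and fires one reload per window."""

    def __init__(
        self,
        reload: Callable[[], None],
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._reload = reload
        self._window = window_seconds
        self._clock = clock
        # Time of the first change since the last reload, or None if clean.
        self._first_change_at: Optional[float] = None

    def notify_change(self) -> None:
        """Record that something changed; later changes join the same window."""
        if self._first_change_at is None:
            self._first_change_at = self._clock()

    def tick(self) -> bool:
        """Call periodically. Reloads once if the window has elapsed."""
        if self._first_change_at is None:
            return False
        if self._clock() - self._first_change_at < self._window:
            return False
        self._first_change_at = None
        self._reload()
        return True


if __name__ == "__main__":
    coalescer = ReloadCoalescer(lambda: print("reload"), window_seconds=0.1)
    for _ in range(5):
        coalescer.notify_change()  # five changes in a burst
    time.sleep(0.2)
    coalescer.tick()  # triggers a single reload for the whole burst
```

A fixed window bounds the delay before a change takes effect. A sliding window that restarts on every change would be an alternative, at the cost of an unbounded delay under a constant stream of changes.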

ping @markusthoemmes

markusthoemmes commented 2 weeks ago

I'm not really active in this project right now, but I did eventually solve this internally. Essentially, I created a script that gets the current number of reloads (GET "http://0.0.0.0:2020/api/v2/reload") and then runs a hot reload. Afterwards, it gets the number of reloads again. If it is the same as before, the script retries the reload. The need for that was supposed to be fixed via https://github.com/fluent/fluent-bit/issues/8457 though, so we should now be able to handle the return value of the reload and retry on error.
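The script itself is not shown in this thread, so the following is only a Python sketch of the check-reload-check loop described above. The function name and parameters are hypothetical, and the two callables are left to the caller: get_reload_count would wrap the GET request mentioned above and extract the count from its response, and trigger_reload would perform the hot reload. The response format is not given here, so the sketch does not parse it.

```python
import time
from typing import Callable


def reload_until_counted(
    get_reload_count: Callable[[], int],
    trigger_reload: Callable[[], None],
    max_attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Trigger a hot reload and retry until the reload count changes.

    Returns True once a reload is observed, False if every attempt
    left the count unchanged.
    """
    for _ in range(max_attempts):
        before = get_reload_count()
        trigger_reload()
        # Give the reload some time to complete before checking again.
        sleep(delay_seconds)
        if get_reload_count() != before:
            return True
    return False


if __name__ == "__main__":
    # Fake endpoint for demonstration: the first two reloads are silently lost.
    state = {"count": 0, "requests": 0}

    def fake_get() -> int:
        return state["count"]

    def fake_reload() -> None:
        state["requests"] += 1
        if state["requests"] >= 3:
            state["count"] += 1

    ok = reload_until_counted(fake_get, fake_reload, delay_seconds=0.0)
    print(ok, state["requests"])
```

If the reload request reports failures reliably, as the linked fix is meant to ensure, checking that return value directly would be simpler than comparing counts.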