fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Hot Reload _sometimes_ does not work #8918

Open cdancy opened 1 month ago

cdancy commented 1 month ago

Hey All,

We've been putting together a POC that uses the hot reload feature of fluent-bit. The general idea: some custom CRD gets installed; a custom k8s operator we have watches for the CRD, reads its values, and adds a file to the fluent-bit configmap that defines a new output; the operator then sends an HTTP request to fluent-bit so it reloads itself and picks up the newly added configmap file. Our fluent-bit.conf file looks like so:

[SERVICE]
    Flush 60
    Log_Level debug
    Parsers_File /fluent-bit/etc/parsers.conf
    Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On
    Hot_Reload On

@INCLUDE *_inputs.conf
@INCLUDE *_filters.conf
@INCLUDE *_outputs.conf

When we go to add a new file to the configmap we'll name it something like hello-world_outputs.conf, then POST to the "reload endpoint", and expect those logs to start flowing...
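For reference, a minimal sketch of the reload call described above, written in Python. It assumes the documented hot-reload path `/api/v2/reload` on the HTTP_Port from the config (2020); the path itself is not stated in this post, and the function names are hypothetical, not the operator's actual code.

```python
import urllib.request

# Assumption: Fluent Bit's documented hot-reload path; the post does not name it.
RELOAD_PATH = "/api/v2/reload"


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST request that asks fluent-bit to reload its config."""
    return urllib.request.Request(
        base_url.rstrip("/") + RELOAD_PATH,
        data=b"{}",  # empty JSON body
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def trigger_reload(base_url: str, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `trigger_reload("http://localhost:2020")` against a pod with Hot_Reload On should be roughly equivalent to the POST our operator sends.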

50% of the time this works 100% of the time :)

Whenever we hit the reload endpoint we always see the relevant logging about catching the SIGHUP signal and see fluent-bit recycle itself from within, so that part seems to work. What doesn't always work is fluent-bit actually sending logs to the new OUTPUT destination. We don't see anything in the logs to suggest things are amiss, even with debug enabled. fluent-bit is clearly picking up the new config, as we see log lines related to the store_dir value (this is an S3 plugin), but logs never make it to the bucket. We've got upload_timeout set to 60s and total_file_size set to 5M. We've waited hours and still nothing starts streaming, even though we have test pods that are dumping loads of logs per second.

This is somewhat reproducible for us: the very first time we add the new file to the existing configmap and hit the reload endpoint, things don't work. If we delete the file from the configmap, hit the reload endpoint, add the file back to the configmap, and hit the reload endpoint again, things seem to start working, as odd as that may seem.

We tried latest fluent-bit on the 2.x series, as well as the 3.x series, and still the same behavior.

What does work for us though is if we add the noted file to the existing configmap, then effectively do a rolling restart, fluent-bit will 100% of the time pick up the new file and start sending logs as expected.

Any ideas on what might be going on? Is this a potential bug? Anything else I can give you all to help diagnose the issue? Again ... this doesn't seem to happen all the time but something like 50% of the time if I had to guess. It's really weird and odd behavior and feels like we either may just be getting lucky when it does work, or there is some cache issue at play, or something else.

It should also be noted we have, by default, about 10 or so outputs defined. I don't know if that matters one way or another but just putting it out there in case this is somehow related to load or too many outputs or whatever. For testing purposes we did trim the "default outputs" down to just 1 but that didn't seem to help at all.

Any help or pointers would be greatly appreciated as we would love to be able to go to production with just hitting the "reload endpoint" and not have to call k8s api to recycle our fluent-bit pods manually.

EDIT: something else worth noting ... whenever we hit the GET endpoint directly after a reload it's always empty. No json or anything.
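A small sketch of how that check could be scripted, assuming "the GET endpoint" means a GET on the reload path `/api/v2/reload`, which per the Fluent Bit documentation returns JSON with a `hot_reload_count` field. The helper names are hypothetical, and an empty body is reported as `None` so the case described here is easy to spot.

```python
import json
import urllib.request
from typing import Optional

# Assumption: the "GET endpoint" in the post is the documented reload path.
RELOAD_PATH = "/api/v2/reload"


def parse_reload_count(body: bytes) -> Optional[int]:
    """Return hot_reload_count from a response body, or None if the body is empty."""
    if not body.strip():
        return None  # the empty-response case described in the post
    return json.loads(body).get("hot_reload_count")


def fetch_reload_count(base_url: str, timeout: float = 5.0) -> Optional[int]:
    """GET the reload endpoint and return the parsed reload counter."""
    url = base_url.rstrip("/") + RELOAD_PATH
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_reload_count(response.read())
```

Comparing the value before and after a POST would show whether fluent-bit itself counted the reload, independent of whether the new output starts shipping logs.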

Thanks, Chris

cdancy commented 1 month ago

tagging @edsiper @PettitWesley as you all have helped us before and are very knowledgeable in this area.

stevehipwell commented 6 days ago

@patrick-stephens we're seeing this too.