Hey all,

We've been putting together a POC that uses the hot reload feature of fluent-bit. The general idea: a custom CRD gets installed, and a custom k8s operator of ours watches for that CRD, reads its values, and adds a file that defines a new output to the fluent-bit configmap. The operator then sends an HTTP request to fluent-bit so that it reloads itself and picks up the newly added configmap file. Our fluent-bit.conf file looks like so:
[SERVICE]
    Flush         60
    Log_Level     debug
    Parsers_File  /fluent-bit/etc/parsers.conf
    Parsers_File  /fluent-bit/etc/conf/custom_parsers.conf
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_Port     2020
    Health_Check  On
    Hot_Reload    On

@INCLUDE *_inputs.conf
@INCLUDE *_filters.conf
@INCLUDE *_outputs.conf
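To make the include rules concrete, here is a minimal Python sketch of which files in the config directory those three patterns would select. It uses Python's `fnmatch` as an approximation of fluent-bit's own glob handling (an assumption, the two are not guaranteed to agree on edge cases), and the directory listing is hypothetical.

```python
from fnmatch import fnmatch

# Include patterns copied from the fluent-bit.conf above.
INCLUDE_PATTERNS = ["*_inputs.conf", "*_filters.conf", "*_outputs.conf"]


def included_files(filenames):
    """Return the names that at least one @INCLUDE pattern would match.

    fnmatch only approximates fluent-bit's glob matching (assumption).
    """
    return [
        name
        for name in filenames
        if any(fnmatch(name, pattern) for pattern in INCLUDE_PATTERNS)
    ]


if __name__ == "__main__":
    # Hypothetical listing of the mounted configmap directory.
    listing = ["fluent-bit.conf", "custom_parsers.conf", "hello-world_outputs.conf"]
    print(included_files(listing))
```

A check like this only confirms the file name is eligible for inclusion; it says nothing about whether fluent-bit has actually loaded the file.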
When we go to add a new file to the configmap we'll name it something like hello-world_outputs.conf, then POST to the "reload endpoint", and expect those logs to start flowing...
50% of the time this works 100% of the time :)
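For anyone trying to reproduce this, the reload call amounts to roughly the following. This is a minimal sketch, not our operator's actual code: the `/api/v2/reload` path and the empty JSON body are taken from the fluent-bit hot reload documentation as I understand it, and the base URL is a placeholder that just reuses port 2020 from the config above.

```python
import json
import urllib.request


def build_reload_request(base_url="http://127.0.0.1:2020"):
    """Build (but do not send) the POST that asks fluent-bit to hot reload.

    The /api/v2/reload path is an assumption based on the fluent-bit docs.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v2/reload",
        data=json.dumps({}).encode("utf-8"),  # empty JSON object as the body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_reload(base_url="http://127.0.0.1:2020", timeout=5.0):
    """Send the reload request and return (HTTP status, raw response body)."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

Splitting the request construction from the network call keeps the first half checkable without a running fluent-bit instance.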
Whenever we hit the reload endpoint we always see the relevant logging about catching the SIGHUP signal and see fluent-bit recycle itself from within, so that part seems to work. What doesn't always work is fluent-bit actually sending logs to the new OUTPUT destination. We don't see anything in the logs to suggest things are amiss, even with debug enabled. Fluent-bit is clearly picking up the new output, as we see log lines related to its store_dir value (this is an S3 plugin), but logs never make it to the bucket. We've got upload_timeout set to 60s and total_file_size set to 5M. We've waited hours and still nothing starts streaming, even though we have test pods that are dumping loads of logs per second.
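To give a feel for the kind of file being added, here is a small Python sketch that renders an S3 `[OUTPUT]` section with the settings mentioned above. The option names (`Name`, `Match`, `bucket`, `region`, `store_dir`, `upload_timeout`, `total_file_size`) are S3 output plugin options; the bucket, region, match pattern, and store_dir path in the usage example are made up for illustration and are not our real values.

```python
def render_s3_output(match, bucket, region, store_dir,
                     upload_timeout="60s", total_file_size="5M"):
    """Render a classic-format fluent-bit [OUTPUT] section for the S3 plugin.

    Defaults mirror the upload_timeout and total_file_size values from the post.
    """
    options = [
        ("Name", "s3"),
        ("Match", match),
        ("bucket", bucket),
        ("region", region),
        ("store_dir", store_dir),
        ("upload_timeout", upload_timeout),
        ("total_file_size", total_file_size),
    ]
    lines = ["[OUTPUT]"]
    # One indented "key value" pair per line, keys padded for readability.
    lines += [f"    {key:<16}{value}" for key, value in options]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All argument values below are hypothetical placeholders.
    print(render_s3_output(
        match="kube.*",
        bucket="example-bucket",
        region="us-east-1",
        store_dir="/tmp/fluent-bit/s3/hello-world",
    ))
```

The rendered text is what would be stored under a key such as hello-world_outputs.conf in the configmap.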
This is somewhat reproducible for us: the very first time we add the new file to the existing configmap and hit the reload endpoint, things don't work. If we then delete the file from the configmap, hit the reload endpoint, add the file back to the configmap, and hit the reload endpoint again, things seem to start working, as odd as that may seem.
We tried the latest fluent-bit releases in both the 2.x and 3.x series and saw the same behavior.
What does work for us though is if we add the noted file to the existing configmap, then effectively do a rolling restart, fluent-bit will 100% of the time pick up the new file and start sending logs as expected.
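For reference, a rolling restart like this can be triggered through the k8s API by patching the pod template with a fresh `kubectl.kubernetes.io/restartedAt` annotation, which is what `kubectl rollout restart` does. The sketch below only builds the patch body; whether it gets applied to a DaemonSet or a Deployment, and with which client, is left open here and would depend on the setup.

```python
from datetime import datetime, timezone


def build_restart_patch(now=None):
    """Build a patch body that changes the pod template and so triggers a rollout.

    Mirrors the annotation that `kubectl rollout restart` sets.
    """
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": timestamp,
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    print(build_restart_patch())
```

This is exactly the manual step we would like to avoid by relying on the reload endpoint instead.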
Any ideas on what might be going on? Is this a potential bug? Anything else I can give you all to help diagnose the issue? Again ... this doesn't seem to happen all the time but something like 50% of the time if I had to guess. It's really weird and odd behavior and feels like we either may just be getting lucky when it does work, or there is some cache issue at play, or something else.
It should also be noted we have, by default, about 10 or so outputs defined. I don't know if that matters one way or another but just putting it out there in case this is somehow related to load or too many outputs or whatever. For testing purposes we did trim the "default outputs" down to just 1 but that didn't seem to help at all.
Any help or pointers would be greatly appreciated, as we would love to go to production by just hitting the "reload endpoint" rather than having to call the k8s API to recycle our fluent-bit pods manually.
EDIT: something else worth noting ... whenever we hit the GET endpoint directly after a reload, the response is always empty. No JSON or anything.
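A minimal sketch of how a caller could tell an empty response apart from a real one. It assumes the GET in question is the reload endpoint and that a healthy response is a JSON object with a `hot_reload_count` field, which is how I read the fluent-bit docs; both are assumptions, not something confirmed here.

```python
import json


def parse_reload_count(body):
    """Return hot_reload_count from a raw response body, or None.

    None covers the empty body described above as well as anything that is
    not a JSON object carrying an integer hot_reload_count (assumed field name).
    """
    text = body.decode("utf-8", errors="replace").strip()
    if not text:
        return None
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    count = payload.get("hot_reload_count")
    return count if isinstance(count, int) else None


if __name__ == "__main__":
    print(parse_reload_count(b""))                         # the empty case we see
    print(parse_reload_count(b'{"hot_reload_count": 3}'))  # hypothetical healthy case
```

Treating an empty body as "unknown" rather than "zero reloads" avoids drawing conclusions from a response that carries no information.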
Thanks, Chris