im-jinxinwang opened this issue 6 months ago
Hi @im-jinxinwang,
is this reproducible for you when building from master?
I just tried it, and it works (minus the Kafka connection, but that's not a hot-reloading issue).
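For reference, the reload in the log below is triggered by sending SIGHUP to the running process; a minimal sketch:

# send SIGHUP to trigger the hot reload seen in the log below
kill -SIGHUP $(pidof fluent-bit)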
fluent-bit -c 8817.conf
Fluent Bit v3.0.4
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
___________.__ __ __________.__ __ ________
\_ _____/| | __ __ ____ _____/ |_ \______ \__|/ |_ ___ _\_____ \
| __) | | | | \_/ __ \ / \ __\ | | _/ \ __\ \ \/ / _(__ <
| \ | |_| | /\ ___/| | \ | | | \ || | \ / / \
\___ / |____/____/ \___ >___| /__| |______ /__||__| \_/ /______ /
\/ \/ \/ \/ \/
[2024/05/11 17:29:25] [ info] [fluent bit] version=3.0.4, commit=ce7aafab40, pid=39607
[2024/05/11 17:29:25] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/05/11 17:29:25] [ info] [cmetrics] version=0.9.0
[2024/05/11 17:29:25] [ info] [ctraces ] version=0.5.1
[2024/05/11 17:29:25] [ info] [input:tail:tail.0] initializing
[2024/05/11 17:29:25] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/05/11 17:29:25] [ info] [input:tail:tail.0] multiline core started
[2024/05/11 17:29:25] [ info] [output:kafka:kafka.0] brokers='192.168.123.50:9092' topics='testlog'
[2024/05/11 17:29:25] [ info] [http_server] listen iface=0.0.0.0 tcp_port=38085
[2024/05/11 17:29:25] [ info] [sp] stream processor started
[2024/05/11 17:29:32] [engine] caught signal (SIGHUP)
[2024/05/11 17:29:32] [ info] reloading instance pid=39607 tid=0x2053d3ac0
[2024/05/11 17:29:32] [ info] [reload] stop everything of the old context
[2024/05/11 17:29:32] [ info] [input] pausing tail.0
[2024/05/11 17:29:33] [ info] [reload] start everything
[2024/05/11 17:29:33] [ info] [fluent bit] version=3.0.4, commit=ce7aafab40, pid=39607
[2024/05/11 17:29:33] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/05/11 17:29:33] [ info] [cmetrics] version=0.9.0
[2024/05/11 17:29:33] [ info] [ctraces ] version=0.5.1
[2024/05/11 17:29:33] [ info] [input:tail:tail.0] initializing
[2024/05/11 17:29:33] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/05/11 17:29:33] [ info] [input:tail:tail.0] multiline core started
[2024/05/11 17:29:33] [ info] [output:kafka:kafka.0] brokers='192.168.123.50:9092' topics='testlog'
[2024/05/11 17:29:33] [ info] [http_server] listen iface=0.0.0.0 tcp_port=38085
[2024/05/11 17:29:33] [ info] [sp] stream processor started
[2024/05/11 17:30:03] [ warn] [output:kafka:kafka.0] fluent-bit#producer-2: [thrd:192.168.123.50:9092/bootstrap]: 192.168.123.50:9092/bootstrap: Connection setup timed out in state CONNECT (after 30026ms in state CONNECT)
[2024/05/11 17:30:03] [error] [output:kafka:kafka.0] fluent-bit#producer-2: [thrd:192.168.123.50:9092/bootstrap]: 1/1 brokers are down
[2024/05/11 17:30:34] [ warn] [output:kafka:kafka.0] fluent-bit#producer-2: [thrd:192.168.123.50:9092/bootstrap]: 192.168.123.50:9092/bootstrap: Connection setup timed out in state CONNECT (after 30025ms in state CONNECT, 1 identical error(s) suppressed)
Hi @lecaros, I see that the version you started is different from mine. I have the same problem in both 3.0.2 and 3.0.3: hot reloading is not possible.
I compiled the latest code, and 3.0.4 does not have this problem, but 3.0.4 has not been released yet.
Hi @lecaros, when using the Kafka output with hot reloading I encountered the same problem, and I verified that 2.2.2, 3.0.3, 3.0.4, and 3.0.5 all have similar issues.
fluent-bit.conf
[SERVICE]
flush 10
daemon Off
log_level trace
Hot_Reload On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
HTTP_Server On
parsers_file /etc/fluent-bit/parsers.conf
plugins_file /etc/fluent-bit/plugins.conf
[INPUT]
Name random
Tag test4
Samples 10
[FILTER]
Name modify
Match test4
Add flb_topic flb-baron-test
[OUTPUT]
Name kafka
Match *
Brokers 192.168.7.10:9092,192.168.7.11:9092
Topics fluent-logs
topic_key flb_topic
Dynamic_Topic on
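For what it's worth, the crash below reproduces with nothing more than starting Fluent Bit on this config and hot-reloading while a record is in flight; a sketch, assuming the config above is saved as /etc/fluent-bit/fluent-bit.conf:

# start Fluent Bit with the config above in the background
fluent-bit -c /etc/fluent-bit/fluent-bit.conf &

# let the random input enqueue at least one record, then hot-reload;
# judging by the log below, the Kafka delivery callback fires after the
# old context has already stopped, which is where the SIGSEGV appears
sleep 2
kill -SIGHUP $!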
Describe the bug
[2024/05/25 03:31:25] [ info] [sp] stream processor started
[2024/05/25 03:31:26] [trace] [filter:modify:modify.0 at /tmp/fluent-bit/plugins/filter_modify/modify.c:1426] Input map size 1 elements, output map size 2 elements
[2024/05/25 03:31:26] [debug] [input chunk] update output instances with new chunk size diff=67, records=1, input=random.0
[2024/05/25 03:31:26] [engine] caught signal (SIGHUP)
[2024/05/25 03:31:26] [ info] reloading instance pid=3163754 tid=0x7f4391275340
[2024/05/25 03:31:26] [ info] [reload] stop everything of the old context
[2024/05/25 03:31:26] [trace] [engine] flush enqueued data
[2024/05/25 03:31:26] [trace] [task 0x7f438f836c80] created (id=0)
[2024/05/25 03:31:26] [debug] [task] created task=0x7f438f836c80 id=0 OK
[2024/05/25 03:31:26] [ warn] [engine] service will shutdown when all remaining tasks are flushed
[2024/05/25 03:31:26] [ info] [input] pausing random.0
{"[2024/05/25 03:31:26] [debug] in produce_message
rand_value"=>10952961381925857816, "flb_topic"=>"flb-baron-test"}[2024/05/25 03:31:26] [ info] [out_kafka] new topic added: flb-baron-test
[2024/05/25 03:31:26] [debug] [output:kafka:kafka.0] enqueued message (95 bytes) for topic 'flb-baron-test'
[2024/05/25 03:31:26] [trace] [engine] [task event] task_id=0 out_id=0 return=OK
[2024/05/25 03:31:26] [debug] [out flush] cb_destroy coro_id=0
[2024/05/25 03:31:26] [trace] [coro] destroy coroutine=0x7f438f80bc38 data=0x7f438f80bc50
[2024/05/25 03:31:26] [debug] [task] destroy task=0x7f438f836c80 (task_id=0)
[2024/05/25 03:31:27] [ info] [engine] service has stopped (0 pending tasks)
[2024/05/25 03:31:27] [ info] [input] pausing random.0
[2024/05/25 03:31:27] [debug] [output:kafka:kafka.0] message delivered (95 bytes, partition 0)
[2024/05/25 03:31:27] [engine] caught signal (SIGSEGV)
#0 0x4913b6 in ???() at ???:0
#1 0x526459 in ???() at ???:0
#2 0x527732 in ???() at ???:0
#3 0x513cbb in ???() at ???:0
#4 0x513d3d in ???() at ???:0
#5 0x54d52a in ???() at ???:0
#6 0x54f2ef in ???() at ???:0
#7 0x54f403 in ???() at ???:0
#8 0x57763d in ???() at ???:0
#9 0x577270 in ???() at ???:0
#10 0x51866b in ???() at ???:0
#11 0x7f439109f7f1 in ???() at ???:0
#12 0x7f439103f44f in ???() at ???:0
#13 0xffffffffffffffff in ???() at ???:0
Aborted (core dumped)
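All the backtrace frames show ??? because the binary carries no symbol information. For anyone who wants a readable trace, a sketch assuming a non-stripped build of the same commit (the addresses are the ones from the trace above):

# map raw addresses to function/file names (non-stripped, non-PIE build assumed)
addr2line -f -C -e ./bin/fluent-bit 0x4913b6 0x526459 0x527732

# or reproduce under gdb for a symbolized backtrace; gdb intercepts SIGHUP
# by default, so pass it through to the process first
gdb --args ./bin/fluent-bit -c fluent-bit.conf
(gdb) handle SIGHUP nostop noprint pass
(gdb) run
# ... trigger the reload from another shell, then after the SIGSEGV:
(gdb) bt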
I still can't understand it. When I compile it myself there is no problem, but the problem does occur when using their Docker image.
The Fluent Bit binary compiled on Ubuntu functions perfectly.
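For reference, a sketch of the standard CMake source build used here (the commit hash matches the --version output below):

git clone https://github.com/fluent/fluent-bit.git
cd fluent-bit
git checkout ce7aafab40daa2c9f86d4f43d9ce67ff94d57d50
cd build
cmake ..
make
# the binary lands in ./bin/fluent-bit, as in the session below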
root@root:/home/ermao/fluent-bit/build/bin# ./fluent-bit --version
Fluent Bit v3.0.4
Git commit: ce7aafab40daa2c9f86d4f43d9ce67ff94d57d50
root@root:/home/ermao/fluent-bit/build/bin# ./fluent-bit -c fluent-bit.conf
Fluent Bit v3.0.4
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
___________.__ __ __________.__ __ ________
\_ _____/| | __ __ ____ _____/ |_ \______ \__|/ |_ ___ _\_____ \
| __) | | | | \_/ __ \ / \ __\ | | _/ \ __\ \ \/ / _(__ <
| \ | |_| | /\ ___/| | \ | | | \ || | \ / / \
\___ / |____/____/ \___ >___| /__| |______ /__||__| \_/ /______ /
\/ \/ \/ \/ \/
[2024/05/28 03:26:22] [ info] Configuration:
[2024/05/28 03:26:22] [ info] flush time | 1.000000 seconds
[2024/05/28 03:26:22] [ info] grace | 5 seconds
[2024/05/28 03:26:22] [ info] daemon | 0
[2024/05/28 03:26:22] [ info] ___________
[2024/05/28 03:26:22] [ info] inputs:
[2024/05/28 03:26:22] [ info] tail
[2024/05/28 03:26:22] [ info] ___________
[2024/05/28 03:26:22] [ info] filters:
[2024/05/28 03:26:22] [ info] ___________
[2024/05/28 03:26:22] [ info] outputs:
[2024/05/28 03:26:22] [ info] kafka.0
[2024/05/28 03:26:22] [ info] ___________
[2024/05/28 03:26:22] [ info] collectors:
[2024/05/28 03:26:22] [ info] [fluent bit] version=3.0.4, commit=ce7aafab40, pid=973
[2024/05/28 03:26:22] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2024/05/28 03:26:22] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/05/28 03:26:22] [ info] [cmetrics] version=0.9.0
[2024/05/28 03:26:22] [ info] [ctraces ] version=0.5.1
[2024/05/28 03:26:22] [ info] [input:tail:tail.0] initializing
[2024/05/28 03:26:22] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/05/28 03:26:22] [debug] [tail:tail.0] created event channels: read=21 write=22
[2024/05/28 03:26:22] [ info] [input:tail:tail.0] multiline core started
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] inotify watch fd=31
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] scanning path /tmp/test*
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] cannot read info from: /tmp/test*
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] 0 new files found on path '/tmp/test*'
[2024/05/28 03:26:22] [ info] [input:tail:tail.0] db: delete unmonitored stale inodes from the database: count=1
[2024/05/28 03:26:22] [debug] [kafka:kafka.0] created event channels: read=32 write=33
[2024/05/28 03:26:22] [ info] [output:kafka:kafka.0] brokers='192.168.123.50:9092' topics='testlog'
[2024/05/28 03:26:22] [ info] [http_server] listen iface=0.0.0.0 tcp_port=38085
[2024/05/28 03:26:22] [ info] [sp] stream processor started
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2024/05/28 03:26:48] [engine] caught signal (SIGHUP)
[2024/05/28 03:26:48] [ info] reloading instance pid=973 tid=0x7f3c406faec0
[2024/05/28 03:26:48] [ info] [reload] stop everything of the old context
[2024/05/28 03:26:48] [ warn] [engine] service will shutdown when all remaining tasks are flushed
[2024/05/28 03:26:48] [ info] [input] pausing tail.0
[2024/05/28 03:26:49] [ info] [engine] service has stopped (0 pending tasks)
[2024/05/28 03:26:49] [ info] [input] pausing tail.0
[2024/05/28 03:26:49] [ info] [reload] start everything
[2024/05/28 03:26:49] [ info] [fluent bit] version=3.0.4, commit=ce7aafab40, pid=973
[2024/05/28 03:26:49] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2024/05/28 03:26:49] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/05/28 03:26:49] [ info] [cmetrics] version=0.9.0
[2024/05/28 03:26:49] [ info] [ctraces ] version=0.5.1
[2024/05/28 03:26:49] [ info] [input:tail:tail.0] initializing
[2024/05/28 03:26:49] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/05/28 03:26:49] [debug] [tail:tail.0] created event channels: read=16 write=17
[2024/05/28 03:26:49] [ info] [input:tail:tail.0] multiline core started
[2024/05/28 03:26:49] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2024/05/28 03:26:49] [debug] [input:tail:tail.0] inotify watch fd=26
[2024/05/28 03:26:49] [debug] [input:tail:tail.0] scanning path /tmp/test*
[2024/05/28 03:26:49] [debug] [input:tail:tail.0] cannot read info from: /tmp/test*
[2024/05/28 03:26:49] [debug] [input:tail:tail.0] 0 new files found on path '/tmp/test*'
[2024/05/28 03:26:49] [ info] [input:tail:tail.0] db: delete unmonitored stale inodes from the database: count=0
[2024/05/28 03:26:49] [debug] [kafka:kafka.0] created event channels: read=27 write=28
[2024/05/28 03:26:49] [ info] [output:kafka:kafka.0] brokers='192.168.123.50:9092' topics='testlog'
[2024/05/28 03:26:49] [ info] [http_server] listen iface=0.0.0.0 tcp_port=38085
[2024/05/28 03:26:49] [ info] [sp] stream processor started
Encountered some more weirdness after a fresh cluster and retesting:
seems the reloader fails now...
2024/07/03 09:33:13 config map updated
2024/07/03 09:33:13 performing webhook request (1/1)
2024/07/03 09:33:13 error: Received response code 400 , expected 200
2024/07/03 09:33:23 error: Webhook reload retries exhausted
2024/07/03 09:50:03 config map updated
2024/07/03 09:50:03 performing webhook request (1/1)
2024/07/03 09:50:03 error: Received response code 400 , expected 200
2024/07/03 09:50:13 error: Webhook reload retries exhausted
First time it does that... with me editing the configmap output host to test.
After trying the webhook from inside the container:
{"reload":"in progress","status":-2}
We got that 3-4 times already, and it does not reload.
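For context, that webhook is Fluent Bit's own reload endpoint, which can also be poked by hand from inside the container; a sketch, assuming the default HTTP port 2020:

# POST triggers a reload; a second POST while one is still running
# returns {"reload":"in progress","status":-2} like the response above
curl -X POST http://127.0.0.1:2020/api/v2/reload

# GET reports how many reloads have completed so far
curl http://127.0.0.1:2020/api/v2/reload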
So it seems there might be some bug with the Fluent Bit hot-reload function. There is already a bug report open, https://github.com/fluent/fluent-bit/issues/8918, which seems to confirm that hot reload is not working well and fails randomly.
I don't think we can do anything on our end until they actually fix it in Fluent Bit...
The only workaround is to set recreate_pods in the Helm release to recycle the pods. It is, after all, a rolling upgrade, so it should not disrupt anything, and it only kicks in when there is an upgrade of the release due to a config change, etc.
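If your tooling does not expose recreate_pods, a manual rolling restart achieves the same; a sketch, assuming the chart's default DaemonSet name and namespace (adjust to your deployment):

# force a rolling restart of the Fluent Bit pods after a config change
kubectl -n logging rollout restart daemonset/fluent-bit
kubectl -n logging rollout status daemonset/fluent-bit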
3.0.7 also has this problem.
Ok, got it. You should carefully read the logs; the reason is somewhere up there :)
I think I've found the key to the problem: Parsers_File must be written with an absolute path. If you use a relative path, hot reloading fails.
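In config terms, the difference looks like this (paths illustrative; the absolute form is the one that survives a reload):

# hot reload fails with a relative path:
[SERVICE]
    Hot_Reload    On
    parsers_file  parsers.conf

# hot reload works with an absolute path:
[SERVICE]
    Hot_Reload    On
    parsers_file  /etc/fluent-bit/parsers.conf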
@im-jinxinwang that's ... interesting. That should be an easy enough fix from within the code base, I would assume?
EDIT: @lecaros @edsiper @PettitWesley is this something you all can look into, or do you know who can best take this on? We've resorted to falling back to triggering a rolling restart, as the hot-reload functionality simply does not work with relative paths.
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This is a real problem for us, as we have to rely on doing a rolling restart instead of hot reloading. Can someone from the Fluent Bit community please look into why hot reload does not work with dynamic ConfigMaps?
Bug Report
Describe the bug
When attempting to perform a hot reload with Fluent Bit version 3.0.2, the following error is encountered:
fluent-bit.conf
parsers.conf