fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0
5.86k stars · 1.58k forks

Error Encountered During Hot Reload in Fluent Bit 3.0.2 #8817

Open im-jinxinwang opened 6 months ago

im-jinxinwang commented 6 months ago

Bug Report

Describe the bug

When attempting to perform a hot reload with Fluent Bit version 3.0.2, the following error is encountered:

[2024/05/11 16:19:17] [error] [multiline] parser 'exception_test' not registered
[2024/05/11 16:19:17] [error] [input:tail:tail.0] could not load multiline parsers
[2024/05/11 16:19:17] [error] failed initialize input tail.0
[2024/05/11 16:19:17] [error] [engine] input initialization failed
[2024/05/11 16:19:18] [error] [reload] loaded configuration contains error(s). Reloading is aborted

fluent-bit.conf

[SERVICE]
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    38085
    Hot_Reload   On
    Flush        1
    Grace        5
    Daemon       Off
    Log_Level    info
    Parsers_File parsers.conf
[INPUT]
    Name             tail
    Tag              foundation-audit
    DB               /fluent-bit/log/tail-containers-state.db
    Read_from_Head   true
    Skip_Empty_Lines true
    Path             /tmp/test*
    Path_Key         pod_log_path
    multiline.parser exception_test
[OUTPUT]
    Name        kafka
    Match       *
    Brokers     192.168.123.50:9092
    Topics      testlog

parsers.conf

[MULTILINE_PARSER]
    name          exception_test
    type          regex
    flush_timeout 1000
    rule          "start_state"  "/(Dec \d+ \d+\:\d+\:\d+)(.*)/" "cont"
    rule          "cont" "/^\s+at.*/" "cont"
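
With the Hot_Reload and HTTP server settings in the [SERVICE] section above, a reload can be requested in either of the documented ways (port 38085 matches the config above; endpoint path per the Fluent Bit HTTP API docs):

```shell
# 1. Send SIGHUP to the running process
kill -HUP "$(pidof fluent-bit)"

# 2. Or call the hot-reload HTTP endpoint
curl -s -X POST http://127.0.0.1:38085/api/v2/reload
```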

To Reproduce

Expected behavior

Screenshots

(screenshots attached)

Your Environment

Additional context

lecaros commented 6 months ago

Hi @im-jinxinwang, is this reproducible for you when building from master?

I just tried it, and it works (minus the Kafka connection, but no issues hot-reloading).

fluent-bit -c 8817.conf
Fluent Bit v3.0.4
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

___________.__                        __    __________.__  __          ________  
\_   _____/|  |  __ __   ____   _____/  |_  \______   \__|/  |_  ___  _\_____  \ 
 |    __)  |  | |  |  \_/ __ \ /    \   __\  |    |  _/  \   __\ \  \/ / _(__  < 
 |     \   |  |_|  |  /\  ___/|   |  \  |    |    |   \  ||  |    \   / /       \
 \___  /   |____/____/  \___  >___|  /__|    |______  /__||__|     \_/ /______  /
     \/                     \/     \/               \/                        \/ 

[2024/05/11 17:29:25] [ info] [fluent bit] version=3.0.4, commit=ce7aafab40, pid=39607
[2024/05/11 17:29:25] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/05/11 17:29:25] [ info] [cmetrics] version=0.9.0
[2024/05/11 17:29:25] [ info] [ctraces ] version=0.5.1
[2024/05/11 17:29:25] [ info] [input:tail:tail.0] initializing
[2024/05/11 17:29:25] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/05/11 17:29:25] [ info] [input:tail:tail.0] multiline core started
[2024/05/11 17:29:25] [ info] [output:kafka:kafka.0] brokers='192.168.123.50:9092' topics='testlog'
[2024/05/11 17:29:25] [ info] [http_server] listen iface=0.0.0.0 tcp_port=38085
[2024/05/11 17:29:25] [ info] [sp] stream processor started
[2024/05/11 17:29:32] [engine] caught signal (SIGHUP)
[2024/05/11 17:29:32] [ info] reloading instance pid=39607 tid=0x2053d3ac0
[2024/05/11 17:29:32] [ info] [reload] stop everything of the old context
[2024/05/11 17:29:32] [ info] [input] pausing tail.0
[2024/05/11 17:29:33] [ info] [reload] start everything
[2024/05/11 17:29:33] [ info] [fluent bit] version=3.0.4, commit=ce7aafab40, pid=39607
[2024/05/11 17:29:33] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/05/11 17:29:33] [ info] [cmetrics] version=0.9.0
[2024/05/11 17:29:33] [ info] [ctraces ] version=0.5.1
[2024/05/11 17:29:33] [ info] [input:tail:tail.0] initializing
[2024/05/11 17:29:33] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/05/11 17:29:33] [ info] [input:tail:tail.0] multiline core started
[2024/05/11 17:29:33] [ info] [output:kafka:kafka.0] brokers='192.168.123.50:9092' topics='testlog'
[2024/05/11 17:29:33] [ info] [http_server] listen iface=0.0.0.0 tcp_port=38085
[2024/05/11 17:29:33] [ info] [sp] stream processor started
[2024/05/11 17:30:03] [ warn] [output:kafka:kafka.0] fluent-bit#producer-2: [thrd:192.168.123.50:9092/bootstrap]: 192.168.123.50:9092/bootstrap: Connection setup timed out in state CONNECT (after 30026ms in state CONNECT)
[2024/05/11 17:30:03] [error] [output:kafka:kafka.0] fluent-bit#producer-2: [thrd:192.168.123.50:9092/bootstrap]: 1/1 brokers are down
[2024/05/11 17:30:34] [ warn] [output:kafka:kafka.0] fluent-bit#producer-2: [thrd:192.168.123.50:9092/bootstrap]: 192.168.123.50:9092/bootstrap: Connection setup timed out in state CONNECT (after 30025ms in state CONNECT, 1 identical error(s) suppressed)
im-jinxinwang commented 6 months ago

Hi @lecaros, I see that the version you started is different from mine. I have the same problem in both 3.0.2 and 3.0.3: hot reloading is not possible.

im-jinxinwang commented 6 months ago

I compiled the latest code; 3.0.4 does not have this problem, but 3.0.4 has not been released yet.

hatmen commented 5 months ago

Hi @lecaros, when using the Kafka output and hot reloading, I also encountered the same problem, and verified that 2.2.2, 3.0.3, 3.0.4, and 3.0.5 all have similar issues.

fluent-bit.conf

[SERVICE]
    flush 10
    daemon Off
    log_level trace
    Hot_Reload   On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020
    HTTP_Server  On
    parsers_file /etc/fluent-bit/parsers.conf
    plugins_file /etc/fluent-bit/plugins.conf

[INPUT]
    Name    random
    Tag     test4
    Samples 10

[FILTER]
    Name  modify
    Match test4
    Add   flb_topic flb-baron-test

[OUTPUT]
    Name        kafka
    Match       *
    Brokers     192.168.7.10:9092,192.168.7.11:9092
    Topics      fluent-logs
    topic_key   flb_topic
    Dynamic_Topic on

Describe the bug

[2024/05/25 03:31:25] [ info] [sp] stream processor started
[2024/05/25 03:31:26] [trace] [filter:modify:modify.0 at /tmp/fluent-bit/plugins/filter_modify/modify.c:1426] Input map size 1 elements, output map size 2 elements
[2024/05/25 03:31:26] [debug] [input chunk] update output instances with new chunk size diff=67, records=1, input=random.0
[2024/05/25 03:31:26] [engine] caught signal (SIGHUP)
[2024/05/25 03:31:26] [ info] reloading instance pid=3163754 tid=0x7f4391275340
[2024/05/25 03:31:26] [ info] [reload] stop everything of the old context
[2024/05/25 03:31:26] [trace] [engine] flush enqueued data
[2024/05/25 03:31:26] [trace] [task 0x7f438f836c80] created (id=0)
[2024/05/25 03:31:26] [debug] [task] created task=0x7f438f836c80 id=0 OK
[2024/05/25 03:31:26] [ warn] [engine] service will shutdown when all remaining tasks are flushed
[2024/05/25 03:31:26] [ info] [input] pausing random.0
[2024/05/25 03:31:26] [debug] in produce_message
{"rand_value"=>10952961381925857816, "flb_topic"=>"flb-baron-test"}
[2024/05/25 03:31:26] [ info] [out_kafka] new topic added: flb-baron-test
[2024/05/25 03:31:26] [debug] [output:kafka:kafka.0] enqueued message (95 bytes) for topic 'flb-baron-test'
[2024/05/25 03:31:26] [trace] [engine] [task event] task_id=0 out_id=0 return=OK
[2024/05/25 03:31:26] [debug] [out flush] cb_destroy coro_id=0
[2024/05/25 03:31:26] [trace] [coro] destroy coroutine=0x7f438f80bc38 data=0x7f438f80bc50
[2024/05/25 03:31:26] [debug] [task] destroy task=0x7f438f836c80 (task_id=0)
[2024/05/25 03:31:27] [ info] [engine] service has stopped (0 pending tasks)
[2024/05/25 03:31:27] [ info] [input] pausing random.0
[2024/05/25 03:31:27] [debug] [output:kafka:kafka.0] message delivered (95 bytes, partition 0)
[2024/05/25 03:31:27] [engine] caught signal (SIGSEGV)
#0  0x4913b6            in  ???() at ???:0
#1  0x526459            in  ???() at ???:0
#2  0x527732            in  ???() at ???:0
#3  0x513cbb            in  ???() at ???:0
#4  0x513d3d            in  ???() at ???:0
#5  0x54d52a            in  ???() at ???:0
#6  0x54f2ef            in  ???() at ???:0
#7  0x54f403            in  ???() at ???:0
#8  0x57763d            in  ???() at ???:0
#9  0x577270            in  ???() at ???:0
#10 0x51866b            in  ???() at ???:0
#11 0x7f439109f7f1      in  ???() at ???:0
#12 0x7f439103f44f      in  ???() at ???:0
#13 0xffffffffffffffff  in  ???() at ???:0
Aborted (core dumped)
im-jinxinwang commented 5 months ago

I still can't understand it. When I compile it myself there is no problem, but the problem occurs when using the official Docker image. (screenshots attached)

im-jinxinwang commented 5 months ago

The Fluent Bit compiled on Ubuntu functions perfectly.

root@root:/home/ermao/fluent-bit/build/bin# ./fluent-bit --version
Fluent Bit v3.0.4
Git commit: ce7aafab40daa2c9f86d4f43d9ce67ff94d57d50
root@root:/home/ermao/fluent-bit/build/bin# ./fluent-bit -c fluent-bit.conf 
Fluent Bit v3.0.4
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

___________.__                        __    __________.__  __          ________  
\_   _____/|  |  __ __   ____   _____/  |_  \______   \__|/  |_  ___  _\_____  \ 
 |    __)  |  | |  |  \_/ __ \ /    \   __\  |    |  _/  \   __\ \  \/ / _(__  < 
 |     \   |  |_|  |  /\  ___/|   |  \  |    |    |   \  ||  |    \   / /       \
 \___  /   |____/____/  \___  >___|  /__|    |______  /__||__|     \_/ /______  /
     \/                     \/     \/               \/                        \/ 

[2024/05/28 03:26:22] [ info] Configuration:
[2024/05/28 03:26:22] [ info]  flush time     | 1.000000 seconds
[2024/05/28 03:26:22] [ info]  grace          | 5 seconds
[2024/05/28 03:26:22] [ info]  daemon         | 0
[2024/05/28 03:26:22] [ info] ___________
[2024/05/28 03:26:22] [ info]  inputs:
[2024/05/28 03:26:22] [ info]      tail
[2024/05/28 03:26:22] [ info] ___________
[2024/05/28 03:26:22] [ info]  filters:
[2024/05/28 03:26:22] [ info] ___________
[2024/05/28 03:26:22] [ info]  outputs:
[2024/05/28 03:26:22] [ info]      kafka.0
[2024/05/28 03:26:22] [ info] ___________
[2024/05/28 03:26:22] [ info]  collectors:
[2024/05/28 03:26:22] [ info] [fluent bit] version=3.0.4, commit=ce7aafab40, pid=973
[2024/05/28 03:26:22] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2024/05/28 03:26:22] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/05/28 03:26:22] [ info] [cmetrics] version=0.9.0
[2024/05/28 03:26:22] [ info] [ctraces ] version=0.5.1
[2024/05/28 03:26:22] [ info] [input:tail:tail.0] initializing
[2024/05/28 03:26:22] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/05/28 03:26:22] [debug] [tail:tail.0] created event channels: read=21 write=22
[2024/05/28 03:26:22] [ info] [input:tail:tail.0] multiline core started
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] inotify watch fd=31
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] scanning path /tmp/test*
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] cannot read info from: /tmp/test*
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] 0 new files found on path '/tmp/test*'
[2024/05/28 03:26:22] [ info] [input:tail:tail.0] db: delete unmonitored stale inodes from the database: count=1
[2024/05/28 03:26:22] [debug] [kafka:kafka.0] created event channels: read=32 write=33
[2024/05/28 03:26:22] [ info] [output:kafka:kafka.0] brokers='192.168.123.50:9092' topics='testlog'
[2024/05/28 03:26:22] [ info] [http_server] listen iface=0.0.0.0 tcp_port=38085
[2024/05/28 03:26:22] [ info] [sp] stream processor started
[2024/05/28 03:26:22] [debug] [input:tail:tail.0] [static files] processed 0b, done

[2024/05/28 03:26:48] [engine] caught signal (SIGHUP)
[2024/05/28 03:26:48] [ info] reloading instance pid=973 tid=0x7f3c406faec0
[2024/05/28 03:26:48] [ info] [reload] stop everything of the old context
[2024/05/28 03:26:48] [ warn] [engine] service will shutdown when all remaining tasks are flushed
[2024/05/28 03:26:48] [ info] [input] pausing tail.0
[2024/05/28 03:26:49] [ info] [engine] service has stopped (0 pending tasks)
[2024/05/28 03:26:49] [ info] [input] pausing tail.0
[2024/05/28 03:26:49] [ info] [reload] start everything
[2024/05/28 03:26:49] [ info] [fluent bit] version=3.0.4, commit=ce7aafab40, pid=973
[2024/05/28 03:26:49] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2024/05/28 03:26:49] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/05/28 03:26:49] [ info] [cmetrics] version=0.9.0
[2024/05/28 03:26:49] [ info] [ctraces ] version=0.5.1
[2024/05/28 03:26:49] [ info] [input:tail:tail.0] initializing
[2024/05/28 03:26:49] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2024/05/28 03:26:49] [debug] [tail:tail.0] created event channels: read=16 write=17
[2024/05/28 03:26:49] [ info] [input:tail:tail.0] multiline core started
[2024/05/28 03:26:49] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2024/05/28 03:26:49] [debug] [input:tail:tail.0] inotify watch fd=26
[2024/05/28 03:26:49] [debug] [input:tail:tail.0] scanning path /tmp/test*
[2024/05/28 03:26:49] [debug] [input:tail:tail.0] cannot read info from: /tmp/test*
[2024/05/28 03:26:49] [debug] [input:tail:tail.0] 0 new files found on path '/tmp/test*'
[2024/05/28 03:26:49] [ info] [input:tail:tail.0] db: delete unmonitored stale inodes from the database: count=0
[2024/05/28 03:26:49] [debug] [kafka:kafka.0] created event channels: read=27 write=28
[2024/05/28 03:26:49] [ info] [output:kafka:kafka.0] brokers='192.168.123.50:9092' topics='testlog'
[2024/05/28 03:26:49] [ info] [http_server] listen iface=0.0.0.0 tcp_port=38085
[2024/05/28 03:26:49] [ info] [sp] stream processor started
iblexisnexis commented 4 months ago

Encountered some more weirdness after a fresh cluster and retesting:

Seems the reloader fails now...

2024/07/03 09:33:13 config map updated
2024/07/03 09:33:13 performing webhook request (1/1)
2024/07/03 09:33:13 error: Received response code 400, expected 200
2024/07/03 09:33:23 error: Webhook reload retries exhausted
2024/07/03 09:50:03 config map updated
2024/07/03 09:50:03 performing webhook request (1/1)
2024/07/03 09:50:03 error: Received response code 400, expected 200
2024/07/03 09:50:13 error: Webhook reload retries exhausted

First time it does that... this was with me editing the ConfigMap output host to test.

iblexisnexis commented 4 months ago

After trying the webhook from inside the container:

{"reload":"in progress","status":-2}

We got that 3-4 times already, and it does not reload.
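
If it helps anyone debugging this: per the Fluent Bit docs, the same endpoint can also be queried with GET to check whether any reload has ever completed; the response carries a hot_reload_count that should increment on success (port 2020 matches the config earlier in this thread):

```shell
# check how many hot reloads have actually completed
curl -s http://127.0.0.1:2020/api/v2/reload
# documented response shape: {"hot_reload_count": N}
```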

So it seems there might be a bug in the Fluent Bit hot-reload function. There is already a bug report open (https://github.com/fluent/fluent-bit/issues/8918) that seems to confirm hot reload is not working well and fails randomly.

I don't think we can do anything on our end until they actually fix it in Fluent Bit...

The only workaround is to set recreate_pods in the Helm release to recycle the pods. It is, after all, a rolling upgrade; it should not disrupt anything, and it only happens when there is an upgrade of the release due to a config change, etc.

chrono2002 commented 4 months ago

3.0.7 also has this issue.

chrono2002 commented 4 months ago

OK, got it. You should read the logs carefully; the reason is somewhere up there :)

im-jinxinwang commented 4 months ago

I think I've found the key to the problem: Parsers_File must be written with an absolute path. If you use a relative path, hot reloading fails.
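
A minimal sketch of the workaround (the absolute path below is illustrative; adjust it to wherever parsers.conf actually lives):

```
[SERVICE]
    Hot_Reload   On
    HTTP_Server  On
    # fails on hot reload (relative path):
    # Parsers_File parsers.conf
    # works (absolute path):
    Parsers_File /fluent-bit/etc/parsers.conf
```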

cdancy commented 4 months ago

@im-jinxinwang that's ... interesting. That should be an easy enough fix from within the code base, I would assume?

EDIT: @lecaros @edsiper @PettitWesley is this something you all can look into, or do you know who best can take it on? We've resorted to falling back to triggering a rolling restart, as the hot-reload functionality simply does not work with relative paths.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

cdancy commented 1 month ago

This is a real problem for us, as we have to rely on a rolling restart instead of hot reloading. Can someone from the Fluent Bit community please look into why hot reload does not work with dynamically updated ConfigMaps?