jim-barber-he opened this issue 7 months ago
I am also experiencing this exact issue with Fluent Bit version 3.0.3, running on our self-hosted Kubernetes cluster (Kubernetes version: v1.28.3, Host: Ubuntu 22.04). Thanks!
Facing the same issue on aarch64 (ARMv8) nodes in a Kubernetes 1.25 environment since version 3 (also tried latest 3.0.4).
UPDATE: we are currently still on 2.2.3 because of this error. The error happens in 2.2.3 as well, but very rarely compared to v3, so perhaps a related library or framework used in the background changed in between, and this information may help narrow down which one.
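For anyone who needs to stay on 2.2.3 until this is resolved, a minimal values override for the chart hosted at https://fluent.github.io/helm-charts might look like the sketch below. This assumes the chart's standard image.tag value and is illustrative only, not necessarily how the commenters above pin their version:

image:
  tag: "2.2.3"   # pin the image to the last release reported stable in this thread

It would typically be applied with something like helm upgrade --install fluent-bit fluent/fluent-bit -f values.yaml after adding the chart repository.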
I have the same issue too. We were using 2.1.3 and saw many crashes in the past. I updated to the latest 3.1.6 and the crashes still happen.
From my observation, this issue happens frequently on nodes where pods are created frequently; nodes where pods are rarely killed or restarted have no issue. I have currently set a memory limit of 2 GB (see the sketch after this comment) and Grafana shows no out-of-memory events.
Now I'm trying 2.2.2 and 2.2.3 as suggested in the comments above.
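For reference, a limit like the 2 GB one mentioned above can be set through the chart's standard resources block; a minimal sketch, assuming the usual values.yaml layout of the fluent/helm-charts fluent-bit chart:

resources:
  limits:
    memory: 2Gi     # cap the pod's memory; the traces in this thread are SIGSEGV crashes, not OOM kills
  requests:
    memory: 256Mi   # illustrative request, tune to your nodes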
We are experiencing the same issue described in the comment above, with fluent-bit crashing on the nodes where pods are frequently created. We also tested with fluent-bit version 3.1.8 and still observed the issue.
The issue is still happening in Fluent Bit v3.1.9
[2024/10/07 18:35:23] [engine] caught signal (SIGSEGV)
#0 0x561cefcb8df9 in flb_log_event_encoder_dynamic_field_flush_scopes() at src/flb_log_event_encoder_dynamic_field.c:210
#1 0x561cefcb8df9 in flb_log_event_encoder_dynamic_field_reset() at src/flb_log_event_encoder_dynamic_field.c:240
#2 0x561cefcb6cfc in flb_log_event_encoder_reset() at src/flb_log_event_encoder.c:33
#3 0x561cefce7d7f in ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:418
#4 0x561cefce7d7f in ml_flush_callback() at plugins/in_tail/tail_file.c:919
#5 0x561cefc9a457 in flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1516
#6 0x561cefc9ab84 in flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#7 0x561cefcb9cec in flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#8 0x561cefce84c4 in flb_tail_file_remove() at plugins/in_tail/tail_file.c:1256
#9 0x561cefce43c6 in tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:242
#10 0x561cefc6a15a in flb_input_collector_fd() at src/flb_input.c:1970
#11 0x561cefc83d23 in flb_engine_handle_event() at src/flb_engine.c:575
#12 0x561cefc83d23 in flb_engine_start() at src/flb_engine.c:941
#13 0x561cefc5f163 in flb_lib_worker() at src/flb_lib.c:674
#14 0x7f8314fdf143 in ???() at ???:0
#15 0x7f831505f7db in ???() at ???:0
#16 0xffffffffffffffff in ???() at ???:0
[2024/10/07 13:33:06] [engine] caught signal (SIGSEGV)
#0 0x55f7afab6df9 in flb_log_event_encoder_dynamic_field_flush_scopes() at src/flb_log_event_encoder_dynamic_field.c:210
#1 0x55f7afab6df9 in flb_log_event_encoder_dynamic_field_reset() at src/flb_log_event_encoder_dynamic_field.c:240
#2 0x55f7afab4cfc in flb_log_event_encoder_reset() at src/flb_log_event_encoder.c:33
#3 0x55f7afae5d7f in ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:418
#4 0x55f7afae5d7f in ml_flush_callback() at plugins/in_tail/tail_file.c:919
#5 0x55f7afa98457 in flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1516
#6 0x55f7afa98b84 in flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#7 0x55f7afab7cec in flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#8 0x55f7afae64c4 in flb_tail_file_remove() at plugins/in_tail/tail_file.c:1256
#9 0x55f7afae23c6 in tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:242
#10 0x55f7afa6815a in flb_input_collector_fd() at src/flb_input.c:1970
#11 0x55f7afa81d23 in flb_engine_handle_event() at src/flb_engine.c:575
#12 0x55f7afa81d23 in flb_engine_start() at src/flb_engine.c:941
#13 0x55f7afa5d163 in flb_lib_worker() at src/flb_lib.c:674
#14 0x7f6a338c0143 in ???() at ???:0
#15 0x7f6a339407db in ???() at ???:0
#16 0xffffffffffffffff in ???() at ???:0
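Both traces point at the multiline flush that runs while in_tail removes a tailed file (tail_fs_event -> flb_tail_file_remove -> flb_ml_stream_id_destroy_all -> flb_log_event_encoder_reset), which fits the observation that the crash favours nodes with frequent pod churn and therefore frequent log-file removal. Purely for illustration, an input along the lines of the sketch below (standard chart config.inputs key, not the reporter's actual values) exercises that code path every time a container log file is rotated away or deleted:

config:
  inputs: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        multiline.parser  docker, cri    # a multiline parser routes events through the flb_ml flush path seen in the traces
        Skip_Long_Lines   On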
I can confirm that versions 2.2.2 and 2.2.3 are good.
Bug Report
Describe the bug
Every now and then we see a crash on one of our fluent-bit pods in our Kubernetes clusters. These crashes started when we upgraded from fluent-bit version 2.2.2 (which was rock solid) to version 3.0.2, and they still persist in version 3.0.3.
To Reproduce
The logs for the latest crash look like so:
The stack trace always looks the same and there are no errors logged beforehand.
I don't know how to reproduce the problem as it seems to happen at random times.
Expected behavior
No crashes.
Your Environment
Version: v3.0.3
We deploy fluent-bit via the helm chart hosted at https://fluent.github.io/helm-charts. The values we provide to it follow.
Kubernetes version v1.28.9
AWS EC2 instances.
Kubernetes is running on Ubuntu 20.04 hosts.
Should be covered by the configuration shown above.
Additional context
If you think the crashes could be caused by functions.lua, then I can also supply its contents.