Segmentation Fault (Exit Code 139) Causing Fluent Bit Pods to Restart in EKS

fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows

Apache License 2.0

Bug Report

Description

We are experiencing occasional restarts of Fluent Bit pods running as a DaemonSet in our EKS cluster. The pods restart with exit code 139 (segmentation fault). According to our Prometheus metrics, the restarts are caused neither by running out of memory nor by high CPU usage.
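For readers unfamiliar with the exit code: when a container's main process is killed by a signal, the reported exit code is conventionally 128 plus the signal number, so 139 corresponds to signal 11 (SIGSEGV), while an OOM kill would normally show up as 137 (SIGKILL). A minimal Python sketch of that decoding, assuming a Linux host; the helper name `describe_exit_code` is hypothetical and not part of Fluent Bit or Kubernetes:

```python
import signal


def describe_exit_code(code):
    """Return the signal name for a 128+N exit code, or None for a normal exit.

    Hypothetical helper for illustration only.
    """
    if code <= 128:
        return None  # process exited on its own, not via a signal
    signum = code - 128
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Signal number not known on this platform
        return f"signal {signum}"


if __name__ == "__main__":
    for code in (139, 137, 0):
        print(code, describe_exit_code(code))
```

On Linux, 139 decodes to SIGSEGV and 137 to SIGKILL, which matches the observation that these restarts are crashes rather than memory-limit kills.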

Logs

[2024/10/11 07:19:28] [engine] caught signal (SIGSEGV)
#0  0x562fdf2a5df9      in  flb_log_event_encoder_dynamic_field_flush_scopes() at src/flb_log_event_encoder_dyn0
#1  0x562fdf2a5df9      in  flb_log_event_encoder_dynamic_field_reset() at src/flb_log_event_encoder_dynamic_fi
#2  0x562fdf2a3d5c      in  flb_log_event_encoder_reset() at src/flb_log_event_encoder.c:33
#3  0x562fdf2d30cf      in  ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:418
#4  0x562fdf2d30cf      in  ml_flush_callback() at plugins/in_tail/tail_file.c:919
#5  0x562fdf288927      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1515
#6  0x562fdf289085      in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#7  0x562fdf2a6dcc      in  flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#8  0x562fdf2d385c      in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1249
#9  0x562fdf2cf5b5      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:242
#10 0x562fdf2588e4      in  flb_input_collector_fd() at src/flb_input.c:1949
#11 0x562fdf2726d7      in  flb_engine_handle_event() at src/flb_engine.c:575
#12 0x562fdf2726d7      in  flb_engine_start() at src/flb_engine.c:941
#13 0x562fdf24e1a3      in  flb_lib_worker() at src/flb_lib.c:674
#14 0x7f7f630f2ea6      in  ???() at ???:0
#15 0x7f7f629a6a6e      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0

Environment

Fluent Bit Version: version=3.0.6, commit=9af65e2c36
Note: we have already updated to version=3.1.9, commit=431fa79ae2, and we see the same issue.
Kubernetes Version: v1.29.0
EKS Version: v1.29.0-eks-680e576
Node Operating System: Bottlerocket OS 1.21.1 (aws-k8s-1.29) kernel 6.1.102
Container Runtime: containerd://1.7.20+bottlerocket
Node Configuration:
  CPU: 4 vCPU
  Memory: 8GB
  Instance Type: c6a.xlarge

Deployment in EKS

Fluent Bit is deployed as a DaemonSet in an EKS cluster. Resource limits and requests are set for memory and CPU.

resources:
  limits:
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 128Mi

Additional context

Attached are the log files and Fluent Bit configs: fleuntbitlog.txt, custom_parser.txt, fluent-bit.txt

[SERVICE]
    Daemon Off
    Flush 1
    Log_Level info
    Parsers_File /fluent-bit/etc/parsers.conf
    Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    # Exclude fluent-bit logs, certain error conditions can cause loops
    # that can effectively DoS outputs with very high logging rates
    # (see https://github.com/fluent/fluent-bit/issues/3829)
    Exclude_Path /var/log/containers/fluent-bit-*_kube-system_*.log
    multiline.parser docker, cri
    Tag kube.<namespace_name>.<pod_name>.<container_name>-<container_id>
    Mem_Buf_Limit 5MB
    Skip_Long_Lines On
    DB /var/log/flb_pods_tail.db
    Tag_Regex (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<container_id>[a-z0-9]{64})\.log$

[INPUT]
    Name tail
    Path /usr/share/reactshost/*/ReactsLogs/Metrics/*/*.json
    Tag reacts-metrics
    Parser reacts-metrics-parser
    Path_Key filename
    DB /usr/share/reactshost/fluentbit/logs.db

[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On
    Kube_Tag_Prefix kube.
    Regex_Parser kubePodCustom

[FILTER]
    Name rewrite_tag
    Match kube.*
    Rule $kubernetes['pod_id'] ^.*4.*$ cw.$TAG true
    Emitter_Name cw_re_emitted

[FILTER]
    Name grep
    Match cw.*
    Exclude $kubernetes['labels']['logging.cloudwatch.aws/enabled'] false

[FILTER]
    Name grep
    Match kube.*
    Exclude $kubernetes['namespace_name'] loki-system

[FILTER]
    Name modify
    Match kube.*
    Rename level level_label
    Rename instance instance_label

[FILTER]
    Name parser
    Match reacts-metrics
    Key_Name filename
    Parser filename-parser
    Reserve_Data On

[OUTPUT]
    Name loki
    Match kube.*
    Host loki-gateway.loki-system
    Port 80
    labels job=fluentbit, type=logs, namespace=$kubernetes['namespace_name'], component=$kubernetes['container_name'], level=$level_label, instance=$instance_label

[OUTPUT]
    Name loki
    Match reacts-metrics
    Host loki-gateway.loki-system
    Port 80
    Labels job=fluentbit, component=$component, instance=$instance, type=metrics
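To make the first tail input easier to follow, here is a rough Python sketch of how its Tag_Regex and Tag settings combine to turn a container log file name into a tag. This is an approximation, not Fluent Bit's implementation: Fluent Bit uses Onigmo syntax `(?<name>...)`, which is rewritten here to Python's `(?P<name>...)`. The function `compose_tag` and the sample file name are hypothetical, and what Fluent Bit does when the regex fails to match is not modeled.

```python
import re

# Tag_Regex from the [INPUT] section, with named groups in Python syntax.
TAG_REGEX = re.compile(
    r"(?P<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?"
    r"(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)"
    r"_(?P<namespace_name>[^_]+)"
    r"_(?P<container_name>.+)"
    r"-(?P<container_id>[a-z0-9]{64})\.log$"
)

# Tag template from the same [INPUT] section.
TAG_TEMPLATE = "kube.<namespace_name>.<pod_name>.<container_name>-<container_id>"


def compose_tag(filename):
    """Fill the <placeholders> in TAG_TEMPLATE from the regex captures.

    Returns None when the file name does not match (illustrative choice).
    """
    match = TAG_REGEX.search(filename)
    if match is None:
        return None
    return re.sub(r"<(\w+)>", lambda p: match.group(p.group(1)), TAG_TEMPLATE)


if __name__ == "__main__":
    container_id = "0123456789abcdef" * 4  # made-up 64-character id
    print(compose_tag("web-7d9f_default_app-" + container_id + ".log"))
```

The resulting tag starts with `kube.`, which is what the `Match kube.*` filters and the `Kube_Tag_Prefix kube.` setting rely on.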

Segmentation Fault (Exit Code 139) Causing Fluent Bit Pods to Restart in EKS #9544