fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Segmentation Fault (Exit Code 139) Causing Fluent Bit Pods to Restart in EKS #9544

Open amchech opened 4 days ago

amchech commented 4 days ago

Bug Report

Description

We are experiencing occasional restarts of Fluent Bit pods running as a DaemonSet in our EKS cluster. The pods restart with exit code 139 (segmentation fault). According to our Prometheus metrics, the issue is not caused by running out of memory or by high CPU usage.
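For readers unfamiliar with the number: exit code 139 follows the common shell and container-runtime convention of 128 plus the signal number, and signal 11 is SIGSEGV on Linux. The sketch below decodes such codes; `decode_exit_code` is an illustrative helper name, not part of Fluent Bit or Kubernetes, and the signal names it prints depend on the platform it runs on.

```python
import signal


def decode_exit_code(code: int) -> str:
    """Describe an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name
            name = signal.Signals(signum).name
        except ValueError:
            return f"killed by unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {code}"


if __name__ == "__main__":
    for code in (139, 137, 0):
        print(code, "->", decode_exit_code(code))
```

By the same convention, an OOM kill would typically show up as 137 (128 + 9, SIGKILL) rather than 139, which is consistent with the metrics not pointing at memory exhaustion.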

Logs

[2024/10/11 07:19:28] [engine] caught signal (SIGSEGV)
#0  0x562fdf2a5df9      in  flb_log_event_encoder_dynamic_field_flush_scopes() at src/flb_log_event_encoder_dyn0
#1  0x562fdf2a5df9      in  flb_log_event_encoder_dynamic_field_reset() at src/flb_log_event_encoder_dynamic_fi
#2  0x562fdf2a3d5c      in  flb_log_event_encoder_reset() at src/flb_log_event_encoder.c:33
#3  0x562fdf2d30cf      in  ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:418
#4  0x562fdf2d30cf      in  ml_flush_callback() at plugins/in_tail/tail_file.c:919
#5  0x562fdf288927      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1515
#6  0x562fdf289085      in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#7  0x562fdf2a6dcc      in  flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#8  0x562fdf2d385c      in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1249
#9  0x562fdf2cf5b5      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:242
#10 0x562fdf2588e4      in  flb_input_collector_fd() at src/flb_input.c:1949
#11 0x562fdf2726d7      in  flb_engine_handle_event() at src/flb_engine.c:575
#12 0x562fdf2726d7      in  flb_engine_start() at src/flb_engine.c:941
#13 0x562fdf24e1a3      in  flb_lib_worker() at src/flb_lib.c:674
#14 0x7f7f630f2ea6      in  ???() at ???:0
#15 0x7f7f629a6a6e      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0

Environment

Fluent Bit Version: version=3.0.6, commit=9af65e2c36 (note: we have already updated to version=3.1.9, commit=431fa79ae2, and we see the same issue)
Kubernetes Version: v1.29.0
EKS Version: v1.29.0-eks-680e576
Node Operating System: Bottlerocket OS 1.21.1 (aws-k8s-1.29), kernel 6.1.102
Container Runtime: containerd://1.7.20+bottlerocket
Node Configuration: CPU: 4 vCPU, Memory: 8GB, Instance Type: c6a.xlarge

Deployment in EKS

Fluent Bit is deployed as a DaemonSet in an EKS cluster. Resource limits and requests are set for memory and CPU.

resources:
      limits:
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 128Mi

Additional context

Attached are the log files and Fluent Bit configs: fleuntbitlog.txt, custom_parser.txt, fluent-bit.txt

patrick-stephens commented 3 days ago

To help others:

[SERVICE]
    Daemon Off
    Flush 1
    Log_Level info
    Parsers_File /fluent-bit/etc/parsers.conf
    Parsers_File /fluent-bit/etc/conf/custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    # Exclude fluent-bit logs, certain error conditions can cause loops
    # that can effectively DoS outputs with very high logging rates
    # (see https://github.com/fluent/fluent-bit/issues/3829)
    Exclude_Path /var/log/containers/fluent-bit-*_kube-system_*.log
    multiline.parser docker, cri
    Tag kube.<namespace_name>.<pod_name>.<container_name>-<container_id>
    Mem_Buf_Limit 5MB
    Skip_Long_Lines On
    DB /var/log/flb_pods_tail.db
    Tag_Regex (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<container_id>[a-z0-9]{64})\.log$
[INPUT]
    Name     tail
    Path     /usr/share/reactshost/*/ReactsLogs/Metrics/*/*.json
    Tag      reacts-metrics
    Parser   reacts-metrics-parser
    Path_Key filename
    DB       /usr/share/reactshost/fluentbit/logs.db

[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On
    Kube_Tag_Prefix kube.
    Regex_Parser kubePodCustom
[FILTER]
    Name rewrite_tag
    Match kube.*
    Rule $kubernetes['pod_id'] ^.*4.*$ cw.$TAG true
    Emitter_Name cw_re_emitted
[FILTER]
    Name grep
    Match cw.*
    Exclude $kubernetes['labels']['logging.cloudwatch.aws/enabled'] false
[FILTER]
    Name grep
    Match kube.*
    Exclude $kubernetes['namespace_name'] loki-system
[FILTER]
    Name modify
    Match kube.*
    Rename level level_label
    Rename instance instance_label
[FILTER]
    Name         parser
    Match        reacts-metrics
    Key_Name     filename
    Parser       filename-parser
    Reserve_Data On

[OUTPUT]
    Name loki
    Match kube.*
    Host loki-gateway.loki-system
    Port 80
    labels job=fluentbit, type=logs, namespace=$kubernetes['namespace_name'], component=$kubernetes['container_name'], level=$level_label, instance=$instance_label
[OUTPUT]
    Name   loki
    Match  reacts-metrics
    Host   loki-gateway.loki-system
    Port   80
    Labels job=fluentbit, component=$component, instance=$instance, type=metrics
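
The first tail input builds its tag from named groups captured by Tag_Regex. As a rough illustration of what that produces, here is a Python sketch that applies the same pattern to a container log file name. It rests on some assumptions: Python's `re` module stands in for the regex engine Fluent Bit uses, so the named groups are rewritten from `(?<name>...)` to `(?P<name>...)`; `expand_tag` is an illustrative helper, not Fluent Bit code; and the file name is made up.

```python
import re

# Tag_Regex from the [INPUT] section, with named groups in Python syntax
TAG_REGEX = re.compile(
    r"(?P<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?"
    r"(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)"
    r"_(?P<namespace_name>[^_]+)"
    r"_(?P<container_name>.+)"
    r"-(?P<container_id>[a-z0-9]{64})\.log$"
)

# Tag template from the same [INPUT] section
TAG_TEMPLATE = "kube.<namespace_name>.<pod_name>.<container_name>-<container_id>"


def expand_tag(filename):
    """Return the expanded tag for a log file name, or None if it does not match."""
    match = TAG_REGEX.search(filename)
    if match is None:
        return None
    # Substitute each <group> placeholder with the captured value
    return re.sub(r"<(\w+)>", lambda p: match.group(p.group(1)), TAG_TEMPLATE)


if __name__ == "__main__":
    # Hypothetical file name: <pod>_<namespace>_<container>-<64-char id>.log
    container_id = "0123456789abcdef" * 4
    print(expand_tag(f"web-0_default_nginx-{container_id}.log"))
```

Every tag produced this way starts with `kube.`, which is what the `Match kube.*` filters and the first Loki output select on.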