fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0
5.9k stars 1.59k forks source link

fluent-bit v3 occassionally crashes with SIGSEGV #8779

Open jim-barber-he opened 7 months ago

jim-barber-he commented 7 months ago

Bug Report

Describe the bug

Every now and then we see a crash on one of our fluent-bit pods in our Kubernetes clusters. These crashes started since we upgraded from fluent-bit version version 2.2.2 (which was rock stable) to version 3.0.2 and still persisting in version 3.0.3.

To Reproduce

The logs for the lastest crash look like so:

$ kubectl logs -p fluent-bit-5fj89                                   
Fluent Bit v3.0.3
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

___________.__                        __    __________.__  __          ________  
\_   _____/|  |  __ __   ____   _____/  |_  \______   \__|/  |_  ___  _\_____  \ 
 |    __)  |  | |  |  \_/ __ \ /    \   __\  |    |  _/  \   __\ \  \/ / _(__  < 
 |     \   |  |_|  |  /\  ___/|   |  \  |    |    |   \  ||  |    \   / /       \
 \___  /   |____/____/  \___  >___|  /__|    |______  /__||__|     \_/ /______  /
     \/                     \/     \/               \/                        \/ 

[2024/05/01 01:00:55] [engine] caught signal (SIGSEGV)
#0  0x55b2289a4c69      in  flb_log_event_encoder_dynamic_field_flush_scopes() at src/flb_log_event_encoder_dynamic_field.c:210
#1  0x55b2289a4c69      in  flb_log_event_encoder_dynamic_field_reset() at src/flb_log_event_encoder_dynamic_field.c:240
#2  0x55b2289a2bcc      in  flb_log_event_encoder_reset() at src/flb_log_event_encoder.c:33
#3  0x55b2289d1f3f      in  ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:418
#4  0x55b2289d1f3f      in  ml_flush_callback() at plugins/in_tail/tail_file.c:919
#5  0x55b228987797      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1515
#6  0x55b228987ef5      in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#7  0x55b2289a5c3c      in  flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#8  0x55b2289d26cc      in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1249
#9  0x55b2289ce425      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:242
#10 0x55b2289578d4      in  flb_input_collector_fd() at src/flb_input.c:1949
#11 0x55b228971547      in  flb_engine_handle_event() at src/flb_engine.c:575
#12 0x55b228971547      in  flb_engine_start() at src/flb_engine.c:941
#13 0x55b22894d193      in  flb_lib_worker() at src/flb_lib.c:674
#14 0x7fa5729aaea6      in  ???() at ???:0
#15 0x7fa57225ea6e      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0

The stack trace always looks the same and there are no errors logged beforehand.

I don't know how to reproduce the problem as it seems to happen at random times.

Expected behavior

No crashes.

Screenshots

Your Environment

Version v3.0.3

We deploy fluent-bit via the helm chart hosted at https://fluent.github.io/helm-charts The values we provide to it follow.

config:
  service: |
    [SERVICE]
        Daemon Off
        Flush 5
        HTTP_Server On
        Log_Level warn
        Parsers_File custom_parsers.conf
        Parsers_File parsers.conf

  inputs: |
    [INPUT]
        Buffer_Chunk_Size 800k
        Buffer_Max_Size 20MB
        DB /var/log/containers/fluent-bit.db
        DB.locking true
        Multiline.parser cri, docker
        Name tail
        Path /var/log/containers/*.log
        Skip_Long_Lines Off
        Tag kube.*
    [INPUT]
        Buffer_Chunk_Size 512k
        Buffer_Max_Size 5MB
        DB /var/log/kube-audit.db
        DB.locking true
        Name tail
        Parser kube_audit
        Path /var/log/kube-audit.log
        Tag kube_audit
    [INPUT]
        Name systemd
        Read_From_Tail On
        Systemd_Filter _SYSTEMD_UNIT=kubelet.service
        Tag host.*

  filters: |
    [FILTER]
        # If the log has been broken up into multiple lines, rejoin them.
        Match                 *
        Mode                  partial_message
        Multiline.key_content log
        Name                  multiline

    [FILTER]
        # Enrich with kube data
        Annotations Off
        Buffer_Size 512k
        K8S-Logging.Parser On
        Labels Off
        Match kube.*
        Merge_Log On
        Name kubernetes

    [FILTER]
        Name                lua
        Match               kube.*
        # lua function to get the size of a log message
        script              /fluent-bit/scripts/functions.lua
        call                add_size

    # Put rewrite_tag rules early to reduce the amount of work required.
    # That's because a new message is produced with the new tag and gets processed by all the filters again starting from the top.
    # However it needs to be after the `kubernetes` filter since that requires the original tag name to enrich the data.
    # Ruby regex format can be tested at https://rubular.com/
    # Spaces are not allowed in the regex since it needs to be one contiguous string so use \s instead.
    [FILTER]
        # Retag Apache logs where the timestamp has the format of %d/%b/%Y:%H:%M:%S.%L instead of just %d/%b/%Y:%H:%M:%S
        # The new tag isn't called `apache-high-precision` because the *apache* match later chooses the heweb_apache parser.
        Match *apache*
        Name rewrite_tag
        #    $KEY  REGEX                                                                     NEW_TAG             KEEP
        Rule $log  ^[^\[]+\[\d{1,2}\/[^\/]{3}\/\d{4}:\d\d:\d\d:\d\d\.\d+\s\+\d{4}\]          web-high-precision  false
    [FILTER]
        # Retag Nginx logs where the timestamp has the format of %d/%b/%Y:%H:%M:%S.%L instead of just %d/%b/%Y:%H:%M:%S
        # The new tag isn't called `nginx-high-precision` because the *nginx* match later chooses the nginx_common parser.
        Match *nginx*
        Name rewrite_tag
        #    $KEY  REGEX                                                                     NEW_TAG             KEEP
        Rule $log  ^\S+\s\S+\s\S+\s\[\d{1,2}\/[^\/]{3}\/\d{4}:\d\d:\d\d:\d\d\.\d+\s\+\d{4}\] ngx-high-precision  false
    [FILTER]
        Match *php*
        Name rewrite_tag
        #    $KEY     REGEX      NEW_TAG       KEEP
        Rule $channel AccessLogs Security_logs false
        Rule $channel Security   Security_logs false

    [FILTER]
        # Lift the kubernetes records so we can remove the unrequired ones
        Add_prefix kubernetes.
        Match kube.*
        Name nest
        Nested_under kubernetes
        Operation lift
    [FILTER]
        # Remove unrequired kubernetes fields
        Match kube.*
        Name record_modifier
        Remove_key kubernetes.container_hash
        Remove_key kubernetes.docker_id
        Remove_key kubernetes.container_image
    [FILTER]
        # We plan to remove the php-logs container. This is a hack to keep existing saved searches and alerts working when we do.
        Condition Key_Value_Equals kubernetes.container_name php-fpm
        Condition Key_Exists monolog_level
        Match kube.*
        Name modify
        Set kubernetes.container_name php-logs
    [FILTER]
        # Lift the kubernetes.labels as well so we can modify the `app` one.
        Add_prefix kubernetes.labels.
        Match kube.*
        Name nest
        Nested_under kubernetes.labels
        Operation lift
    [FILTER]
        # Rename the `app` label to `app.kubernetes.io/name` to stop the clash with all the `app.kubernetes.io/*` labels.
        Hard_rename kubernetes.labels.app kubernetes.labels.app.kubernetes.io/name
        Match kube.*
        Name modify
    [FILTER]
        # Nest the Kubernetes labels back again.
        Match kube.*
        Name nest
        Nest_under kubernetes.labels
        Operation nest
        Remove_prefix kubernetes.labels.
        Wildcard kubernetes.labels.*
    [FILTER]
        # Nest the kubenetes records back under kubernetes
        Match kube.*
        Name nest
        Nest_under kubernetes
        Operation nest
        Remove_prefix kubernetes.
        Wildcard kubernetes.*

    [FILTER]
        # Send apache logs to the heweb_apache parser
        Key_Name log
        Match *apache*
        Name parser
        Parser heweb_apache
        Preserve_Key True
        Reserve_Data True
    [FILTER]
        # Send apache logs with a high time precision to the heweb_apache parser
        Key_Name log
        Match *web-high-precision*
        Name parser
        Parser heweb_apache_high_precision
        Preserve_Key True
        Reserve_Data True

    # Fix up logs so that all of them can be ingested by Elasticsearch.
    [FILTER]
        Name lua
        Match *
        script /fluent-bit/scripts/functions.lua
        call clean_logs

    [FILTER]
        # A number of fields share their names with other JSON logs being imported but with clashing data types.
        # Just place all of them below a `kube_audit` key to avoid that.
        Match kube_audit
        Name nest
        Nest_Under kube_audit
        Operation nest
        Wildcard *

    [FILTER]
        # Send nginx logs to the nginx_common parser
        Key_Name log
        Match *nginx*
        Name parser
        Parser nginx_common
        Preserve_Key True
        Reserve_Data True
    [FILTER]
        # Send Nginx logs with a high time precision to the nginx_high_precision parser
        Key_Name log
        Match *ngx-high-precision*
        Name parser
        Parser nginx_high_precision
        Preserve_Key True
        Reserve_Data True
    [FILTER]
        # regex stdout resque log format
        Exclude log ^\[notice\]*
        Match  *heweb_resque*
        Name   grep
    [FILTER]
        Add_prefix logEvent_
        Match  *elasticsearch*
        Name   nest
        Nested_under logEvent.url
        Operation lift
    [FILTER]
        # regex stdout resque log format
        Key_Name log
        Match *heweb_resque*
        Name parser
        Parser resque_stdout
        Preserve_Key True
        Reserve_Data True
    [FILTER]
        # Populate cluster_name field
        Match *
        Name record_modifier
        Record cluster_name prod2.he0.io
    [FILTER]
        # Drop unneeded records as docker handles ts
        # These fields result in ES errors
        Match *
        Name record_modifier
        Remove_key params
        Remove_key time
        Remove_key ts
        Remove_key url
        Remove_key logEvent
        Remove_key headers
        Remove_key host
        Remove_key _p
    [FILTER]
        # parse node-local-dns records
        Key_Name log
        Match *node-cache*
        Name parser
        Parser coredns
        Preserve_Key True
        Reserve_Data True
    [FILTER]
        # parse coredns records
        Key_Name log
        Match *coredns*
        Name parser
        Parser coredns
        Preserve_Key True
        Reserve_Data True
    [FILTER]
        # parse redash logs
        Key_Name log
        Match *redash*
        Name parser
        Parser redash
        Preserve_Key True
        Reserve_Data True
    [FILTER]
        Name record_modifier
        Match *system*
        Remove_key node
        Remove_key pod
        Remove_key resource
        Remove_key service
    [FILTER]
        Name modify
        Match Security_logs
        Remove_wildcard kubernetes
    [FILTER]
        # Rename the source field from kubernetes-event-exporter to avoid a type clash.
        Match *kube-event-exporter*
        Name modify
        Rename source kube_event_source
    [FILTER]
        # Rename the policy field from kyverno to avoid a type clash.
        Match *kyverno*
        Name modify
        Rename policy kyverno_policy
    [FILTER]
        # Give Amazon trace IDs a consistent name.
        # This is at the end after all the custom parsers have been chosen since some of them produce X_Amzn_Trace_Id
        # Unfortunately they can't produce `X-Amzn-Trace-Id` because a hyphen is an illegal character for the regex group names.
        Match *
        Name modify
        Rename request_X-Amzn-Trace-Id X-Amzn-Trace-Id
        Rename X_Amzn_Trace_Id X-Amzn-Trace-Id

  outputs: |
    [OUTPUT]
        compress gzip
        Format json
        Host logstash-cluster-shipper.elasticsearch.svc.cluster.local
        Match  *
        Name   http
        Port   443
        Retry_Limit 120
        tls On
        tls.verify Off
        json_date_format iso8601
    [OUTPUT]
        # Send to Elasticsearch security logs
        Generate_ID On
        Host elasticsearch.prod1.apps.he0.io
        Include_Tag_Key On
        Index security-logs-ilm
        Match Security_logs
        Name es
        Port 443
        Replace_Dots On
        storage.total_limit_size 10G
        Suppress_Type_Name On
        tls On
        tls.verify Off
        Trace_Error On

  customParsers: |
    [PARSER]
        Name coredns
        Format regex
        Regex ^\[(?<level>\S*)\] (?<remote>.*):(?<port>\S*) - (?<coredns_id>\S*) "(?<type>\S*) (?<coredns_class>\S*) (?<coredns_name>\S*) (?<coredns_proto>\S*) (?<size>\S*) (?<do>\S*) (?<buffer_size>\S*)" (?<rcode>\S*) (?<rflags>\S*) (?<rsize>\S*) (?<duration>\S*)
    [PARSER]
        Name heweb_apache
        Format regex
        Key_Name log
        Regex ^(?<hosts>(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+)(?:(?:,\s*(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+))?)*|-) - (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+) (?<url_path>.*?) HTTP[^/]*/[.\d]+" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*) "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)" (?<http_request_time_us>\S*) "(?<Origin>[^\"]*)" "(?<X_Amzn_Trace_Id>[^\"]*)"
        Time_Format %d/%b/%Y:%H:%M:%S %z
        Time_Key time
    [PARSER]
        Name heweb_apache_high_precision
        Format regex
        Key_Name log
        Regex ^(?<hosts>(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+)(?:(?:,\s*(?:(?:::ffff:)?(?:\d+\.){3}\d+|[\da-f:]+))?)*|-) - (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+) (?<url_path>.*?) HTTP[^/]*/[.\d]+" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*) "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)" (?<http_request_time_us>\S*) "(?<Origin>[^\"]*)" "(?<X_Amzn_Trace_Id>[^\"]*)"
        Time_Format %d/%b/%Y:%H:%M:%S.%L %z
        Time_Key time
    [PARSER]
        Name klog
        Format regex
        Regex (?<ID>\S*)\s(?<ts>\d{2}:\d{2}:\d{2}\.\d{6})\s*(?<line>\d*)\s(?<file>\S*])\s(?<message>.*)
        # Command       |  Decoder    | Field | Optional Action   |
        # ==============|=============|=======|===================|
        Decode_Field_As    escaped       log        try_next
        Decode_Field_As    escaped_utf8  log
    [PARSER]
        Format json
        Name kube_audit
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
        Time_Keep true
        Time_Key requestReceivedTimestamp
    [PARSER]
        Name nginx_common
        Format regex
        Regex ^(?<client_ip>\S*) (?<server_domain>\S*) (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+)(?: +(?<url_path>[^\"]*?)(?: +\S*)?)?" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*)(?: "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)")? (?<http_request_time>\S*) (?<http_request_bytes>\S*) (?<X_Amzn_Trace_Id>\S*) (?<X_Forwarded_For>.*)
        Time_Format %d/%b/%Y:%H:%M:%S %z
        Time_Key time
    [PARSER]
        Name nginx_high_precision
        Format regex
        Regex ^(?<client_ip>\S*) (?<server_domain>\S*) (?<user_name>\S*) \[(?<time>[^\]]*)\] "(?<http_request_method>\S+)(?: +(?<url_path>[^\"]*?)(?: +\S*)?)?" (?<http_response_status_code>\S*) (?<http_response_body_bytes>\S*)(?: "(?<http_request_referrer>[^\"]*)" "(?<user_agent_original>[^\"]*)")? (?<http_request_time>\S*) (?<http_request_bytes>\S*) (?<X_Amzn_Trace_Id>\S*) (?<X_Forwarded_For>.*)
        Time_Format %d/%b/%Y:%H:%M:%S.%L %z
        Time_Key time
    [PARSER]
        Name resque_stdout
        Format regex
        Regex ^\[(?<severity>\S*)\] .*\(Job\{(?<queue>\S*)\} \| ID\: (?<JobID>\S*) \| (?<class>\S*) \| \[(?<jsonData>.*)\]\)(?<message>.*)
    [PARSER]
        Name redash
        Format regex
        Regex ^\[(?<timestamp>.*)\]\[PID\:(?<pid>.*)\]\[(?<level>.*)\]\[(?<output>.*)\] method=(?<method>\S*) path=(?<path>\S*) endpoint=(?<endpoint>\S*) status=(?<status>\S*) content_type=(?<content_type>\S*)( charset=(?<charset>\S*))? content_length=(?<content_length>\S*) duration=(?<duration>\S*) query_count=(?<query_count>\S*) query_duration=(?<query_duration>\S*)

dnsConfig:
  options:
    - name: ndots
      value: "1"

image:
  tag: 3.0.3

luaScripts:
  functions.lua: |
    {{- (readFile "fluent-bit.lua") | nindent 4 }}

priorityClassName: daemon-sets

resources:
  limits:
    memory: 256Mi
  requests:
    cpu: 120m
    memory: 48Mi

service:
  annotations:
    service.kubernetes.io/topology-mode: auto

tolerations:
  - operator: Exists
    effect: NoExecute
  - operator: Exists
    effect: NoSchedule

updateStrategy:
  rollingUpdate:
    maxUnavailable: 20%
  type: RollingUpdate

Kubernetes version v1.28.9

AWS EC2 instances.

Kubernetes is running on Ubuntu 20.04 hosts.

Should be covered by the configuration shown above.

Additional context

If you think the crashes could be caused by functions.lua then I can also supply the content for that.

prasadkris commented 6 months ago

I am also experiencing this exact issue with Fluent Bit version 3.0.3, running on our self-hosted Kubernetes cluster (Kubernetes version: v1.28.3, Host: Ubuntu 22.04), Thanks!

cloud-thought commented 6 months ago

Facing the same issue on aarch64 (ARMv8) nodes in a Kubernetes 1.25 environment since version 3 (also tried latest 3.0.4).

UPDATE: we are currently still using 2.2.3 due to this error. In 2.2.3 this error happens too but very rarely in comparison to v3. So, maybe some related library/framework used in the background changed and this info helps to figure out which one.

clockrun commented 3 months ago

I have same issue, too. We were using 2.1.3 and there were many crash in the past. I updated it to latest 3.1.6 and the crash still happens.

From my observation, this issue happens frequently on nodes where we create pod frequently. On node where pod rarely killed or restarted have no issue. On Grafana, currently I have set memory limit to 2GB and I don't see out of memory from Grafana.

Now I'm trying 2.2.2 and 2.2.3 as suggested by above comments.

hisarbalik commented 2 months ago

We are experiencing the same issue described in the comment above, with fluent-bit crashing on the nodes where pods are frequently created. We also tested with fluent-bit version 3.1.8 and still observed the issue.

fcuello-fudo commented 1 month ago

The issue is still happening in Fluent Bit v3.1.9

[2024/10/07 18:35:23] [engine] caught signal (SIGSEGV)
#0  0x561cefcb8df9      in  flb_log_event_encoder_dynamic_field_flush_scopes() at src/flb_log_event_encoder_dynamic_field.c:210
#1  0x561cefcb8df9      in  flb_log_event_encoder_dynamic_field_reset() at src/flb_log_event_encoder_dynamic_field.c:240
#2  0x561cefcb6cfc      in  flb_log_event_encoder_reset() at src/flb_log_event_encoder.c:33
#3  0x561cefce7d7f      in  ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:418
#4  0x561cefce7d7f      in  ml_flush_callback() at plugins/in_tail/tail_file.c:919
#5  0x561cefc9a457      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1516
#6  0x561cefc9ab84      in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#7  0x561cefcb9cec      in  flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#8  0x561cefce84c4      in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1256
#9  0x561cefce43c6      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:242
#10 0x561cefc6a15a      in  flb_input_collector_fd() at src/flb_input.c:1970
#11 0x561cefc83d23      in  flb_engine_handle_event() at src/flb_engine.c:575
#12 0x561cefc83d23      in  flb_engine_start() at src/flb_engine.c:941
#13 0x561cefc5f163      in  flb_lib_worker() at src/flb_lib.c:674
#14 0x7f8314fdf143      in  ???() at ???:0
#15 0x7f831505f7db      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0
[2024/10/07 13:33:06] [engine] caught signal (SIGSEGV)
#0  0x55f7afab6df9      in  flb_log_event_encoder_dynamic_field_flush_scopes() at src/flb_log_event_encoder_dynamic_field.c:210
#1  0x55f7afab6df9      in  flb_log_event_encoder_dynamic_field_reset() at src/flb_log_event_encoder_dynamic_field.c:240
#2  0x55f7afab4cfc      in  flb_log_event_encoder_reset() at src/flb_log_event_encoder.c:33
#3  0x55f7afae5d7f      in  ml_stream_buffer_flush() at plugins/in_tail/tail_file.c:418
#4  0x55f7afae5d7f      in  ml_flush_callback() at plugins/in_tail/tail_file.c:919
#5  0x55f7afa98457      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1516
#6  0x55f7afa98b84      in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#7  0x55f7afab7cec      in  flb_ml_stream_id_destroy_all() at src/multiline/flb_ml_stream.c:316
#8  0x55f7afae64c4      in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1256
#9  0x55f7afae23c6      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:242
#10 0x55f7afa6815a      in  flb_input_collector_fd() at src/flb_input.c:1970
#11 0x55f7afa81d23      in  flb_engine_handle_event() at src/flb_engine.c:575
#12 0x55f7afa81d23      in  flb_engine_start() at src/flb_engine.c:941
#13 0x55f7afa5d163      in  flb_lib_worker() at src/flb_lib.c:674
#14 0x7f6a338c0143      in  ???() at ???:0
#15 0x7f6a339407db      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0
clockrun commented 1 month ago

I can confirm that version 2.2.2 and 2.2.3 are good