FLB File Corruption - Githubissues

jschwartzy commented 2 years ago

Bug Report

Describe the bug We have recently implemented the multiline.parser feature to recombine logs split by the Docker 16Kb limit. Our fluent-bit pods send data to local Splunk HEC pods that are then forwarded to Splunk indexers. After a few hours of operation, the forwarding to the local Splunk HEC starts to fail and we see increased back-pressure on the fluent-bit pods. They eventually get OOMKilled if limits are set or continue to chew up memory. The multiline.parser is the only major configuration change that was made. This issue did not occur previously. We start to see the symptom in the fluent-bit logs:

[2022/05/11 20:43:40] [error] [output:http:http.0] 10.197.229.122:8088, HTTP status=400
{"text":"Invalid data format","code":6,"invalid-event-number":0}
[2022/05/11 20:43:40] [ warn] [engine] failed to flush chunk '7-1652301819.353516109.flb', retry in 10 seconds: task_id=71, input=tail.0 > output=http.0 (out_id=0)

Splunk is rejecting the payloads due to corrupt or invalid characters (presumably). The chunks don't get flushed and eventually start to soak up all the memory available (limit or otherwise). It's important to note that does not happen for every pod instance. It is proving to be difficult to determine exactly what is triggering the corruption.

To Reproduce

Deploy fluent-bit 1.9.3-debug
Configure fluent-bit with multiline.parser
Use HTTP Output to forward to Splunk HEC (https://github.com/fluent/fluent-bit/issues/3699#issuecomment-897294294)
Generate log load
Observe the memory consumption over 24 hour period.
Start to see the failure messages in the logs.

Expected behavior Fluent-bit should forward the logs to Splunk without data corruption.

Screenshots DataDog graph of fluent-pods over 24h:

Your Environment

Version used: 1.9.3-debug

Configuration:

[SERVICE]
    Flush        1
    Daemon       Off
    Log_Level    info
    Parsers_File parsers.conf
    Parsers_File parsers_custom.conf
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020
    storage.path              /storage
    storage.sync              normal
    storage.checksum          false
    storage.backlog.mem_limit 10MB
    storage.metrics           on

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    multiline.parser  docker, cri
    Tag               kube.*
    Refresh_Interval  5
    Buffer_Chunk_Size 32KB
    Buffer_Max_Size   5MB
    Mem_Buf_Limit     10MB
    Rotate_Wait       60
    storage.type      filesystem
    Skip_Long_Lines   On
    DB                /tail-db/tail-containers-state.db
    DB.Sync           Normal

[FILTER]
    Name    grep
    Match   kube.*
    Exclude log ^$

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           Off
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
    Buffer_Size         128k

[FILTER]
    Name   lua
    Match  kube.*
    Script script_custom.lua
    Call   parse_splunk_info

[FILTER]
    Name         nest
    Match        kube.*
    Operation    lift
    Nested_under kubernetes
    Add_prefix   __kube_

[FILTER]
    Name         nest
    Match        kube.*
    Operation    lift
    Nested_under __kube_labels
    Add_prefix   __kube_labels_

[FILTER]
    Name         nest
    Match        kube.*
    Operation    lift
    Nested_under __kube_annotations
    Add_prefix   __kube_annotations_

[INPUT]
    Name           systemd
    Tag            systemd.*
    Systemd_Filter _SYSTEMD_UNIT=docker.service
    Systemd_Filter _SYSTEMD_UNIT=kubelet.service
    DB             /tail-db/systemd-state.db

[FILTER]
    Name            modify
    Match           systemd.*
    Rename          _HOSTNAME host
    Rename          SYSLOG_IDENTIFIER sourcetype
    Copy            sourcetype source
    Rename          _TRANSPORT stream
    Rename          MESSAGE log
    Rename          PRIORITY _PRIORITY
    Rename          SYSLOG_FACILITY _SYSLOG_FACILITY
    Remove_wildcard _

# Common metadata
[FILTER]
    Name   modify
    Match  *
    Add    __kube_cluster_name gameservices-stg
    Add    __aws_region us-west-2
    Add    __environment stg
    Remove time
    Rename stream __stream
    Rename log event

# Move metadata to be under "fields" so that they appear as extracted fields.
# See: http://docs.splunk.com/Documentation/Splunk/7.1.2/Data/IFXandHEC
[FILTER]
    Name          nest
    Match         *
    Operation     nest
    Wildcard      __*
    Nested_under  fields
    Remove_prefix __

#Switch to use Http plugin instead of Splunk plugin:
#https://github.com/fluent/fluent-bit/issues/3699#issuecomment-897294294
[OUTPUT]
    Name            http
    Log_level       warning
    Match           *
    Tls             off
    Host            ${HOST_IP}
    Port            8088
    Uri             /services/collector/event
    Header          Authorization Splunk ${SPLUNK_HEC_TOKEN}
    Format          json_stream
    Retry_Limit     false
    Json_date_key   time

script_custom.lua - parses pod annotations to configure Splunk records for the appropriate index, source, source_type.

    function parse_splunk_info(tag, timestamp, record)
      stream = record["stream"]
      if (not stream) then
        stream = "unknown"
      end
      k8s = record["kubernetes"]
      if (k8s) then
        k8s_annotations = k8s["annotations"]
        container_name = k8s["container_name"]
        pod_name = k8s["pod_name"]
      else
        container_name = "unknown"
        pod_name = "unknown"
      end

      record["index"] = "gameservicesweb_staging"
      record["sourcetype"] = container_name
      record["source"] = container_name
      record["host"] = pod_name

      if (k8s_annotations) then
        default_index = k8s_annotations["splunk/index"]
        container_index = k8s_annotations["splunk/index_" .. container_name]
        container_stream_index = k8s_annotations["splunk/index_" .. container_name .. "_" .. stream]
        if (container_stream_index) then
          record["index"] = container_stream_index
        elseif (container_index) then
          record["index"] = container_index
        elseif (default_index) then
          record["index"] = default_index
        end

        default_sourcetype = k8s_annotations["splunk/sourcetype"]
        container_sourcetype = k8s_annotations["splunk/sourcetype_" .. container_name]
        container_stream_sourcetype = k8s_annotations["splunk/sourcetype_" .. container_name .. "_" .. stream]
        if (container_stream_sourcetype) then
          record["sourcetype"] = container_stream_sourcetype
        elseif (container_sourcetype) then
          record["sourcetype"] = container_sourcetype
        elseif (default_sourcetype) then
          record["sourcetype"] = default_sourcetype
        end

        default_source = k8s_annotations["splunk/source"]
        container_source = k8s_annotations["splunk/source_" .. container_name]
        container_stream_source = k8s_annotations["splunk/source_" .. container_name .. "_" .. stream]
        if (container_stream_source) then
          record["source"] = container_stream_source
        elseif (container_source) then
          record["source"] = container_source
        elseif (default_source) then
          record["source"] = default_source
        end
      end

      return 1, timestamp, record
    end

Environment name and version (e.g. Kubernetes? What version?): EKS v1.20.11-eks-f17b81
Server type and version: EC2 m5.2xlarge
Operating System and version: Amazon Linux 5.10.96-90.460.amzn2.x86_64
Filters and plugins: listed in configuration above.

Additional context We operate several multi and single-tenant EKS clusters. Logging is one of the services we provide and teams rely on logs for troubleshooting, analytics, monitoring and alerting. If logs stop flowing, teams are hamstrung.

patrick-stephens commented 2 years ago

Probably need to include the details of that lua script. I know it may not be relevant but it is part of the config.

jschwartzy commented 2 years ago

Probably need to include the details of that lua script. I know it may not be relevant but it is part of the config.

Added. I can upload sample FLB files if it helps.

lecaros commented 2 years ago

Hi @jschwartzy, have you observed this behavior using the Splunk plugin instead of the HTTP? In the splunk plugin we have http_debug_bad_request which may help to troubleshoot this issue.

jschwartzy commented 2 years ago

Hi @jschwartzy, have you observed this behavior using the Splunk plugin instead of the HTTP? In the splunk plugin we have http_debug_bad_request which may help to troubleshoot this issue.

Hi @lecaros appreciate the response. Switched to the Splunk plugin for the debugging feature you mentioned. Looks like some of the payloads sent to Splunk are getting cut-off.

For example, here is a snippet of one of the rejected payloads. I'm including the beginning of the JSON data and the end to show that it's not closed properly, so Splunk is returning the invalid data response:

{"time":1652795490.211643,"event":{"event":"2022-05-17T13:51:30,211 severity=ERROR 
...
"fields":{"stream":"stdout","kube_pod_id":"3e04270c-2f27-424a-9473-e0bc9a80293b","kube_container_name":"service","kube_pod_name":"xxx",
"kube_container_hash":"xxx.dkr.ecr.us-west-2.amazonaws.com/xxx@sha256:c8e5a8e13106bd81d57975b135e9f2b45b4619463d802fd8c488011192e2d436",
"kube_host":"xx.us-west-2.compute.internal",
"kube_docker_id":"51540dab43e2e6be2fe958f45308b08e5ecab5a617df034745adbd5e61b9eb81",
"kube_container_image":"123456[2022/05/17 13:53:56]

For comparison's sake, the record should look something like:


{"time":1652795490.211437,"event":{"event":"2022-05-17T13:51:30,211 severity=INFO
...
"fields":{"stream":"stdout","kube_pod_id":"3e04270c-2f27-424a-9473-e0bc9a80293b","kube_container_name":"service","kube_pod_name":"xxx",
"kube_container_hash":"xxx.dkr.ecr.us-west-2.amazonaws.com/xxx@sha256:c8e5a8e13106bd81d57975b135e9f2b45b4619463d802fd8c488011192e2d436",
"kube_host":"xxx.us-west-2.compute.internal",
"kube_docker_id":"51540dab43e2e6be2fe958f45308b08e5ecab5a617df034745adbd5e61b9eb81",
"kube_container_image":"123456.dkr.ecr.us-west-2.amazonaws.com/xx/xyz:e53f550d3",
"kube_namespace_name":"namespace",
...
"kube_annotations_vault.hashicorp.com/agent-init-first":"true",
"kube_annotations_vault.hashicorp.com/agent-requests-mem":"128Mi",
"kube_cluster_name":"xyz","aws_region":"us-west-2","environment":"stg"}}}

lecaros commented 2 years ago

Hi @jschwartzy can you share sample flb files with us to try to reproduce locally? Thanks!

jschwartzy commented 2 years ago

Sent sample files via email. Thanks.

lecaros commented 2 years ago

Hi @jschwartzy, We can tell from the provided files that they are not corrupted, but it seems the content is messed up somewhere in the pipeline. For you, is this issue reproducible using the config and Lua filter you've shared, or is it happening with a particular app log only? Having a repro is always better :)

jschwartzy commented 2 years ago

Appreciate the reply. The issue is reproducible when deployed to one of our EKS clusters that has multiple workloads and a fair amount of traffic. It is not tied to a specific application as far as I can tell.

Any suggestions for trying to pin down where the in pipeline things could be going wrong?

jschwartzy commented 2 years ago

Will close this issue since does not seem to be related to FLB corruption. Thanks for looking into it.

fluent / fluent-bit

FLB File Corruption #5441

Bug Report