fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

FLB File Corruption #5441

Closed jschwartzy closed 2 years ago

jschwartzy commented 2 years ago

Bug Report

Describe the bug

We recently enabled the multiline.parser feature to recombine log lines split by Docker's 16KB limit. Our fluent-bit pods send data to local Splunk HEC pods, which forward it on to the Splunk indexers. After a few hours of operation, forwarding to the local Splunk HEC starts to fail and we see increasing back-pressure on the fluent-bit pods; they eventually get OOMKilled if limits are set, or otherwise keep consuming memory. The multiline.parser is the only major configuration change that was made, and this issue did not occur previously. The first symptom shows up in the fluent-bit logs:

[2022/05/11 20:43:40] [error] [output:http:http.0] 10.197.229.122:8088, HTTP status=400
{"text":"Invalid data format","code":6,"invalid-event-number":0}
[2022/05/11 20:43:40] [ warn] [engine] failed to flush chunk '7-1652301819.353516109.flb', retry in 10 seconds: task_id=71, input=tail.0 > output=http.0 (out_id=0)

Splunk is rejecting the payloads, presumably due to corrupt or invalid characters. The affected chunks never get flushed and eventually soak up all of the available memory, limit or otherwise. It's important to note that this does not happen for every pod instance, and it is proving difficult to determine exactly what triggers the corruption.
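
For reference, the multiline handling described above lives on the tail input, and the records are then shipped to the local HEC endpoint through the http output. A minimal sketch of that shape of configuration (host, paths, and token are placeholders rather than our actual values):

    [INPUT]
        # Tail the container logs and reassemble lines split at Docker's 16KB limit
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        multiline.parser  docker, cri
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [OUTPUT]
        # Ship records to the local Splunk HEC pod over HTTP
        Name    http
        Match   *
        Host    splunk-hec.example.local
        Port    8088
        URI     /services/collector/event
        Format  json
        Header  Authorization Splunk ${SPLUNK_HEC_TOKEN}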

To Reproduce

Expected behavior

Fluent-bit should forward the logs to Splunk without data corruption.

Screenshots

DataDog graph of the fluent-bit pods over 24h.

Your Environment

script_custom.lua - parses pod annotations to set the appropriate Splunk index, source, and sourcetype on each record.

    -- Fluent Bit Lua filter callback: derives the Splunk index, sourcetype,
    -- source, and host for each record from its Kubernetes metadata and
    -- pod annotations.
    function parse_splunk_info(tag, timestamp, record)
      local stream = record["stream"]
      if (not stream) then
        stream = "unknown"
      end

      -- Container/pod identifiers come from the kubernetes filter metadata.
      local k8s = record["kubernetes"]
      local k8s_annotations, container_name, pod_name
      if (k8s) then
        k8s_annotations = k8s["annotations"]
        container_name = k8s["container_name"]
        pod_name = k8s["pod_name"]
      else
        container_name = "unknown"
        pod_name = "unknown"
      end

      -- Defaults, used when no overriding annotation is present.
      record["index"] = "gameservicesweb_staging"
      record["sourcetype"] = container_name
      record["source"] = container_name
      record["host"] = pod_name

      if (k8s_annotations) then
        -- Index: the most specific annotation wins
        -- (per container and stream, then per container, then the pod-wide default).
        local default_index = k8s_annotations["splunk/index"]
        local container_index = k8s_annotations["splunk/index_" .. container_name]
        local container_stream_index = k8s_annotations["splunk/index_" .. container_name .. "_" .. stream]
        if (container_stream_index) then
          record["index"] = container_stream_index
        elseif (container_index) then
          record["index"] = container_index
        elseif (default_index) then
          record["index"] = default_index
        end

        -- Sourcetype: same precedence as the index.
        local default_sourcetype = k8s_annotations["splunk/sourcetype"]
        local container_sourcetype = k8s_annotations["splunk/sourcetype_" .. container_name]
        local container_stream_sourcetype = k8s_annotations["splunk/sourcetype_" .. container_name .. "_" .. stream]
        if (container_stream_sourcetype) then
          record["sourcetype"] = container_stream_sourcetype
        elseif (container_sourcetype) then
          record["sourcetype"] = container_sourcetype
        elseif (default_sourcetype) then
          record["sourcetype"] = default_sourcetype
        end

        -- Source: same precedence as the index.
        local default_source = k8s_annotations["splunk/source"]
        local container_source = k8s_annotations["splunk/source_" .. container_name]
        local container_stream_source = k8s_annotations["splunk/source_" .. container_name .. "_" .. stream]
        if (container_stream_source) then
          record["source"] = container_stream_source
        elseif (container_source) then
          record["source"] = container_source
        elseif (default_source) then
          record["source"] = default_source
        end
      end

      -- Return code 1 keeps the (modified) record with the given timestamp.
      return 1, timestamp, record
    end
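
The function above is registered through the Lua filter; the wiring is roughly the following (the match pattern is illustrative, the script and function names are the ones used above):

    [FILTER]
        # Run parse_splunk_info from script_custom.lua on every tailed record
        Name    lua
        Match   kube.*
        Script  script_custom.lua
        Call    parse_splunk_info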

Additional context

We operate several multi-tenant and single-tenant EKS clusters. Logging is one of the services we provide, and teams rely on logs for troubleshooting, analytics, monitoring, and alerting; if logs stop flowing, teams are hamstrung.

patrick-stephens commented 2 years ago

Probably need to include the details of that lua script. I know it may not be relevant but it is part of the config.

jschwartzy commented 2 years ago

Probably need to include the details of that lua script. I know it may not be relevant but it is part of the config.

Added. I can upload sample FLB files if it helps.

lecaros commented 2 years ago

Hi @jschwartzy, have you observed this behavior using the Splunk plugin instead of the HTTP output? The Splunk plugin has an http_debug_bad_request option which may help troubleshoot this issue.
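
For reference, a minimal out_splunk block with that flag enabled looks roughly like the following (host and token are placeholders):

    [OUTPUT]
        # Send to the local HEC endpoint and dump any request rejected with HTTP 400
        Name                    splunk
        Match                   *
        Host                    splunk-hec.example.local
        Port                    8088
        Splunk_Token            ${SPLUNK_HEC_TOKEN}
        TLS                     Off
        http_debug_bad_request  On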

jschwartzy commented 2 years ago

Hi @jschwartzy, have you observed this behavior using the Splunk plugin instead of the HTTP output? The Splunk plugin has an http_debug_bad_request option which may help troubleshoot this issue.

Hi @lecaros, appreciate the response. I switched to the Splunk plugin for the debugging feature you mentioned, and it looks like some of the payloads sent to Splunk are getting cut off.

For example, here is a snippet of one of the rejected payloads. I'm including the beginning and the end of the JSON data to show that it is not closed properly, which is why Splunk returns the invalid data response:

{"time":1652795490.211643,"event":{"event":"2022-05-17T13:51:30,211 severity=ERROR 
...
"fields":{"stream":"stdout","kube_pod_id":"3e04270c-2f27-424a-9473-e0bc9a80293b","kube_container_name":"service","kube_pod_name":"xxx",
"kube_container_hash":"xxx.dkr.ecr.us-west-2.amazonaws.com/xxx@sha256:c8e5a8e13106bd81d57975b135e9f2b45b4619463d802fd8c488011192e2d436",
"kube_host":"xx.us-west-2.compute.internal",
"kube_docker_id":"51540dab43e2e6be2fe958f45308b08e5ecab5a617df034745adbd5e61b9eb81",
"kube_container_image":"123456[2022/05/17 13:53:56]

For comparison's sake, the record should look something like:


{"time":1652795490.211437,"event":{"event":"2022-05-17T13:51:30,211 severity=INFO
...
"fields":{"stream":"stdout","kube_pod_id":"3e04270c-2f27-424a-9473-e0bc9a80293b","kube_container_name":"service","kube_pod_name":"xxx",
"kube_container_hash":"xxx.dkr.ecr.us-west-2.amazonaws.com/xxx@sha256:c8e5a8e13106bd81d57975b135e9f2b45b4619463d802fd8c488011192e2d436",
"kube_host":"xxx.us-west-2.compute.internal",
"kube_docker_id":"51540dab43e2e6be2fe958f45308b08e5ecab5a617df034745adbd5e61b9eb81",
"kube_container_image":"123456.dkr.ecr.us-west-2.amazonaws.com/xx/xyz:e53f550d3",
"kube_namespace_name":"namespace",
...
"kube_annotations_vault.hashicorp.com/agent-init-first":"true",
"kube_annotations_vault.hashicorp.com/agent-requests-mem":"128Mi",
"kube_cluster_name":"xyz","aws_region":"us-west-2","environment":"stg"}}}

lecaros commented 2 years ago

Hi @jschwartzy, can you share sample flb files with us so we can try to reproduce this locally? Thanks!

jschwartzy commented 2 years ago

Sent sample files via email. Thanks.

lecaros commented 2 years ago

Hi @jschwartzy, we can tell from the provided files that they are not corrupted, but it seems the content gets mangled somewhere in the pipeline. On your side, is this issue reproducible using just the config and Lua filter you've shared, or does it only happen with a particular app's logs? Having a repro is always better :)
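
One way to narrow this down is a stripped-down pipeline that replays a captured container log through the same multiline parser and Lua filter, but prints the records instead of shipping them to Splunk (paths here are illustrative):

    [INPUT]
        # Replay a saved container log from the beginning
        Name              tail
        Tag               repro.*
        Path              /tmp/sample-container.log
        Read_from_Head    On
        multiline.parser  docker, cri

    [FILTER]
        Name    lua
        Match   repro.*
        Script  script_custom.lua
        Call    parse_splunk_info

    [OUTPUT]
        # Inspect what leaves the pipeline without involving Splunk
        Name    stdout
        Match   *
        Format  json_lines

Running fluent-bit with a config like this against one of the affected log samples would show whether the records are already malformed before they reach the output plugin.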

jschwartzy commented 2 years ago

Appreciate the reply. The issue is reproducible when deployed to one of our EKS clusters that has multiple workloads and a fair amount of traffic. It is not tied to a specific application as far as I can tell.

Any suggestions for trying to pin down where in the pipeline things could be going wrong?

jschwartzy commented 2 years ago

Will close this issue since it does not seem to be related to FLB file corruption. Thanks for looking into it.