Closed jschwartzy closed 2 years ago
Probably need to include the details of that lua script. I know it may not be relevant but it is part of the config.
Added. I can upload sample FLB files if it helps.
Hi @jschwartzy, have you observed this behavior using the Splunk plugin instead of the HTTP? In the splunk plugin we have http_debug_bad_request which may help to troubleshoot this issue.
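For reference, enabling that option could look like the sketch below in classic-mode fluent-bit config. The `Host`, `Port`, and token values are placeholders, and `http_debug_bad_request` is the option named above (it logs the HTTP response body when Splunk rejects a request):

```
[OUTPUT]
    Name                    splunk
    Match                   *
    Host                    splunk-hec.logging.svc   # placeholder service name
    Port                    8088
    Splunk_Token            ${SPLUNK_HEC_TOKEN}
    TLS                     On
    # Log the response body on HTTP 400 so the rejected
    # payload can be inspected
    http_debug_bad_request  On
```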
Hi @lecaros, appreciate the response. I switched to the Splunk plugin for the debugging feature you mentioned. It looks like some of the payloads sent to Splunk are getting cut off.
For example, here is a snippet of one of the rejected payloads. I'm including the beginning and the end of the JSON data to show that it is not closed properly, which is why Splunk returns the invalid-data response:
{"time":1652795490.211643,"event":{"event":"2022-05-17T13:51:30,211 severity=ERROR
...
"fields":{"stream":"stdout","kube_pod_id":"3e04270c-2f27-424a-9473-e0bc9a80293b","kube_container_name":"service","kube_pod_name":"xxx",
"kube_container_hash":"xxx.dkr.ecr.us-west-2.amazonaws.com/xxx@sha256:c8e5a8e13106bd81d57975b135e9f2b45b4619463d802fd8c488011192e2d436",
"kube_host":"xx.us-west-2.compute.internal",
"kube_docker_id":"51540dab43e2e6be2fe958f45308b08e5ecab5a617df034745adbd5e61b9eb81",
"kube_container_image":"123456[2022/05/17 13:53:56]
For comparison's sake, the record should look something like:
{"time":1652795490.211437,"event":{"event":"2022-05-17T13:51:30,211 severity=INFO
...
"fields":{"stream":"stdout","kube_pod_id":"3e04270c-2f27-424a-9473-e0bc9a80293b","kube_container_name":"service","kube_pod_name":"xxx",
"kube_container_hash":"xxx.dkr.ecr.us-west-2.amazonaws.com/xxx@sha256:c8e5a8e13106bd81d57975b135e9f2b45b4619463d802fd8c488011192e2d436",
"kube_host":"xxx.us-west-2.compute.internal",
"kube_docker_id":"51540dab43e2e6be2fe958f45308b08e5ecab5a617df034745adbd5e61b9eb81",
"kube_container_image":"123456.dkr.ecr.us-west-2.amazonaws.com/xx/xyz:e53f550d3",
"kube_namespace_name":"namespace",
...
"kube_annotations_vault.hashicorp.com/agent-init-first":"true",
"kube_annotations_vault.hashicorp.com/agent-requests-mem":"128Mi",
"kube_cluster_name":"xyz","aws_region":"us-west-2","environment":"stg"}}}
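One way to confirm which batches are truncated is to walk the HEC payload (a stream of concatenated JSON objects, not a JSON array) with an incremental decoder; a complete batch parses cleanly, while a cut-off batch like the one above raises at the point of truncation. A minimal sketch (the function name is mine, not part of fluent-bit or Splunk):

```python
import json

def split_hec_events(payload: str):
    """Split a Splunk HEC batch (concatenated JSON objects) into events.

    Raises json.JSONDecodeError (a ValueError) if the payload ends
    mid-object, which is the truncation symptom in the rejected batches.
    """
    decoder = json.JSONDecoder()
    events, idx = [], 0
    while idx < len(payload):
        # Skip any whitespace between concatenated objects
        while idx < len(payload) and payload[idx].isspace():
            idx += 1
        if idx >= len(payload):
            break
        obj, end = decoder.raw_decode(payload, idx)
        events.append(obj)
        idx = end
    return events

# A complete batch parses into its individual events;
# a truncated one raises, flagging the bad chunk.
events = split_hec_events('{"time":1,"event":"a"}{"time":2,"event":"b"}')
```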
Hi @jschwartzy can you share sample flb files with us to try to reproduce locally? Thanks!
Sent sample files via email. Thanks.
Hi @jschwartzy, we can tell from the provided files that they are not corrupted, but it seems the content is getting mangled somewhere in the pipeline. On your side, is this issue reproducible using the config and Lua filter you've shared, or does it happen only with a particular app's logs? Having a repro is always better :)
Appreciate the reply. The issue is reproducible when deployed to one of our EKS clusters that has multiple workloads and a fair amount of traffic. It is not tied to a specific application as far as I can tell.
Any suggestions for trying to pin down where in the pipeline things could be going wrong?
Will close this issue since it does not seem to be related to FLB corruption. Thanks for looking into it.
Bug Report
Describe the bug
We recently implemented the `multiline.parser` feature to recombine logs split by the Docker 16KB limit. Our fluent-bit pods send data to local Splunk HEC pods, which then forward to Splunk indexers. After a few hours of operation, forwarding to the local Splunk HEC starts to fail and we see increased back-pressure on the fluent-bit pods. They eventually get OOMKilled if limits are set, or continue to chew up memory otherwise. The `multiline.parser` is the only major configuration change that was made; this issue did not occur previously. The symptom in the fluent-bit logs: Splunk is rejecting the payloads, presumably due to corrupt or invalid characters. The chunks don't get flushed and eventually soak up all the available memory (limited or otherwise). It's important to note that this does not happen for every pod instance, which is making it difficult to determine exactly what triggers the corruption.
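The issue doesn't show the input config at this point; a minimal sketch of the change described, assuming a `tail` input with the built-in `docker` and `cri` multiline parsers, would look like:

```
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*
    # Recombine container logs split at the Docker 16KB limit
    multiline.parser  docker, cri
    Mem_Buf_Limit     50MB
```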
To Reproduce
Enable the `multiline.parser` feature with the configuration below.
Expected behavior Fluent-bit should forward the logs to Splunk without data corruption.
Screenshots
DataDog graph of fluent-bit pods over 24h:
Your Environment
Configuration:
script_custom.lua
- parses pod annotations to configure Splunk records for the appropriate index, source, and source_type.
Additional context
We operate several multi- and single-tenant EKS clusters. Logging is one of the services we provide, and teams rely on logs for troubleshooting, analytics, monitoring, and alerting. If logs stop flowing, teams are hamstrung.
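The script_custom.lua referenced above was not included inline. A hypothetical sketch of the described behavior, using the fluent-bit Lua filter callback signature (the annotation keys and field names here are assumptions, not taken from the issue):

```lua
-- Hypothetical sketch: read pod annotations attached by the kubernetes
-- filter and map them onto fields the Splunk output uses for routing.
function set_splunk_fields(tag, timestamp, record)
    local k8s = record["kubernetes"]
    if k8s == nil or k8s["annotations"] == nil then
        return 0, timestamp, record  -- 0 = record unchanged
    end
    local ann = k8s["annotations"]
    -- Annotation keys below are placeholders, not from the issue
    record["splunk_index"]      = ann["splunk.example.com/index"]
    record["splunk_source"]     = ann["splunk.example.com/source"]
    record["splunk_sourcetype"] = ann["splunk.example.com/sourcetype"]
    return 2, timestamp, record  -- 2 = record modified, timestamp kept
end
```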