fluent / fluentd

Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0

Log file fills up with error: "incoming chunk is broken" after about 24 hours of heavy traffic using 2.2.1 on linux #660

Open scumola opened 9 years ago

scumola commented 9 years ago

We're using the latest td-agent in production. After the agent has been running for about 24 hours, it starts complaining and logging "incoming chunk is broken" to /var/log/td-agent/td-agent.log. td-agent takes up 100% CPU and fills the disk with these log messages very quickly (about 5 GB/hr with our setup). We had to take td-agent out of production because this behavior was unacceptable and affected our production performance. This was on the app-server side; our td-agent configuration is here: http://pastebin.com/gEwKUPX1. We're basically just tailing nginx logs and letting the app send Rails logs directly to the td-agent socket, with failover servers downstream.
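
The pastebin config isn't reproduced here, but a minimal sketch of the kind of setup described (placeholder paths and hostnames, td-agent 2.x-era syntax assumed) would look roughly like this:

    # Hypothetical sketch only -- not the actual pastebin config.
    <source>
      @type tail                  # tail the nginx access log
      path /var/log/nginx/access.log
      pos_file /var/lib/td-agent/nginx-access.pos
      format nginx
      tag nginx.access
    </source>

    <source>
      @type forward               # socket the Rails app writes to directly
      port 24224
    </source>

    <match **>
      @type forward               # ship everything downstream
      <server>
        host log-primary.example.com
        port 24224
      </server>
      <server>
        host log-standby.example.com
        port 24224
        standby                   # failover server
      </server>
    </match>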

prasadkarkera commented 7 years ago

No, it's still an issue for me.

orasik commented 7 years ago

@scumola So with large logs it is still an issue. The thing is, I can see large logs reaching the destination without issues, so I'm not quite sure what the problem is. In my case I have a dockerised fluentd cluster on AWS that listens to other fluentd containers from different applications and forwards to CloudWatch. Sending directly from the fluentd containers to CloudWatch had absolutely no issues; it only started when I forwarded from fluentd to another fluentd.

orasik commented 7 years ago

Looking at the code on master, it looks like this is just a logged warning and, from my understanding, it does not affect the service:

# file lib/fluent/plugin/in_forward.rb
      # TODO: raise an exception if broken chunk is generated by recoverable situation
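      # (Editor's note, not in the upstream source: a well-formed forward-protocol
      # chunk decodes to a MessagePack array such as [tag, time, record] or
      # [tag, entries], so anything that is not an Array is logged as broken.)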
      unless msg.is_a?(Array)
        log.warn "incoming chunk is broken:", host: conn.remote_host, msg: msg
        return
      end

sebastianmacarescu commented 7 years ago

I think in my case the issue was related to the heartbeat. Since the fluentd aggregator node was running behind an AWS ELB, the connection was dropped after a period of inactivity (the ELB idle timeout) and the node got stuck reconnecting. I did not find a way to set the heartbeat to TCP using the fluentd docker logging driver. Also, I'm not even sure this is the actual problem, since I couldn't find any documentation or info related to this.

To overcome the problem, I decided to use td-agent (configured with a TCP heartbeat) just to forward logs from the containers (fluentd logging driver pointing to localhost) to an aggregator node on a separate ECS cluster. The solution seems to work OK, even for large and frequent data.
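
A minimal sketch of the forwarder side of such a setup, with a TCP heartbeat configured (placeholder hostname; not the poster's actual config), would be something like:

    # Sketch only: local forwarder with TCP heartbeat towards the aggregator.
    <source>
      @type forward               # receives from the fluentd docker logging driver
      port 24224
      bind 127.0.0.1
    </source>

    <match **>
      @type forward
      heartbeat_type tcp          # use TCP heartbeats, as described above
      <server>
        host aggregator.internal.example.com
        port 24224
      </server>
    </match>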

dpgaspar commented 5 years ago

We are getting something similar, but with in_tail forwarding to fluentd aggregators:

December 26th 2018, 13:10:37.903 | incoming chunk is broken: host="10.XX.YYY.ZZZ" msg=79
December 26th 2018, 13:10:37.903 | incoming chunk is broken: host="10.XX.YYY.ZZZ" msg=79
December 26th 2018, 13:10:37.903 | incoming chunk is broken: host="10.XX.YYY.ZZZ" msg=79
December 26th 2018, 13:10:37.903 | dump an error event: error_class=NoMethodError error="undefined method `merge!' for nil:NilClass" location="/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.6.1/lib/fluent/plugin/filter_record_transformer.rb:135:in `reform'" tag="raw.mysql.audit.log" time=79 record=nil

Notice also the undefined method `merge!'.

We are shipping 20K logs/s. This happens for 20-30 s and looks associated with a network failure between VPCs, since we notice other errors related to other services during the same period.

The Fluentd aggregators are also behind AWS ELBs.

By the way, is there a way to suppress repeated log messages on errors? This generated a td-agent.log of about 7 GB in seconds.
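
For context, the aggregator side of a pipeline like the one described above would be roughly shaped like the sketch below; the tag comes from the log excerpt, but the filter contents are only illustrative, since the actual config isn't shown.

    # Illustrative aggregator sketch; the record_transformer filter is the
    # component named in the NoMethodError above, which fires when the
    # incoming event's record is nil.
    <source>
      @type forward
      port 24224
      bind 0.0.0.0
    </source>

    <filter raw.mysql.audit.log>
      @type record_transformer
      <record>
        hostname "#{Socket.gethostname}"   # example enrichment only
      </record>
    </filter>

    <match raw.**>
      @type stdout                         # placeholder for the real output
    </match>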

eeshaisha commented 5 years ago

Any solution/workaround for this issue??

jstaffans commented 5 years ago

I experienced the same issue with Fluentd deployed behind an ELB. Switching to a Network Load Balancer solved it for me.

kdevendran-branch commented 5 years ago

Hi,

I am facing a similar issue. Any fix for this?

garild commented 5 years ago

I got the same issue ;)

Does anyone know how to fix it?

kiddingl commented 4 years ago

I got the same error with a Kubernetes ingress.

eredi93 commented 4 years ago

Also having this issue in production. I'm going to try to replicate it and share my findings.

eredi93 commented 4 years ago

I was able to replicate the issue by running a bunch of for loops to put some load on the application (I basically had 4 endless loops sending GET requests) and then sending big payloads to Fluentd. I found that payloads on the order of millions of characters were processed just fine when no other events were being sent to Fluentd, but when these big events were sent while Fluentd was also processing a high number of events, the issue started happening. Fluentd would fix itself after a couple of hours; alternatively, a restart also brings it back to a healthy state.

ksed commented 3 years ago

Yep, I've tried many different ways to start up fluentd via the docker examples and checked the docker logs to see why my posts hang, only to find heaps of these "message":"[input1] incoming chunk is broken: host=\"172.17.0.1\" msg=125" warnings piling up (if I don't stop the post), and no log-file content. And I can't find anyone who addresses this most basic issue. I expected the default fluent.conf to provide at least basic logging. But, alas, it's just not that simple.

This micro-service is certainly not fluent in my experience so far...
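
For reference, a minimal fluent.conf that provides the kind of basic logging expected here (a sketch along the lines of the Docker quickstart, not necessarily the image's exact default) is:

    # Minimal sketch: accept forward-protocol (MessagePack) traffic and print
    # every event to stdout so it shows up in `docker logs`.
    <source>
      @type forward
      port 24224
      bind 0.0.0.0
    </source>

    <match **>
      @type stdout
    </match>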

armaghanzhand commented 3 years ago

I have the same issue, if anyone is counting! :)

ManilynRamos commented 3 years ago

I have the same issue, and we're only using Docker.

wperezp commented 3 years ago

Same issue. We have a very basic td-agent deployment on an EC2 instance, receiving messages from the Treasure Data library on our backend servers. The messages seem to come from specific IPs, but I have no other clues whatsoever.

krushnakantb commented 2 years ago

We are getting the same warning message while performing a Nessus scan for security patches on that machine. Is there any solution? Normally we don't get it, but while scanning, the "incoming chunk is broken" message appears and after that fluentd stops working. After restarting the td-agent service it works properly again, but we want a more automated solution. Please provide a solution if there is one.