scumola opened this issue 9 years ago (still open)
No, it's still an issue for me.
@scumola So with large logs it is still an issue. The thing is, I can see large logs reaching the destination without issues, so I'm not quite sure what the problem is. In my case I have a dockerised fluentd cluster on AWS that listens to other fluentd containers from different applications and forwards to CloudWatch. Sending directly from the fluentd containers to CloudWatch had absolutely no issues; it only started when I forwarded from fluentd to another fluentd.
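For reference, a rough sketch of the aggregator side of that setup (an in_forward source feeding the fluent-plugin-cloudwatch-logs output); the tag, region, and log group/stream names here are placeholders, not my actual config:

```
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

<match app.**>
  @type cloudwatch_logs
  region us-east-1                  # placeholder region
  log_group_name my-app-logs        # placeholder log group
  log_stream_name my-app-stream     # placeholder log stream
  auto_create_stream true
</match>
```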
Looking at the code on master, it looks like it's not an issue in terms of logging, and from my understanding it does not affect the service:
```ruby
# file lib/fluent/plugin/in_forward.rb
# TODO: raise an exception if broken chunk is generated by recoverable situation
unless msg.is_a?(Array)
  log.warn "incoming chunk is broken:", host: conn.remote_host, msg: msg
  return
end
```
I think in my case the issue was related to the heartbeat. Since the fluentd aggregator node was running behind an AWS ELB, the connection was dropped after a period of inactivity (the ELB idle timeout) and the node got stuck reconnecting. I did not find a way to set the heartbeat to TCP using the fluentd Docker logging driver. Also, I'm not even sure this is the actual problem, since I couldn't find any documentation/info related to this.
To overcome the problem, I decided to use td-agent (configured with a TCP heartbeat) just to forward logs from containers (the fluentd logging driver pointing to localhost) to an aggregator node on a separate ECS cluster. The solution seems to work OK, even for large and frequent data.
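A minimal sketch of that local forwarder config (the port and aggregator hostname are placeholders; heartbeat_type tcp is the setting I used instead of the UDP heartbeat):

```
<source>
  @type forward
  port 24224
  bind 127.0.0.1
</source>

<match **>
  @type forward
  heartbeat_type tcp          # TCP heartbeat as the workaround for the ELB idle timeout
  <server>
    host aggregator.internal  # placeholder aggregator endpoint (behind the load balancer)
    port 24224
  </server>
</match>
```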
We are getting something similar, but with in_tail forwarding to fluentd aggregators:
```
December 26th 2018, 13:10:37.903 | incoming chunk is broken: host="10.XX.YYY.ZZZ" msg=79
December 26th 2018, 13:10:37.903 | incoming chunk is broken: host="10.XX.YYY.ZZZ" msg=79
December 26th 2018, 13:10:37.903 | incoming chunk is broken: host="10.XX.YYY.ZZZ" msg=79
December 26th 2018, 13:10:37.903 | dump an error event: error_class=NoMethodError error="undefined method `merge!' for nil:NilClass" location="/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.6.1/lib/fluent/plugin/filter_record_transformer.rb:135:in `reform'" tag="raw.mysql.audit.log" time=79 record=nil
```
Notice also the undefined method `merge!' error: filter_record_transformer apparently receives the nil record produced by the broken chunk and fails when it tries to merge into it.
We are shipping 20K logs/s. This happens for 20-30 seconds and looks associated with a network failure between VPCs, since we notice other errors related to other services during the same period.
The fluentd aggregators are also behind AWS ELBs.
By the way, is there a way to suppress repeated log messages on errors? This generated a td-agent.log of about 7 GB in seconds.
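For anyone hitting the giant-log-file side of this: newer fluentd 1.x releases have an ignore_repeated_log_interval option in the <system> section that collapses identical consecutive log lines. A sketch, assuming a version that supports it:

```
<system>
  # drop identical log messages repeated within this window
  ignore_repeated_log_interval 60s
</system>
```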
Any solution/workaround for this issue??
I experienced the same issue with Fluentd deployed behind an ELB. Switching to a Network Load Balancer solved it for me.
Hi,
I am facing a similar issue. Any fix for this?
I got the same issue ;)
Anyone know how to fix it?
I got the same error with a Kubernetes ingress.
Also having this issue in production. I'm going to try to replicate it and share my findings.
I was able to replicate the issue by running a bunch of for loops to put some load on the application (I basically had 4 endless loops sending GET requests) and then sending big payloads to Fluentd. I found that payloads on the order of millions of characters were processed just fine when no other events were being sent to Fluentd, but when these big events arrive while Fluentd is also processing a high number of events, the issue starts happening. I found that Fluentd would fix itself after a couple of hours; alternatively, a restart also brings it back to a healthy state.
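A rough sketch of that repro, using the fluent-logger gem against a local Fluentd on 24224 (in my actual test the background load came from GET requests against the app rather than direct posts, and the tags and sizes here are made up):

```ruby
require 'fluent-logger'

logger = Fluent::Logger::FluentLogger.new(nil, host: 'localhost', port: 24224)

# Background load: a few endless loops emitting small events.
4.times do
  Thread.new do
    loop { logger.post('repro.small', message: 'x' * 100) }
  end
end

# Foreground: keep sending very large payloads (millions of characters).
loop do
  logger.post('repro.big', payload: 'y' * 2_000_000)
  sleep 1
end
```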
Yep, I've tried many different ways to start up fluentd via the Docker examples, and checked the Docker logs to see why my POSTs hang, only to find loads of these
"message":"[input1] incoming chunk is broken: host=\"172.17.0.1\" msg=125"
warnings piling up (if I don't stop the POST), and no log-file content. And I can't find anyone who addresses this most basic issue. I expected the default fluent.conf to provide at least basic logging. But alas, it's just not that simple.
This micro-service is certainly not fluent in my experience so far...
I have the same issue, if anyone's counting! :)
I have the same issue, and we're only using Docker.
Same issue. We have a very basic td-agent deployment on an EC2 instance, receiving messages from the Treasure Data library on our backend servers. The messages seem to come from specific IPs, but I have no other clue whatsoever.
We get the same warning message while performing a Nessus scan for security patches on that machine. Normally we don't get it, but during the scan the "incoming chunk is broken" message appears and after that fluentd stops working. After restarting the td-agent service it works properly again, but we want some automated solution. Please provide a solution if there is one.
Using the latest td-agent in production. After the agent is about 24 hours old, it starts complaining and logging "incoming chunk is broken" in /var/log/td-agent/td-agent.log. td-agent took up 100% CPU and filled the disk with these log messages very quickly (about 5 GB/hr of them with our setup). We had to take td-agent out of production, as this was unacceptable behavior and affected our production performance. This was on the app-server side, and our td-agent configuration is here: http://pastebin.com/gEwKUPX1 We're basically just tailing nginx logs and letting the app send Rails logs directly to the td-agent socket, with failover servers downstream.
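In case the pastebin goes away, a minimal hypothetical sketch of the kind of config described (tail nginx logs, accept app logs on the local forward socket, forward downstream with a standby); paths, tags, and hosts are placeholders, not the original config:

```
<source>
  @type tail
  path /var/log/nginx/access.log        # placeholder path
  pos_file /var/log/td-agent/nginx-access.pos
  format nginx
  tag nginx.access
</source>

<source>
  @type forward                         # local socket the Rails app writes to
  port 24224
</source>

<match **>
  @type forward
  <server>
    host primary.example.internal       # placeholder primary aggregator
    port 24224
  </server>
  <server>
    host failover.example.internal      # placeholder failover server
    port 24224
    standby
  </server>
</match>
```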