scumola opened this issue 9 years ago
Another datapoint: we're using the fluent-logger gem (v0.4.3) to send logs to td-agent if that helps.
Messages are:
2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=34
2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=104
2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=116
Very weird.
fluent-logger should send events as an array object like [tag, time, record]. But from your log, your application sends integer objects to fluentd. How do you use fluent-logger in your code?
Here is the logging code:
def send_data(tag=nil, payload=nil)
  tag = "radish.calls" if tag.blank?
  unless payload.blank?
    begin
      Timeout::timeout(CUSTOM_LOG_TIMEOUT) do
        response = Fluent::Logger.post(tag, payload)
      end
    rescue Timeout::Error
      Rails.logger.info "FluentD Custom log Write timed out : #{Time.now.utc}"
    rescue Exception => e
      Rails.logger.info "FluentD write exception :#{e.message}"
    end
  end
end
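As background for the point above about the expected framing, here is a quick sketch of what a single well-formed forward-protocol event looks like on the wire, assuming the standard [tag, time, record] array; the tag and record are just examples, not taken from the app.

require 'msgpack'

# A well-formed event is a msgpack-encoded [tag, time, record] array.
# A bare integer in the stream is the kind of object behind the
# "incoming chunk is broken ... msg=<n>" warnings shown above.
event  = ['radish.calls', Time.now.to_i, { 'agent' => 'foo' }]
packed = event.to_msgpack
p MessagePack.unpack(packed)  # => ["radish.calls", <unix time>, {"agent"=>"foo"}]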
Ok, our developer who wrote the code says that we're using it like this:
require 'fluent-logger'
log = Fluent::Logger::FluentLogger.new(nil, :host=>'localhost', :port=>24224)
unless log.post("myapp.access", {"agent"=>"foo"})
  p log.last_error # You can get last error object via last_error method
end
but he is saying that the error comes from td-agent on our app servers not being able to talk downstream to the td-agent server that aggregates all of the logs, even though the message is being logged on the client app servers. Could this be the case? If so, is there a setting I can use in td-agent.conf that will keep the agent quiet on the app-server side if communication to the downstream td-agent server is lost? When we restart the agent on the app servers, they stay happy for an hour or two; then, when they hit this error, they fill up the td-agent.log file with several GB in less than an hour, and that can't happen in production on our app servers. Is there a setting to make it better behaved in this situation?
Since turning off the app->td-agent logging, the log messages have disappeared and td-agent is stable again, just forwarding the log-tailing that remains. So we believe this issue is something inside td-agent that breaks after a couple of hours at high volume, and that it has something to do with the receiving side of td-agent's TCP socket on the app server, not with td-agent->td-agent forwarding.
"...the log messages have disappeared and td-agent is stable again just forwarding the log-tailing that remains"
It means you use two input plugins, tail and forward, right?
"...a couple of hours time at high volume and has something to do with the receiving-side of the tcp socket part of td-agent on the app server."
So the problem happens only at app (fluent-logger) -> td-agent (in_forward), not at td-agent (out_forward) -> td-agent (in_forward)? Could you tell me your log volume (requests/sec and the size of one log) and your machine spec? I want to reproduce this problem in our environment.
Correct. We were sending 50-100 lines per hit in the app, using Ruby/Passenger on a very busy site. td-agent would flip out more quickly when traffic was higher (at peak times it would freak out sooner). The shortest time between restarts of the app was about 3-4 hours; the longest was around 12 hours. Let me know if I can help out some other way.
https://www.evernote.com/l/AClXQKXeFKNIj64x2h1Jh2cvk4a0tHo2VFY We were logging approx 10k log lines per minute (but spread across 40-ish servers) when this was enabled and working.
Thanks! I will try to reproduce this problem with similar log volume.
I have been testing the fluent-logger-ruby and in_forward combination with 300 logs/sec of traffic since yesterday, but there is no error. Does your app send logs to in_forward concurrently? N loggers -> 1 in_forward, or 1 logger -> 1 in_forward?
Multiple threads on the same machine to a single td-agent process
"Multiple threads on the same machine"
Are you sharing one fluent-logger between threads, or does each thread have its own fluent-logger?
Sharing one.
I have been testing fluent-logger using 6 threads at 300 records/sec for 2 days, but the above error doesn't happen. I'm still investigating this issue...
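For anyone else trying to reproduce this, a rough load-generation sketch along the lines of that test; the tag, record fields, and exact thread/rate numbers are arbitrary, and this is only a sketch, not the actual test harness used above.

require 'fluent-logger'

# Several threads sharing one FluentLogger instance, roughly 300 records/sec total.
logger = Fluent::Logger::FluentLogger.new('myapp', host: 'localhost', port: 24224)

threads = 6.times.map do |i|
  Thread.new do
    loop do
      logger.post('access', 'thread' => i, 'agent' => 'foo', 'ts' => Time.now.to_f)
      sleep 0.02  # ~50 records/sec per thread
    end
  end
end

sleep 60            # run for a minute, then stop
threads.each(&:kill)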
Same issue here. Running an application with Phusion Passenger. Using fluent-logger-ruby as a normal (request/Rails) logger for the application, and also using it to write logs with another tag; this uses a class variable in a controller, so it is shared between requests (and not threads, if I'm correct). Load is about 200 records/sec, and we are getting errors about broken chunks.
We have 16 Passenger Apps writing to the same single socket on the same machine.
[deploy@myapp01 ~]$ ps auxw | grep RubyApp
deploy 1349 0.0 1.9 680216 147240 ? Sl 18:59 0:01 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 1659 0.1 3.9 960148 303968 ? Sl 18:42 0:11 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 5218 0.0 1.6 677072 126328 ? Sl 19:55 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 5456 0.0 1.6 748452 129212 ? Sl 20:10 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 8044 0.0 1.6 677112 122672 ? Sl 20:26 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 9929 0.0 1.7 748452 132584 ? Sl 19:39 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 12707 0.0 1.7 682880 134304 ? Sl 19:24 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 17563 0.0 1.7 748456 137524 ? Sl 19:07 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 18803 0.0 2.1 752568 165676 ? Sl 18:52 0:02 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 21917 0.3 3.2 884676 246792 ? Sl 18:34 0:26 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 23135 0.0 1.6 682920 128408 ? Sl 20:18 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 24635 0.0 1.6 678052 125280 ? Sl 20:02 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 24710 0.0 1.6 678164 127584 ? Sl 19:48 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 27315 0.0 1.7 682136 136072 ? Sl 19:15 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 29626 0.3 1.6 681920 126520 ? Sl 20:33 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 30958 0.0 1.7 682920 133560 ? Sl 19:31 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 31651 0.0 0.0 103252 836 pts/0 S+ 20:34 0:00 grep RubyApp
Hmm... I need to set up a Passenger environment. Could you show me your Passenger configuration and version? I'm not familiar with Passenger, so I want to see an actual example.
Thanks! I will try it.
Any followup on this issue?
Sorry, I'm now busy with other tasks until this weekend. I will try this in the next week.
I set up nginx with Passenger on my EC2 environment and used the official demo app https://github.com/phusion/passenger-ruby-rails-demo . Now I am sending 250 requests/sec to this app. I hope to reproduce this issue...
You need to be also sending logs to fluentd from within the app using that gem. Is your app sending logs?
Yes. The Rails app sends request.params with additional fields to td-agent's in_forward using fluent-logger 0.5.0. The number of workers is 4.
Hmm... I can't reproduce this problem in two days. In Passenger, how do I launch multiple Passenger apps? By setting up multiple server blocks in nginx.conf?
We only use a single instance. Maybe the trick is that we use a class variable to cache the connection between requests. So in a model I have:
class TestClass
  def self.connection
    return @connection if defined? @connection
    fluent_config = YAML.load(ERB.new(Rails.root.join("config", "fluent-logger.yml").read).result)
    settings = {
      host: fluent_config['fluent_host'],
      port: fluent_config['fluent_port'],
      messages_type: 'string'
    }
    @connection = Fluent::Logger::FluentLogger.new(tag_prefix, settings)
  end

  def save
    self.class.connection.post(self.characteristic, self.serializable_hash)
  end

  ...
end
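Not a confirmed fix, but since the discussion above turns on whether a single fluent-logger instance is shared across threads, one way to rule that out is to give each thread its own logger. A minimal sketch, with a made-up tag prefix and hard-coded host/port in place of the YAML config above:

require 'fluent-logger'

# Per-thread logger: each thread lazily creates its own FluentLogger instead of
# all threads posting through one shared class-level connection.
# Tag prefix, host, and port are placeholders.
def fluent_logger
  Thread.current[:fluent_logger] ||= Fluent::Logger::FluentLogger.new(
    'myapp', host: 'localhost', port: 24224
  )
end

fluent_logger.post('calls', 'agent' => 'foo')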
We have the same problem. I sent 3,340,095 records in 10 minutes (~6,000/s). After a few minutes I found a huge number of these logs:
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=50
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=49
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=48
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=44
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=123
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=99
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=105
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=100
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=58
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=50
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=44
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=115
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=105
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=100
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=58
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=49
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=51
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=57
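One way to read these warnings (an observation, not something stated explicitly at this point in the thread): the msg=NN values look like single byte values from the broken chunk, and decoding the run above as characters yields plain JSON text, which matches the diagnosis later in the thread of JSON arriving on a connection that is being parsed as msgpack.

# Bytes taken from the msg= values in the warnings above, decoded as characters.
bytes = [55, 50, 55, 49, 55, 48, 44, 123, 34, 99, 105, 100, 34, 58, 50,
         44, 34, 115, 105, 100, 34, 58, 49, 51, 57]
puts bytes.pack('C*')
# => 727170,{"cid":2,"sid":139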
@breath-co2 Do you use passenger or plain Ruby app?
@repeatedly I use the PHP td-agent client. The td-agent version is 0.12.15. The td-agent config is like:
<source>
  type forward
  port 12400
</source>
<match **>
  type copy
  <store>
    type file
    ......
  </store>
  <store>
    type aws_s3
    ......
  </store>
  <store>
    type webhdfs
    ......
  </store>
</match>
@breath-co2 mod_php? Could you paste the sample app here, or does using the sample code at https://github.com/fluent/fluent-logger-php/tree/61ad26d9b492ceed197c3b9307e996b096fba123#usage with 6000 req/sec cause the problem?
@repeatedly Yesterday I tested it for a long time. I found that the data from my looped test was not corrupted, but the data from my online environment was. While debugging I found that it uses the on_read_msgpack method when the warn log is:
incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 55869" msg=117
incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 55869" msg=110
But my app pushes JSON data. td-agent has a very large amount of buffered data; I guess that after an abnormal buffer write, the td-agent server can no longer read the right data.
@repeatedly I finally found a way to test this bug. td-agent config is simple:
<source>
  type forward
  port 1200
</source>
<match **>
  type stdout
</match>
Run:
telnet 127.0.0.1 1200
If the first line you write is JSON data like ["test.a",1234567890,{"a":1}], and then you write a string like abcdef, you get "Connection closed by foreign host." That is correct behavior.
But if you telnet 127.0.0.1 1200 and write abcdef as the first line, you get warn logs like these:
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=97 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=98 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=99 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=100 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=101 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=102 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=13 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=10 type="msgpack"
I know that is the correct warn message, but if I next write JSON data like ["test.a",1234567890,{"a":1}], td-agent still outputs the "incoming chunk is broken" warn log and the connection is not closed, so the app client can keep writing data and has no way to tell that anything is wrong.
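For reference, the same reproduction can be scripted instead of typed into telnet; a rough sketch against the port from the config above:

require 'socket'

# Reproduce the telnet steps: send garbage first, then a well-formed JSON event.
sock = TCPSocket.new('127.0.0.1', 1200)
sock.write("abcdef\r\n")                         # bytes 97..102, 13, 10 -> the msg= values in the warnings
sock.write(%(["test.a",1234567890,{"a":1}]\n))   # valid event, but the connection is already in msgpack mode

# If the server closed the socket we see EOF within the timeout; otherwise the
# client keeps the connection open and never learns that its data was dropped.
if IO.select([sock], nil, nil, 2) && sock.eof?
  puts 'server closed the connection'
else
  puts 'connection still open; the client cannot tell anything went wrong'
end
sock.close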
@breath-co2 Thanks for the investigation. I will check it with your steps.
For the @breath-co2 case, it could happen because in_forward assumes the same format for the whole connection, and in_forward treats received data as msgpack when the content doesn't start with { or [. On the other hand, fluent-logger-ruby always sends msgpack format to in_forward, so that should not happen there.
I need more time...
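A rough model of that detection logic as described above; the method name and structure are illustrative, not the actual in_forward source:

require 'json'
require 'msgpack'

# Illustrative only: the first bytes on a connection decide whether the whole
# connection is parsed as JSON or as msgpack.
def parse_first_chunk(data)
  if data.start_with?('{', '[')
    [JSON.parse(data)]                              # JSON mode
  else
    objs = []
    MessagePack::Unpacker.new.feed_each(data) { |o| objs << o }
    objs                                            # msgpack mode
  end
end

p parse_first_chunk(%(["test.a",1234567890,{"a":1}]))  # parsed as one JSON event
p parse_first_chunk('abcdef')                          # each ASCII byte comes back as a small integer

Any connection whose first bytes are plain text ends up in the msgpack branch, which is why the warnings above report one small integer per byte.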
Any resolution on this thread?
Currently, outputting to a file and using in_tail is an alternative approach that avoids this problem.
We are now investigating this issue to find the cause: network, plugin, or client.
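A minimal sketch of that workaround on the application side, with a made-up file path and fields: the app appends newline-delimited JSON to a local file, and td-agent picks it up with an in_tail source (with a JSON format) instead of receiving events over the forward socket.

require 'json'

# Append one JSON object per line to a file that td-agent tails.
# Path and record contents are placeholders.
def log_event(record)
  File.open('/var/log/myapp/events.log', 'a') { |f| f.puts(JSON.generate(record)) }
end

log_event('agent' => 'foo', 'path' => '/index', 'status' => 200)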
Encountering this as well using fluent-logger-node -> local td-agent 0.12.20 -> remote fluentd.
On the local td-agent:
2016-02-15 13:09:08 +0100 [warn]: temporarily failed to flush the buffer. next_retry=2016-02-15 13:13:31 +0100 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fe4e48c8a58"
2016-02-15 13:09:08 +0100 [warn]: suppressed same stacktrace
config:
<match player-*.**>
  type forward
  host storage.foobar.io
  port 24224
</match>
On the remote:
2016-02-15 12:05:42 +0000 fluent.warn: {"source":"host: 193.248.147.205, addr: 193.248.147.205, port: 25633","msg":10,"message":"incoming chunk is broken: source=\"host: 193.248.147.205, addr: 193.248.147.205, port: 25633\" msg=10"}
@mgcrea How much traffic do you have, high or low? And how many local td-agents are sending to the remote td-agent? I assume an N : 1 topology.
@repeatedly Fixed my issue; it was not related to fluentd but to Docker + UFW. It looks like the new post-1.10 Docker networking stack gets constrained by UFW.
@mgcrea I see. Thanks for the info!
Also getting a log full of "incoming chunk is broken" from Docker (1.11)'s fluentd log driver running on Ubuntu.
@mgcrea how did you resolve this?
This resolved itself somehow; I restarted either the fluentd container or the log-emitting container, but I'm not sure which. Sorry I couldn't be more help.
It is still reproducible for us.
UPDATE: I built a cluster of fluentd servers using the latest code base (not the rpm, but the gems from the site), built on top of the latest Amazon AMI, so the kernel is the latest with all of the updates and fluentd is the latest with all of the updates (as of Jul 13, 2016). I've also made some sysctl.conf changes to increase the TCP window size on Linux.
What we're seeing is that the JSON payload from the fluentd client gem is getting corrupted almost immediately when we unleash production traffic on our fluentd cluster. I even put 10 fluentd servers behind an ELB in case it was an issue of too much traffic for Linux or fluentd to deal with, but the issue has to be something else. The machines continue to give the "incoming chunk is broken" errors almost immediately once production traffic starts flowing. Note that the payloads are larger than typical one-line web server logs. Once the "incoming chunk is broken" errors start to show up, the stability of the fluentd server is impacted pretty seriously; it's hard to say with the latest version of the code how bad the server's state gets, but it impacts performance on the client side once this warning starts to show up.
A typical payload is something like this: https://gist.github.com/scumola/bfd9775c246f71ca93deb363292aa9bb
@mgcrea How did you fix this? I am using fluentd Docker containers behind an AWS ELB and seeing this same error message. Not sure what's going on.
Personally, I don't think that it's fixed. We solved the issue by not sending large multi-line logs. Tailing a log with Fluentd seems to work much better than the client-side push of logs using their API, especially if it's a multi-line log. Just my $0.02.
@scumola Thanks. But I am seeing this in our dev environment, where fluentd is sitting idle most of the time and we don't send multi-line logs. FYI, I am using the fluentd Docker logging driver to forward the logs from other containers to fluentd.
@prasadkarkera Hi. Did you manage to solve the issue? I have a similar problem: I run a fluentd container behind an AWS ELB and use the Docker logging driver. After a while the driver just seems to stop sending logs and memory usage starts to build up. If I restart the source container, the logs fill up with 'incoming chunk is broken'.
Ok, so years after this first became a problem, we avoided it by reducing the logging that we're doing. When we were sending multi-line (really large) log lines through fluentd, we'd have multiple issues: the 'incoming chunk is broken' warning would show up when the log was not well-formed, or sometimes when field names conflicted with other fields of the same name. Reducing the size of our log lines seemed to make this issue go away for us. Logging large payloads may be easier with a logging system that doesn't parse the actual logs until they're in the storage engine.
We're using the latest td-agent in production. After the agent is about 24 hours old, it starts complaining and logging "incoming chunk is broken" in /var/log/td-agent/td-agent.log. td-agent takes up 100% CPU and fills the disk with these log messages very quickly (about 5 GB/hr of these messages with our setup). We had to take td-agent out of production due to this issue, as this behavior was unacceptable and affected our production performance. This was on the app-server side, and our td-agent configuration is here: http://pastebin.com/gEwKUPX1 We're basically just tailing nginx logs and allowing the app to send Rails logs directly to the td-agent socket, with failover servers downstream.