fluent / fluentd

Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0

Log file fills up with error: "incoming chunk is broken" after about 24 hours of heavy traffic using 2.2.1 on linux #660

Open scumola opened 9 years ago

scumola commented 9 years ago

We're using the latest td-agent in production. Once the agent has been running for about 24 hours, it starts complaining and logging "incoming chunk is broken" to /var/log/td-agent/td-agent.log. td-agent takes up 100% CPU and fills the disk with these log messages very quickly (about 5 GB/hr with our setup). We had to take td-agent out of production because this behavior was unacceptable and affected our production performance. This was on the app-server side, and our td-agent configuration is here: http://pastebin.com/gEwKUPX1 We're basically just tailing nginx logs and letting the app send Rails logs directly to the td-agent socket, with failover servers downstream.

scumola commented 9 years ago

Another data point: we're using the fluent-logger gem (v0.4.3) to send logs to td-agent, if that helps.

scumola commented 9 years ago

Messages are:

2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=34
2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=104
2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=116

repeatedly commented 9 years ago

Very weird.

https://github.com/fluent/fluentd/blob/d6b2c505f8c59661bbeef30fa590530f2dda1fe9/lib/fluent/plugin/in_forward.rb#L129

Here is the logging code. fluent-logger should send events as an array object like [tag, time, record], but from your log, your application sends integer objects to fluentd. How do you use fluent-logger in your code?
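
For reference, here is a minimal sketch of the wire format in_forward expects from fluent-logger, assuming the msgpack gem that the logger depends on (illustrative only, not the logger's actual internals):

require 'msgpack'
require 'socket'

# A single well-formed forward-protocol event on the wire is a msgpack-encoded
# array of [tag, time, record]. fluent-logger normally does this packing for you;
# this is only to illustrate the format.
event = ["myapp.access", Time.now.to_i, { "agent" => "foo" }]

sock = TCPSocket.new("127.0.0.1", 24224)
sock.write(MessagePack.pack(event))
sock.close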

scumola commented 9 years ago
  # CUSTOM_LOG_TIMEOUT is a constant defined elsewhere in the app;
  # Fluent::Logger.post goes through the default logger opened at app startup.
  def send_data(tag=nil, payload=nil)
    tag = "radish.calls" if tag.blank?
    unless payload.blank?
      begin
        Timeout::timeout(CUSTOM_LOG_TIMEOUT) do
          response = Fluent::Logger.post(tag, payload)
        end
      rescue Timeout::Error
        Rails.logger.info "FluentD Custom log Write timed out  : #{Time.now.utc}"
      rescue Exception => e # note: rescuing Exception swallows every error, not just logger failures
        Rails.logger.info "FluentD write exception :#{e.message}"
      end
    end
  end

scumola commented 9 years ago

Ok, our developer who wrote the code says that we're using it like this:

require 'fluent-logger'
log = Fluent::Logger::FluentLogger.new(nil, :host=>'localhost', :port=>24224)
unless log.post("myapp.access", {"agent"=>"foo"})
  p log.last_error # You can get last error object via last_error method
end

but he is saying that the error is from td-agent on our app servers not being able to talk downstream to the td-agent server that aggregates all of the logs, even though the message is getting logged on the client app servers. Could this be the case? If so, is there a setting I can use in td-agent.conf that will keep the agent quiet on the app-server side if communication to the downstream td-agent server is lost? When we restart the agent on the app servers, they stay happy for an hour or two; then, once they hit this error, they fill up the td-agent.log file with several GB in less than an hour, and that can't happen in production on our app servers. Is there a setting to make it better behaved in this situation?

scumola commented 9 years ago

Since turning off the app -> td-agent logging, the log messages have disappeared and td-agent is stable again, just forwarding the tailed logs that remain. So we believe this issue is something inside td-agent that breaks after a couple of hours at high volume, and that it has something to do with the receiving side of td-agent's TCP socket on the app server, not with td-agent -> td-agent forwarding.

repeatedly commented 9 years ago

the log messages have disappeared and td-agent is stable again, just forwarding the tailed logs that remain

It means you use two input plugins, tail and forward, right?

breaks after a couple of hours at high volume, and that it has something to do with the receiving side of td-agent's TCP socket on the app server

So the problem happens only with app (fluent-logger) -> td-agent (in_forward), not td-agent (out_forward) -> td-agent (in_forward)? Could you tell me your log volume, requests/sec, and the size of one log, and what your machine spec is? I want to reproduce this problem in our environment.

scumola commented 9 years ago

Correct. We were sending 50-100 lines per hit in the app, using Ruby/Passenger on a very busy site. td-agent would flip out more quickly when traffic was higher (at peak times it broke down sooner). The shortest interval between app restarts was about 3-4 hours; the longest was around 12 hours. Let me know if I can help out some other way.

scumola commented 9 years ago

https://www.evernote.com/l/AClXQKXeFKNIj64x2h1Jh2cvk4a0tHo2VFY We were logging approx 10k log lines per minute (but spread across 40-ish servers) when this was enabled and working.

repeatedly commented 9 years ago

Thanks! I will try to reproduce this problem with similar log volume.

repeatedly commented 9 years ago

I have been testing the fluent-logger-ruby and in_forward combination with 300 logs/sec of traffic since yesterday, but there is no error. Does your app send logs to in_forward concurrently? N loggers -> 1 in_forward or 1 logger -> 1 in_forward?

scumola commented 9 years ago

Multiple threads on the same machine to a single td-agent process.


repeatedly commented 9 years ago

Multiple threads on the same machine

Are you sharing one fluent-logger between threads, or does each thread have its own fluent-logger?

scumola commented 9 years ago

Sharing one.


repeatedly commented 9 years ago

I have been testing fluent-logger using 6 threads at 300 records/sec for 2 days, but the above error doesn't happen. I'm still investigating this issue...
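
For context, the test is roughly equivalent to this hypothetical harness (a sketch, not the actual test script): one shared logger, 6 threads, about 300 records/sec in total.

require 'fluent-logger'

# Hypothetical harness: one fluent-logger instance shared by 6 threads,
# about 50 records/sec per thread (~300 records/sec in total) for one minute.
logger = Fluent::Logger::FluentLogger.new(nil, host: 'localhost', port: 24224)

threads = 6.times.map do |i|
  Thread.new do
    3_000.times do |n|
      logger.post('loadtest.records', 'thread' => i, 'seq' => n, 'payload' => 'x' * 512)
      sleep 0.02 # ~50 records/sec per thread
    end
  end
end
threads.each(&:join)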

AKoetsier commented 9 years ago

Same issue here. We're running an application with Phusion Passenger and using fluent-logger-ruby as the normal (request/Rails) logger for the application. We also use it to write logs with another tag; this uses a class variable in a controller, so it is shared between requests (and not threads, if I'm correct). Load is about 200 records/sec, and we're getting errors about broken chunks.

scumola commented 9 years ago

We have 16 Passenger Apps writing to the same single socket on the same machine.

[deploy@myapp01 ~]$ ps auxw | grep RubyApp
deploy    1349  0.0  1.9 680216 147240 ?       Sl   18:59   0:01 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy    1659  0.1  3.9 960148 303968 ?       Sl   18:42   0:11 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy    5218  0.0  1.6 677072 126328 ?       Sl   19:55   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy    5456  0.0  1.6 748452 129212 ?       Sl   20:10   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy    8044  0.0  1.6 677112 122672 ?       Sl   20:26   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy    9929  0.0  1.7 748452 132584 ?       Sl   19:39   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   12707  0.0  1.7 682880 134304 ?       Sl   19:24   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   17563  0.0  1.7 748456 137524 ?       Sl   19:07   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   18803  0.0  2.1 752568 165676 ?       Sl   18:52   0:02 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   21917  0.3  3.2 884676 246792 ?       Sl   18:34   0:26 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   23135  0.0  1.6 682920 128408 ?       Sl   20:18   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   24635  0.0  1.6 678052 125280 ?       Sl   20:02   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   24710  0.0  1.6 678164 127584 ?       Sl   19:48   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   27315  0.0  1.7 682136 136072 ?       Sl   19:15   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   29626  0.3  1.6 681920 126520 ?       Sl   20:33   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   30958  0.0  1.7 682920 133560 ?       Sl   19:31   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   31651  0.0  0.0 103252   836 pts/0    S+   20:34   0:00 grep RubyApp

repeatedly commented 9 years ago

Hmm... I need to set up a Passenger environment. Could you show me your Passenger configuration and version? I'm not familiar with Passenger, so I want to see an actual example.

scumola commented 9 years ago

https://www.phusionpassenger.com/documentation/Users%20guide%20Nginx.html

repeatedly commented 9 years ago

Thanks! I will try it.

scumola commented 9 years ago

Any followup on this issue?

repeatedly commented 9 years ago

Sorry, I'm busy with other tasks until this weekend. I will try this next week.

repeatedly commented 9 years ago

I set up nginx with Passenger in my EC2 environment and used the official demo app https://github.com/phusion/passenger-ruby-rails-demo . Now I am sending 250 requests/sec to this app. I hope to reproduce this issue...

scumola commented 9 years ago

You also need to be sending logs to fluentd from within the app using that gem. Is your app sending logs?

repeatedly commented 9 years ago

Yes. The Rails app sends request.params with additional fields to td-agent's in_forward using fluent-logger 0.5.0. The number of workers is 4.

repeatedly commented 9 years ago

Hmm... I couldn't reproduce this problem in two days. In Passenger, how do you launch multiple Passenger apps? Set up multiple server blocks in nginx.conf?

AKoetsier commented 9 years ago

We only use a single instance. Maybe the trick is that we use a class variable to cache the connection between requests. So in a model I have:

class TestClass
  # @connection here is a class-level instance variable, so it is created once
  # per process and then reused by every request that process serves.
  def self.connection
    return @connection if defined? @connection
    fluent_config = YAML.load(ERB.new(Rails.root.join("config", "fluent-logger.yml").read).result)
    settings = {
        host: fluent_config['fluent_host'],
        port: fluent_config['fluent_port'],
        messages_type: 'string'
    }
    # tag_prefix is defined elsewhere in this class
    @connection = Fluent::Logger::FluentLogger.new(tag_prefix, settings)
  end

  def save
    self.class.connection.post(self.characteristic, self.serializable_hash)
  end

  ...
end
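
One hypothetical experiment (a sketch, not from this thread) is to serialize posts through a Mutex, to rule out interleaved writes when a single connection is shared between requests; whether fluent-logger already synchronizes writes internally isn't established here.

require 'fluent-logger'

# Hypothetical wrapper: force all posts through one Mutex so concurrent callers
# cannot interleave their writes on the shared connection.
class GuardedLogger
  def initialize(host: 'localhost', port: 24224)
    @logger = Fluent::Logger::FluentLogger.new(nil, host: host, port: port)
    @mutex  = Mutex.new
  end

  def post(tag, record)
    @mutex.synchronize { @logger.post(tag, record) }
  end
end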

breath-co2 commented 9 years ago

We have the same problem. I sent 3,340,095 records in 10 minutes (~6,000/s). After a few minutes I found a huge number of these logs:

2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=50
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=49
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=48
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=44
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=123
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=99
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=105
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=100
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=58
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=50
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=44
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=115
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=105
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=100
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=58
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=49
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=51
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=57

repeatedly commented 9 years ago

@breath-co2 Do you use passenger or plain Ruby app?

breath-co2 commented 9 years ago

@repeatedly I use the PHP td-agent client. The td-agent version is 0.12.15. The td-agent config looks like this:

<source>
  type forward
  port 12400
</source>

<match **>
  type copy
  <store>
    type file
    ......
  </store>
  <store>
    type aws_s3
    ......
  </store>
  <store>
    type webhdfs
    ......
  </store>
</match>

repeatedly commented 9 years ago

@breath-co2 mod_php? Could you paste the sample app here, or does sample code like the usage example at the link below cause the problem at 6,000 req/sec? > https://github.com/fluent/fluent-logger-php/tree/61ad26d9b492ceed197c3b9307e996b096fba123#usage

breath-co2 commented 9 years ago

@repeatedly Yesterday I tested it for a long time. I found that the looped test data is not corrupted, but the data from my online environment is. While debugging, I found that it uses the on_read_msgpack method when the warn log is:

incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 55869" msg=117
incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 55869" msg=110

But my app pushes JSON data. The td-agent has a very large buffer; I guess that once the buffer writes become abnormal, the td-agent server can no longer read the data correctly.

breath-co2 commented 9 years ago

@repeatedly I finally found a way to reproduce this bug. The td-agent config is simple:

<source>
  type forward
  port 1200
</source>
<match **>
  type stdout
</match>

Run:

telnet 127.0.0.1 1200

If the first line you write is JSON data, like ["test.a",1234567890,{"a":1}], and you then write a string like abcdef, you get "Connection closed by foreign host."

That is correct behavior.

But if you telnet 127.0.0.1 1200 and write abcdef as the first line, you will see these warn logs:

2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=97 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=98 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=99 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=100 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=101 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=102 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=13 type="msgpack"
2015-10-14 11:35:18 +0800 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 57310" msg=10 type="msgpack"

I know that warn message is correct, but if I next write JSON data like ["test.a",1234567890,{"a":1}], td-agent still outputs the "incoming chunk is broken" warn log, the connection is not closed, and the app client can keep writing data, so it cannot detect that anything is wrong.
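
For reference, the same reproduction can be scripted with a raw TCP socket instead of telnet (a sketch based on the steps above, assuming in_forward is listening on port 1200):

require 'socket'
require 'json'

sock = TCPSocket.new('127.0.0.1', 1200)

# First write is neither JSON nor msgpack, so in_forward treats the rest of
# the connection as a msgpack stream.
sock.write("abcdef\r\n")

# A well-formed JSON event sent afterwards is still parsed as msgpack, so
# td-agent logs "incoming chunk is broken" while the connection stays open and
# the client never sees an error.
sock.write(JSON.generate(["test.a", 1_234_567_890, { "a" => 1 }]) + "\r\n")

sock.close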

repeatedly commented 9 years ago

@breath-co2 Thanks for the investigation. I will check it with your steps.

repeatedly commented 9 years ago

For @breath-co2's case, it could happen because in_forward assumes the same format for the whole connection and treats received data as msgpack when the content doesn't start with { or [. On the other hand, fluent-logger-ruby always sends msgpack format to in_forward, so it should not happen there. I need more time...
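
A simplified sketch of what the msgpack path then does with plain text (not the actual plugin code): every ASCII byte below 0x80 decodes as a msgpack positive fixint, and each decoded object that is not an array is reported as a broken chunk, which is why the msg values in the telnet example above are just the character codes of abcdef\r\n.

require 'msgpack'

# Sketch only: how plain text looks to a msgpack stream parser.
unpacker = MessagePack::Unpacker.new
unpacker.feed_each("abcdef\r\n") do |obj|
  if obj.is_a?(Array)
    puts "event: #{obj.inspect}"
  else
    # prints msg=97 for "a", ..., msg=13 for "\r", msg=10 for "\n",
    # matching the warnings shown above
    puts "incoming chunk is broken: msg=#{obj}"
  end
end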

jarmaker commented 8 years ago

Any resolution on this thread?

repeatedly commented 8 years ago

Currently, writing the logs to a file and using in_tail is an alternative approach that avoids this problem.

We are still investigating this issue to find the cause of the problem: network, plugin, or client.
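
A sketch of that workaround on the application side, with hypothetical paths: append one JSON object per line to a file and let td-agent pick it up with an in_tail source (format json plus a pos_file) instead of receiving it over in_forward.

require 'json'

LOG_PATH = '/var/log/myapp/events.log'  # hypothetical path tailed by in_tail

# Append one JSON object per line; in_tail with `format json` turns each line
# into an event, so nothing is pushed over the forward socket.
def log_event(record)
  File.open(LOG_PATH, 'a') { |f| f.puts(JSON.generate(record)) }
end

log_event('agent' => 'foo', 'path' => '/index')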

mgcrea commented 8 years ago

Encountering this as well using fluent-logger-node -> local td-agent 0.12.20 -> remote fluentd.

On the local td-agent:

2016-02-15 13:09:08 +0100 [warn]: temporarily failed to flush the buffer. next_retry=2016-02-15 13:13:31 +0100 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fe4e48c8a58"
  2016-02-15 13:09:08 +0100 [warn]: suppressed same stacktrace

config:

<match player-*.**>
  type forward
  host storage.foobar.io
  port 24224
</match>

On the remote:

2016-02-15 12:05:42 +0000 fluent.warn: {"source":"host: 193.248.147.205, addr: 193.248.147.205, port: 25633","msg":10,"message":"incoming chunk is broken: source=\"host: 193.248.147.205, addr: 193.248.147.205, port: 25633\" msg=10"}

repeatedly commented 8 years ago

@mgcrea How much traffic do you have? High or low? And how many local td-agents send to the remote td-agent? I assume an N : 1 topology.

mgcrea commented 8 years ago

@repeatedly Fixed my issue; it was not related to fluentd but to Docker + UFW. It looks like the new post-1.10 Docker networking stack gets constrained by UFW.

repeatedly commented 8 years ago

@mgcrea I see. Thanks for the info!

kumavis commented 8 years ago

Also getting a log full of "incoming chunk is broken" from Docker (1.11)'s fluentd log driver running on Ubuntu. @mgcrea how did you resolve this?

kumavis commented 8 years ago

This resolved itself somehow. I restarted either the fluentd container or the log-emitting container, but I'm not sure which. Sorry I couldn't be more help.

Bhaskars commented 8 years ago

It is still reproducible for us.

scumola commented 8 years ago

UPDATE: I built a cluster of fluentd servers using the latest code base (not the RPM, but the gems from the site) on top of the latest Amazon AMI, so both the kernel and fluentd are fully up to date (as of Jul 13, 2016). I've also made some sysctl.conf changes to increase the TCP window size on Linux.

What we're seeing is that the JSON payload from the fluentd client gem gets corrupted almost immediately when we unleash production traffic on our fluentd cluster. I even put 10 fluentd servers behind an ELB in case it was simply too much traffic for Linux or fluentd to handle, but the issue has to be something else. The machines start giving the "incoming chunk is broken" errors almost immediately once production traffic starts flowing. Note that the payloads are larger than typical one-line web server logs. Once the "incoming chunk is broken" errors start to show up, the stability of the fluentd server is impacted pretty seriously; it's hard to say with the latest version of the code how bad the server's state is, but it hurts performance on the client side once this warning starts to appear.

A typical payload is something like this: https://gist.github.com/scumola/bfd9775c246f71ca93deb363292aa9bb

prasadkarkera commented 7 years ago

@mgcrea How did you fix this? I am using fluentd Docker containers behind an AWS ELB and seeing this same error message. Not sure what's going on.

scumola commented 7 years ago

Personally, I don't think that it's fixed. We solved the issue by not sending large multi-line logs. Tailing a log with Fluentd seems to work much better than the client-side push of logs using their API, especially if it's a multi-line log. Just my $0.02.

prasadkarkera commented 7 years ago

@scumola Thanks. But I am seeing this in our dev environment, where fluentd is sitting idle most of the time and we don't send multi-line logs. FYI, I am using the fluentd Docker logging driver to forward the logs from other containers to fluentd.

sebastianmacarescu commented 7 years ago

@prasadkarkera Hi. Did you manage to solve the issue? I have a similar problem: I run a fluentd container behind an AWS ELB and use the Docker logging driver. After a while the driver just seems to stop sending logs and memory usage starts to build up. If I restart the source container, the logs fill up with "incoming chunk is broken".

scumola commented 7 years ago

OK, so years after this first became a problem, we avoided it by reducing the amount of logging that we do. When we were sending multi-line (really large) log lines through fluentd, we'd have multiple issues: the "incoming chunk is broken" warning would show up when a log was not well-formed, or sometimes when field names conflicted with other fields of the same name. Reducing the size of our log lines seemed to make this issue go away for us. Logging large payloads may be handled more easily by a logging system that doesn't parse the logs until they're in the storage engine.