cityindex-attic / logsearch

[unmaintained] A development environment for ELK

Handling of Extremely Large Log Messages #114

Closed dpb587 closed 11 years ago

dpb587 commented 11 years ago

This morning I saw the broker get delayed due to what appear to be extremely large log messages. And by extremely large, I mean ~8MB... per raw log message, and there were 5 similar messages almost back to back. After parsing, that becomes about 16 MB per message going into Elasticsearch (between @message and @fields.message). After about 12 minutes, logstash crashed with the following message before it was automatically restarted and caught back up (losing the failing payload of 100+ log messages):

Exception in thread "LogStash::Runner" org.jruby.exceptions.RaiseException: (TimeoutError) watchdog timeout
        at RUBY.watch(file:/app/vendor/logstash.jar!/logstash/threadwatchdog.rb:31)
        at org.jruby.RubyArray.each(org/jruby/RubyArray.java:1613)
        at RUBY.watch(file:/app/vendor/logstash.jar!/logstash/threadwatchdog.rb:20)
        at RUBY.run_with_config(file:/app/vendor/logstash.jar!/logstash/agent.rb:504)
rake aborted!

By default, the http.max_content_length for elasticsearch is set to 100 MB. With 5 of those large events (out of the 100 it sends in bulk), that's definitely over 100 MB. The elasticsearch log gives the following warning when it cuts off the logstash connection:

[2013-08-08 13:55:10,489][WARN ][http.netty               ] [Hamilton Slade] Caught exception while handling client http traffic, closing connection [id: 0xf5a21aee, /10.228.35.215:41053 => /10.33.213.15:9200]
org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: HTTP content length exceeded 104857600 bytes.
        at org.elasticsearch.common.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:167)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
        at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)

This is a similar error to elasticsearch/elasticsearch#2137, except that we're talking HTTP to port 9200.

For now I've archived the messages that were coming through, in case you want to SSH into the broker and take a closer look. See ~/largemessagesize-redis.aof for the audit; specifically the large messages on lines 263, 308, 419, 435, 449, and 5465.

I see two options here, and I think we should consider both:

  1. Raise the http.max_content_length to something larger (e.g. 256MB) - see the config sketch after this list.
    • however... for large messages that do go through, the browser will hate receiving a 16MB+ JSON message in kibana whenever a query pulls the event into the table.
  2. Adjust the logger so that it doesn't create such large log messages in the first place; they're a huge burden on every system involved.
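
For reference, here's a sketch of what option 1 would look like in elasticsearch.yml (256mb is just the example value from above; the default corresponds to the 104857600 bytes in the warning):

# elasticsearch.yml - sketch only; the default is 100mb (104857600 bytes)
http.max_content_length: 256mb
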
dpb587 commented 11 years ago

@sopel, thoughts on a preferred direction for this issue?

dpb587 commented 11 years ago

@mrdavidlaing suggests trimming the log message when shipping; I like that best and will pursue that instead of the other two options.

sopel commented 11 years ago

@dpb587 - thanks for the detailed analysis, and I completely agree: our use case currently cares more about the availability of the logsearch functionality than about never losing any data in exceptional cases. Losing data would still come as a surprise to users, though (and an unnoticed one at that), so it would probably make sense to log that such a truncation is happening (and to count those exceptions too, even though they shouldn't occur anymore once truncation is in place) - is there an option to log a trimming event somehow?

dpb587 commented 11 years ago

@mrdavidlaing, what is the maximum length of log message you want to support before it starts trimming? 1kb, 16kb, 64kb, 128kb, 256kb, 512kb, 1mb, ...?

mrdavidlaing commented 11 years ago

Erm, 1mb?

dpb587 commented 11 years ago

I've updated the PPE broker with this code so it will no longer be an issue in the cluster.

@mrdavidlaing, you may want to update your local shipper configuration with the ruby filter to save some WAN bandwidth if more large messages come through.
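
For anyone reading later, here's a rough sketch of what such a trimming filter could look like in the shipper's logstash config. This is not the actual filter referenced above; the 1mb cap comes from the discussion, while the ruby accessor syntax and the 'truncated' tag name are assumptions that may need adjusting for the logstash version in use:

filter {
  ruby {
    # Sketch only: cap @message at ~1 MB and tag the event so that trimming
    # stays visible (and countable) downstream. Event accessors vary between
    # logstash versions; adjust as needed.
    code => "max = 1024 * 1024; msg = event['@message']; if msg && msg.length > max then event['@message'] = msg[0, max]; event.tags << 'truncated' end"
  }
}

Tagging the trimmed events would also address the earlier question about logging/counting truncations: a kibana query on the tag would show how often it happens.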