Closed dpb587 closed 11 years ago
@sopel, thoughts on a preferred direction for this issue?
@mrdavidlaing suggests trimming the log message when shipping; I like that best and will pursue that instead of the other two options.
@dpb587 - thanks for the detailed analysis, and I completely agree: our use case currently cares more about the availability of the logsearch functionality than about never losing any data, even in exceptional cases. Truncation would come as a surprise to users, though (and an unnoticed one at that), so it would probably make sense to log that such a truncation is happening (and count those exceptions too, even though they shouldn't occur anymore once truncation is in place) - is there an option to log a trimming event somehow?
@mrdavidlaing, what is the maximum length of log message you want to support before it starts trimming? 1kb, 16kb, 64kb, 128kb, 256kb, 512kb, 1mb, ...?
Erm, 1mb?
I've updated the PPE broker with this code so it will no longer be an issue in the cluster.
@mrdavidlaing, you may want to update your local shipper configuration with the ruby filter to save some WAN bandwidth if more large messages come through.
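For reference, the trimming logic could look something like the following Ruby sketch. This is a hedged illustration only: the 1 MB cap, the `trim_message` helper name, and the `_truncated` tag are assumptions for the example, not the exact filter deployed on the broker.

```ruby
# Hypothetical sketch of message trimming as it might run inside a
# logstash ruby filter. MAX_BYTES and the "_truncated" tag are assumptions.
MAX_BYTES = 1_048_576 # 1 MB cap, per the discussion above

def trim_message(message, tags = [])
  return [message, tags] if message.bytesize <= MAX_BYTES
  # Keep the first MAX_BYTES bytes and tag the event so the truncation
  # is visible (and countable) downstream.
  [message.byteslice(0, MAX_BYTES), tags + ["_truncated"]]
end

msg, tags = trim_message("x" * 2_000_000)
puts msg.bytesize  # 1048576
puts tags.inspect  # ["_truncated"]
```

Tagging the event rather than silently dropping bytes would also answer the question above about logging a trimming event: truncated messages stay searchable by tag.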
This morning I saw the broker get delayed due to what appear to be extremely large log messages. And by extremely large, I mean ~8 MB... per raw log message, and there were 5 similar messages nearly sequentially. After parsing, that becomes about 16 MB per message going into Elasticsearch (between `@message` and `@fields.message`). After about 12 minutes, logstash crashed with the following message before it was automatically restarted and got caught back up (losing the failing payload with 100+ log messages):

By default, `http.max_content_length` for elasticsearch is set to 100 MB. With 5 of those large events (out of the 100 it sends in bulk), that's definitely over 100 MB. The elasticsearch log gives the following warning when it cuts off the logstash connection:

This is a similar error to elasticsearch/elasticsearch#2137, except that we're talking HTTP to port 9200.
For now I've archived the messages that were coming through if you want to SSH into the broker and take a closer look. See `~/largemessagesize-redis.aof` for the audit, specifically the large messages on lines 263, 308, 419, 435, 449, and 5465.

I see two options here, and I think we should consider both:

- Increase `http.max_content_length` to something larger (e.g. 256MB).
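For that option, the limit lives in elasticsearch's node configuration; a minimal sketch, assuming the 256MB figure floated above:

```yaml
# elasticsearch.yml — hedged example, not a tuned recommendation
http.max_content_length: 256mb
```

Note this only raises the ceiling; a single pathological burst of large messages could still exceed any fixed limit, which is why trimming at the shipper is worth doing as well.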