How about the --emit-error-log-interval SECONDS option?
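For reference, a sketch of how the option would be passed on the command line (the config path here is illustrative):

# emit repeated error logs at most once every 60 seconds
fluentd -c /etc/td-agent/td-agent.conf --emit-error-log-interval 60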
No, it doesn't work. It still emits too many log lines.
2014-12-23 11:25:42 +0700 [warn]: temporarily failed to flush the buffer. next_retry=2014-12-23 11:24:59 +0700 error_class="Errno::ETIMEDOUT" error="Connection timed out - connect(2) for \"SOME_IP_HERE\" port 24224" plugin_id="object:3fe1badc22c0"
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/plugin/out_forward.rb:262:in `initialize'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/plugin/out_forward.rb:262:in `new'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/plugin/out_forward.rb:262:in `connect'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/plugin/out_forward.rb:224:in `send_data'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/plugin/out_forward.rb:144:in `block in write_objects'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/plugin/out_forward.rb:138:in `times'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/plugin/out_forward.rb:138:in `write_objects'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:450:in `write'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/buffer.rb:300:in `write_chunk'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/buffer.rb:280:in `pop'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:311:in `try_flush'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:132:in `run'
2014-12-23 11:25:42 +0700 [warn]: temporarily failed to flush the buffer. next_retry=2014-12-23 11:25:42 +0700 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fe1badc22c0"
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/plugin/out_forward.rb:156:in `write_objects'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:450:in `write'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/buffer.rb:300:in `write_chunk'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/buffer.rb:280:in `pop'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:311:in `try_flush'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:132:in `run'
2014-12-23 11:25:42 +0700 [warn]: temporarily failed to flush the buffer. next_retry=2014-12-23 11:25:44 +0700 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fe1b5d8ed58"
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/plugin/out_forward.rb:156:in `write_objects'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:450:in `write'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/buffer.rb:300:in `write_chunk'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/buffer.rb:280:in `pop'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:311:in `try_flush'
2014-12-23 11:25:42 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:132:in `run'
2014-12-23 11:25:43 +0700 [warn]: temporarily failed to flush the buffer. next_retry=2014-12-23 11:26:17 +0700 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fe1c0068df8"
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/plugin/out_forward.rb:156:in `write_objects'
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:450:in `write'
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/buffer.rb:300:in `write_chunk'
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/buffer.rb:280:in `pop'
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:311:in `try_flush'
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:132:in `run'
2014-12-23 11:25:43 +0700 [warn]: temporarily failed to flush the buffer. next_retry=2014-12-23 11:25:46 +0700 error_class="RuntimeError" error="no nodes are available" plugin_id="object:3fe1b99c3d28"
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/plugin/out_forward.rb:156:in `write_objects'
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:450:in `write'
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/buffer.rb:300:in `write_chunk'
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/buffer.rb:280:in `pop'
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:311:in `try_flush'
2014-12-23 11:25:43 +0700 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.57/lib/fluent/output.rb:132:in `run'
Sorry for the delay.
error="no nodes are available" plugin_id="object:3fe1badc22c0" error="no nodes are available" plugin_id="object:3fe1b5d8ed58" error="no nodes are available" plugin_id="object:3fe1c0068df8" error="no nodes are available" plugin_id="object:3fe1b99c3d28"
The different plugin_id in each log entry is weird. I will check it later.
Ah, okay. num_threads prevents stacktrace suppression. I'm now thinking about how to fix the problem.
Related to this issue, a warning was also emitted too many times, which made the hard disk full. The warning is about the buffer queue size exceeding its limit.
2015-01-26 09:25:14 +0700 [warn]: record_reformer: Fluent::BufferQueueLimitError queue size exceeds limit /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.58/lib/fluent/buffer.rb:182:in `block in emit'
There are 11 lines emitted in 1s; you can imagine how quickly that fills an 8 GB hard disk.
It seems your fluentd instance receives lots of requests or sends events to a stuck plugin. How about the following approaches to resolve the BufferQueueLimitError issue?
Use <secondary> to back up events.
That's a good idea, but should one plugin failure make the whole server go down? Currently we are not investing in adding more nodes (at least until we get bigger traffic).
should one plugin failure make the whole server go down?
It should not go down. In this problem, a long-lasting destination error combined with small buffer capacity causes lots of BufferQueueLimitError. Separating the streams relaxes the problem.
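To illustrate "separating the streams" (a sketch only; the tags, hosts, and paths are assumptions): giving each destination its own <match> and buffer means a stuck destination fills only its own queue:

# critical events: dedicated output and file buffer
<match app.critical.**>
  type forward
  buffer_type file
  buffer_path /var/log/td-agent/buffer/critical
  <server>
    host 192.0.2.10
    port 24224
  </server>
</match>

# bulk events: separate output; if this destination is stuck,
# only this buffer hits BufferQueueLimitError
<match app.bulk.**>
  type forward
  buffer_type file
  buffer_path /var/log/td-agent/buffer/bulk
  <server>
    host 192.0.2.11
    port 24224
  </server>
</match>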
How about adding a max log size option? So whatever the error, the log will never exceed a specific size (say, 1 GB, but it should be configurable).
So whatever the error, the log will never exceed a specific size
What does this line mean? Should the input plugin block when buffers are filled?
What does this line mean? Should the input plugin block when buffers are filled?
I think what he means is to set a maximum size the log can grow to. Beyond that size, older events would be removed (either deleted or sent to a backup/archive file) and new events added to keep the log size below the max size setting. Ultimately the size on disk of the log files needs to be controlled, so if sending to a backup/archive, the size and number of those files need to be controlled as well.
Hmm... I see. We are now developing a new plugin API, including the buffer. Maybe this is a chance to add another strategy to the buffer mechanism.
Is this thread about the logs of Fluentd itself? The new buffer plugin API has nothing to do with logging. Does @RIKIL mean log rotation of the Fluentd process itself?
Hmm... from "Beyond that size, older events would be removed (either deleted or sent to a backup/archive file) and new events added to keep the log size below the max size setting.", I assume he means the buffer's queue. And I think dieend's intention is also the buffer. If it means fluentd's own logging, then yes, the buffer plugin is not related.
We have also seen similar issues, as described in the original report: fluentd keeps logging some message repeatedly, using a lot of CPU in the process, and eventually filling up the disk.
I think this is partially an issue with the volume of log messages fluentd generates (which can be mitigated to some extent), but my bigger concern is whether some of these issues are independent of logging, i.e. even if we disabled logging, are there situations where fluentd gets stuck in a tight loop, trying to do something that fails repeatedly due to external factors?
Looks like there are a few things that can mitigate the log impact, at least:
Ah, I see 'emit_error_log_interval' is mentioned in http://docs.fluentd.org/articles/config-file but it is not explained. I tried it, and it does work, except it only suppresses specific messages.
For instance, in my case I did see 'emit transaction failed' only every 10s (as I had configured), but I still see 'syslog failed to emit' every time.
Yeah, we need to improve the logging mechanism for better control.
are there situations where fluentd gets stuck in a tight loop
It should not happen in fluentd's main thread. In a plugin thread, it may happen due to a plugin bug.
log rotation should keep the disk from filling up
We are now using logrotate for fluentd (td-agent) logs. Do you see any concern?
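For reference, a minimal logrotate policy of the kind being discussed (the path, size, and rotation count here are illustrative assumptions, not our actual settings):

/var/log/td-agent/td-agent.log {
    # rotate once the file exceeds 100 MB, keep 7 compressed generations
    size 100M
    rotate 7
    compress
    missingok
    notifempty
    # truncate in place so td-agent keeps writing to the same file handle
    copytruncate
}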
--suppress-repeated-stacktrace helps reduce the volume in some cases (should this be the default)
From v0.12, this is true by default.
Although it appears this is also off by default and not documented outside of the code.
Ah! I will update the documents later.
I still see 'syslog failed to emit' every time.
Hmm... currently, fluentd uses suppression per thread. We will change this to suppression per plugin.
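As a sketch: from v0.12 these knobs can also be set in the <system> section of the config file instead of on the command line (the values here are illustrative):

<system>
  # global log level for the fluentd process
  log_level warn
  # true by default from v0.12
  suppress_repeated_stacktrace true
  # emit repeated error logs at most once per interval (seconds)
  emit_error_log_interval 60
</system>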
I ran into this in 0.12.[16|18] even using --emit-error-log-interval. I've found it pretty easy to recreate:
journalctl -m -o json -f | ncat -w 2s -i 5s --send-only ${fluentd_ip} ${FLUENTD_PORT}
and once the buffers fill up, the logs explode with error msgs, journald pegs a core trying to keep up, and the disk eventually fills up. What also concerned me is that the server kept accepting new messages on the tcp_input.
This all occurs even if --emit-error-log-interval is set.
Short-term fix
# consul uses nagios-like checks, so `exit 1` is a warning and `exit 2` is a failure
( $(curl -s http://127.0.0.1:24220/api/plugins.json | jq -e '[.plugins[].buffer_queue_length] | sort | reverse | .[0] < 5') && echo 'running fine (buffers within limits)') || (echo 'getting too far behind; buffer is full.' && exit 2)
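Note this check assumes the monitor_agent HTTP endpoint is enabled on port 24220, e.g. with a source along these lines (a minimal sketch):

<source>
  # expose plugin metrics, including buffer_queue_length, at /api/plugins.json
  type monitor_agent
  bind 0.0.0.0
  port 24220
</source>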
QUESTION
This surprised me as it seems like a common error condition. Am I missing anything? I can recreate it pretty easily so I suspect not but I would love to be proven wrong.
And how can I have fluentD stop accepting incoming tcp requests if the buffer fills up?
Production Outage
This issue actually took down our entire cluster of fluentD aggregators in a few minutes:
add throttling logic to the script sending information to fluentD from journald
Fluentd's buffer is one of the throttling mechanisms. Would providing a blocking option help with this problem?
And how can I have fluentD stop accepting incoming tcp requests if the buffer fills up?
It needs several modifications. Michael suggests a block option, or having the input plugin handle the buffer-full error. For the latter, back-pressure seems the best way. Blocking the input is not always the best approach, but providing both behaviours is better.
It wasn't on an emergency list as it is only used for retroactive log reporting, so it was flagged as a business-hours fix
For such cases, a small retry_limit together with <secondary> for backup is another way. After the ES cluster has recovered, we can restore the logs from the backed-up files. Some companies use this approach.
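A sketch of that setup (the tag, host, and path are illustrative assumptions): with a small retry_limit, chunks that exhaust their retries are handed off to <secondary> and written to disk for later restore, instead of retrying and logging indefinitely:

<match app.**>
  type forward
  # give up after a few retries instead of retrying for hours
  retry_limit 5
  <server>
    host 192.0.2.10
    port 24224
  </server>
  # chunks that exceed retry_limit are written here as a backup
  <secondary>
    type file
    path /var/log/td-agent/failed-events
  </secondary>
</match>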
This server was logging 2,000+ errors a second
Did you set --emit-error-log-interval but it still generates the errors? What error did you get?
No responses or updates, so let me close this now. Please reopen if the problem still exists.
The log outputs this warning with a stack trace, and it repeats hundreds of thousands of times until the disk is full and the server goes down.
I tried to set log_level to error, but it still happened. I tried to add the --suppress-repeated-stacktrace option; the messages are reduced, but it is still outputting many, many logs, since it only suppresses the stacktrace, not the messages. The only workaround is setting the global log level with -qq, but with this I lose my ability to follow warnings and errors. Is there any way for me to receive as much useful log as possible without it repeating infinitely?
My partial configuration: