gregsymons closed this issue 6 years ago
I'm having a similar issue, but it's slightly different in that the connection is never restored --
2018-04-09 09:01:00 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-09 09:01:01 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-09 09:05:46 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-09 09:05:47 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-09 09:06:08 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-09 09:06:09 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-09 21:01:59 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2018-04-09 21:02:00 +0000 chunk="56970b3f264de1eda42c9cf0ba1ff1a8" error_class=Errno::ECONNRESET error="Connection reset by peer"
2018-04-09 21:01:59 +0000 [warn]: #0 /usr/lib/ruby/2.3.0/openssl/buffering.rb:322:in `syswrite'
2018-04-09 21:01:59 +0000 [warn]: #0 /usr/lib/ruby/2.3.0/openssl/buffering.rb:322:in `do_write'
2018-04-09 21:01:59 +0000 [warn]: #0 /usr/lib/ruby/2.3.0/openssl/buffering.rb:340:in `write'
2018-04-09 21:01:59 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:140:in `block (2 levels) in send_to_datadog'
2018-04-09 21:01:59 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:138:in `each'
2018-04-09 21:01:59 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:138:in `block in send_to_datadog'
2018-04-09 21:01:59 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:129:in `synchronize'
2018-04-09 21:01:59 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:129:in `send_to_datadog'
2018-04-09 21:01:59 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:124:in `write'
2018-04-09 21:01:59 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.1.3/lib/fluent/compat/output.rb:131:in `write'
2018-04-09 21:01:59 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.1.3/lib/fluent/plugin/output.rb:1096:in `try_flush'
2018-04-09 21:01:59 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.1.3/lib/fluent/plugin/output.rb:1329:in `flush_thread_run'
2018-04-09 21:01:59 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.1.3/lib/fluent/plugin/output.rb:439:in `block (2 levels) in start'
2018-04-09 21:01:59 +0000 [warn]: #0 /var/lib/gems/2.3.0/gems/fluentd-1.1.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2018-04-09 21:02:00 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2018-04-09 21:02:00 +0000 chunk="56970b3f264de1eda42c9cf0ba1ff1a8" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-09 21:02:00 +0000 [warn]: #0 suppressed same stacktrace
The last message simply repeats, and the exponential back-off delays attempts more and more until logs from the host effectively stop shipping. It seems like it's just attempting to write the same SSL block over and over, never having noticed that the connection was reset.
I'm experiencing a similar issue to @rootd00d. I have a device with what seems to be a spotty Internet connection, which is a problem in itself. The logs eventually just stop shipping.
Example log:
2018-04-14 01:14:38 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Connection timed out
2018-04-14 01:14:39 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-14 02:59:39 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-14 02:59:40 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-14 05:16:26 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Connection timed out
2018-04-14 05:16:27 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-14 05:16:27 +0000 [warn]: #0 no patterns matched tag="fluent.info"
2018-04-14 06:17:12 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Connection timed out
2018-04-14 06:17:13 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-14 06:59:44 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-14 06:59:45 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-14 07:59:45 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-14 07:59:46 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-14 10:16:18 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Connection timed out
2018-04-14 10:16:19 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-14 11:14:05 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Connection timed out
2018-04-14 11:14:06 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-14 16:01:03 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-14 16:01:04 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-14 17:59:37 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-14 17:59:38 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-14 20:15:26 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Connection timed out
2018-04-14 20:15:27 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-14 20:15:27 +0000 [warn]: #0 no patterns matched tag="fluent.info"
2018-04-14 20:22:04 +0000 [warn]: #0 Could not push logs to Datadog, attempt=4 max_attempts=-1 wait=9s error=Connection timed out - connect(2) for "intake.logs.datadoghq.com" port 10516
2018-04-14 20:22:04 +0000 [info]: #0 New attempt to Datadog attempt=4
2018-04-14 20:28:43 +0000 [warn]: #0 Could not push logs to Datadog, attempt=6 max_attempts=-1 wait=25s error=Connection timed out - connect(2) for "intake.logs.datadoghq.com" port 10516
2018-04-14 20:28:43 +0000 [info]: #0 New attempt to Datadog attempt=6
2018-04-14 20:35:23 +0000 [warn]: #0 Could not push logs to Datadog, attempt=8 max_attempts=-1 wait=30s error=Connection timed out - connect(2) for "intake.logs.datadoghq.com" port 10516
2018-04-14 20:35:23 +0000 [info]: #0 New attempt to Datadog attempt=8
2018-04-14 20:42:02 +0000 [warn]: #0 Could not push logs to Datadog, attempt=10 max_attempts=-1 wait=30s error=Connection timed out - connect(2) for "intake.logs.datadoghq.com" port 10516
2018-04-14 20:42:02 +0000 [info]: #0 New attempt to Datadog attempt=10
2018-04-14 20:48:41 +0000 [warn]: #0 Could not push logs to Datadog, attempt=12 max_attempts=-1 wait=30s error=Connection timed out - connect(2) for "intake.logs.datadoghq.com" port 10516
2018-04-14 20:48:41 +0000 [info]: #0 New attempt to Datadog attempt=12
2018-04-14 20:55:21 +0000 [warn]: #0 Could not push logs to Datadog, attempt=14 max_attempts=-1 wait=30s error=Connection timed out - connect(2) for "intake.logs.datadoghq.com" port 10516
2018-04-14 20:55:21 +0000 [info]: #0 New attempt to Datadog attempt=14
2018-04-14 20:59:47 +0000 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=2661.4450535959913 slow_flush_log_threshold=20.0 plugin_id="object:14ac114"
2018-04-15 00:01:03 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-15 00:01:04 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-15 00:59:49 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-15 00:59:50 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-15 03:24:21 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-15 03:24:22 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-15 04:59:37 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Broken pipe
2018-04-15 04:59:38 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-15 06:15:24 +0000 [warn]: #0 Could not push logs to Datadog, attempt=2 max_attempts=-1 wait=1s error=Connection timed out
2018-04-15 06:15:25 +0000 [info]: #0 New attempt to Datadog attempt=2
2018-04-15 07:17:22 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2018-04-15 07:17:23 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:17:22 +0000 [warn]: #0 /usr/lib/ruby/2.1.0/openssl/buffering.rb:326:in `syswrite'
2018-04-15 07:17:22 +0000 [warn]: #0 /usr/lib/ruby/2.1.0/openssl/buffering.rb:326:in `do_write'
2018-04-15 07:17:22 +0000 [warn]: #0 /usr/lib/ruby/2.1.0/openssl/buffering.rb:344:in `write'
2018-04-15 07:17:22 +0000 [warn]: #0 /var/lib/gems/2.1.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:140:in `block (2 levels) in send_to_datadog'
2018-04-15 07:17:22 +0000 [warn]: #0 /var/lib/gems/2.1.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:138:in `each'
2018-04-15 07:17:22 +0000 [warn]: #0 /var/lib/gems/2.1.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:138:in `block in send_to_datadog'
2018-04-15 07:17:22 +0000 [warn]: #0 /var/lib/gems/2.1.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:129:in `synchronize'
2018-04-15 07:17:22 +0000 [warn]: #0 /var/lib/gems/2.1.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:129:in `send_to_datadog'
2018-04-15 07:17:22 +0000 [warn]: #0 /var/lib/gems/2.1.0/gems/fluent-plugin-datadog-0.9.6/lib/fluent/plugin/out_datadog.rb:124:in `write'
2018-04-15 07:17:22 +0000 [warn]: #0 /var/lib/gems/2.1.0/gems/fluentd-1.1.2/lib/fluent/compat/output.rb:131:in `write'
2018-04-15 07:17:22 +0000 [warn]: #0 /var/lib/gems/2.1.0/gems/fluentd-1.1.2/lib/fluent/plugin/output.rb:1094:in `try_flush'
2018-04-15 07:17:22 +0000 [warn]: #0 /var/lib/gems/2.1.0/gems/fluentd-1.1.2/lib/fluent/plugin/output.rb:1327:in `flush_thread_run'
2018-04-15 07:17:22 +0000 [warn]: #0 /var/lib/gems/2.1.0/gems/fluentd-1.1.2/lib/fluent/plugin/output.rb:439:in `block (2 levels) in start'
2018-04-15 07:17:22 +0000 [warn]: #0 /var/lib/gems/2.1.0/gems/fluentd-1.1.2/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2018-04-15 07:17:23 +0000 [warn]: #0 failed to flush the buffer. retry_time=0 next_retry_seconds=2018-04-15 07:17:23 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:17:23 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 07:17:24 +0000 [warn]: #0 failed to flush the buffer. retry_time=1 next_retry_seconds=2018-04-15 07:17:24 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:17:24 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 07:17:26 +0000 [warn]: #0 failed to flush the buffer. retry_time=2 next_retry_seconds=2018-04-15 07:17:26 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:17:26 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 07:17:31 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2018-04-15 07:17:31 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:17:31 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 07:17:39 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=2018-04-15 07:17:39 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:17:39 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 07:17:56 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=2018-04-15 07:17:56 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:17:56 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 07:18:27 +0000 [warn]: #0 failed to flush the buffer. retry_time=6 next_retry_seconds=2018-04-15 07:18:27 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:18:27 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 07:19:30 +0000 [warn]: #0 failed to flush the buffer. retry_time=7 next_retry_seconds=2018-04-15 07:19:30 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:19:30 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 07:21:23 +0000 [warn]: #0 failed to flush the buffer. retry_time=8 next_retry_seconds=2018-04-15 07:21:23 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:21:23 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 07:25:59 +0000 [warn]: #0 failed to flush the buffer. retry_time=9 next_retry_seconds=2018-04-15 07:25:59 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:25:59 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 07:33:44 +0000 [warn]: #0 failed to flush the buffer. retry_time=10 next_retry_seconds=2018-04-15 07:33:44 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:33:44 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 07:49:56 +0000 [warn]: #0 failed to flush the buffer. retry_time=11 next_retry_seconds=2018-04-15 07:49:56 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 07:49:56 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 08:24:30 +0000 [warn]: #0 failed to flush the buffer. retry_time=12 next_retry_seconds=2018-04-15 08:24:30 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 08:24:30 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 09:38:34 +0000 [warn]: #0 failed to flush the buffer. retry_time=13 next_retry_seconds=2018-04-15 09:38:34 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 09:38:34 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 11:42:10 +0000 [warn]: #0 failed to flush the buffer. retry_time=14 next_retry_seconds=2018-04-15 11:42:10 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 11:42:10 +0000 [warn]: #0 suppressed same stacktrace
2018-04-15 15:43:29 +0000 [warn]: #0 failed to flush the buffer. retry_time=15 next_retry_seconds=2018-04-15 15:43:29 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-15 15:43:29 +0000 [warn]: #0 suppressed same stacktrace
2018-04-16 01:17:02 +0000 [warn]: #0 failed to flush the buffer. retry_time=16 next_retry_seconds=2018-04-16 01:17:02 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-16 01:17:02 +0000 [warn]: #0 suppressed same stacktrace
2018-04-16 18:24:03 +0000 [warn]: #0 failed to flush the buffer. retry_time=17 next_retry_seconds=2018-04-16 18:24:03 +0000 chunk="569dde1cb646f5e640e7f340ca080c57" error_class=OpenSSL::SSL::SSLError error="SSL_write: bad write retry"
2018-04-16 18:24:03 +0000 [warn]: #0 suppressed same stacktrace
Hello @gregsymons , @rootd00d , @tmcgann
Sorry for the delay here and thanks a lot for reporting. We will look into this and check why the retry logic does not work as expected.
Thanks, Nils
I feel like the discussion linked here sums it up best: on a retry, the payload handed to OpenSSL has to be identical to the one from the failed write. I wonder if we can force the SSL layer to prevent any kind of session caching or something?
https://postgrespro.com/list/thread-id/1783799
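That "identical payload" requirement is what produces the `SSL_write: bad write retry` error above: once a write on an `OpenSSL::SSL::SSLSocket` fails mid-frame, OpenSSL expects the next write attempt to resume with the same buffer, and any other write on that socket fails. One way around it is to never reuse a socket whose write just failed. Below is a minimal hypothetical sketch of that idea (the class and method names are illustrative, not the plugin's actual API):

```ruby
require "socket"
require "openssl"

# Sketch: discard the TLS socket on any write error so the next retry
# opens a fresh session, instead of re-entering OpenSSL's
# "retry with the same buffer" state and hitting "bad write retry".
class ReconnectingClient
  def initialize(host, port)
    @host = host
    @port = port
    @ssl = nil
  end

  def write(payload)
    connect if @ssl.nil?
    @ssl.syswrite(payload)
  rescue OpenSSL::SSL::SSLError, SystemCallError
    # Errno::ECONNRESET, Errno::EPIPE, Errno::ETIMEDOUT are all
    # SystemCallError subclasses, so one clause covers them.
    close
    raise # let the caller's retry loop decide when to try again
  end

  private

  def connect
    tcp = TCPSocket.new(@host, @port)
    @ssl = OpenSSL::SSL::SSLSocket.new(tcp, OpenSSL::SSL::SSLContext.new)
    @ssl.connect
  end

  def close
    begin
      @ssl.close if @ssl
    rescue IOError, SystemCallError
      # the socket is already broken; ignore errors while discarding it
    end
    @ssl = nil
  end
end
```

The cost is a fresh TLS handshake per failed write, but on a connection that was reset anyway there is nothing to save.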
So I think we have two issues here. In one, connections that sit idle too long due to low log activity get cut off unexpectedly. The other has to do with the SSL retry failures. The first issue is the one I've seen; I've not seen the second yet.
If I had to guess, “it’s under the covers.”
So, looking at the retry more closely, it looks like @rootd00d's problem may be caused by the fact that the rescue block doesn't catch ECONNRESET, so the client socket doesn't get reset and gets reused on the next attempt, despite being no longer valid. I'm not sure exactly what the unhandled exception for @tmcgann is, since we do catch ETIMEDOUT (and successfully recover at several points in the log), but I think it's the same problem.
Maybe we shouldn't be so specific about the exceptions we catch?
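In that spirit, a less specific rescue around the send loop could catch `SystemCallError` as a whole, which covers `ECONNRESET`, `EPIPE`, `ETIMEDOUT`, and friends in one clause, and always throw away the stale socket before retrying. A hedged sketch of that shape (here `client` and `reconnect!` are illustrative names, not the plugin's actual API):

```ruby
require "openssl"

# Retryable errors: SystemCallError is the parent of all Errno::* classes,
# so this also covers ECONNRESET, which the plugin's rescue currently misses.
RETRYABLE = [OpenSSL::SSL::SSLError, SystemCallError, IOError].freeze

def send_with_retry(client, payload, wait: 1, max_wait: 30)
  begin
    client.write(payload)
  rescue *RETRYABLE => e
    warn "Could not push logs to Datadog, wait=#{wait}s error=#{e.message}"
    client.reconnect!           # never reuse a socket that just failed
    sleep wait
    wait = [wait * 2, max_wait].min  # exponential back-off, capped
    retry
  end
end
```

With `max_attempts=-1` as in the logs above, this loops forever; the key difference is that every retry starts from a fresh connection, so a single reset can't wedge the output permanently.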
Thank you for opening this issue, for all your feedback and thorough analysis. I have just deployed a new version of the plugin, for more details see https://github.com/DataDog/fluent-plugin-datadog/pull/12. Cheers
Awesome! I'll throw this in the works tonight and let you know how we fare, but from the looks of it this seems like an adequate fix. Thanks so much @gregsymons & @ajacquemot!
We have a number of hosts with very low log volume (mostly because we haven't fully rolled out log collection, so they have fluentd configured and running, but it isn't reading any log files). We've noticed that most of the logs shipped by these hosts are actually fluent warnings like this one:
The attempt count is wrong (see my PR #4), and in every case we've seen, the next attempt seems to succeed. While I was in the code, I noticed there's some dead code to turn on TCP keepalive, which I would expect to solve the problem. Is there a reason not to actually do so?
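For reference, turning on keepalive would look something like the sketch below, so the kernel probes an idle connection and surfaces a dead peer before the next flush fails. The socket-option constants are standard; the per-socket tuning knobs (`TCP_KEEPIDLE` etc.) are Linux-specific, hence the guard. This is an assumption about how the dead code might be wired up, not the plugin's actual implementation:

```ruby
require "socket"

# Enable TCP keepalive on an established socket: start probing after
# `idle` seconds of silence, probe every `interval` seconds, and give up
# (killing the connection) after `count` unanswered probes.
def enable_keepalive(sock, idle: 60, interval: 10, count: 6)
  sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, true)
  if defined?(Socket::TCP_KEEPIDLE) # Linux-only tuning knobs
    sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPIDLE, idle)
    sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPINTVL, interval)
    sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPCNT, count)
  end
  sock
end
```

On hosts with very low log volume this would turn "broken pipe on the next write, hours later" into a connection the OS has already flagged as dead, which the retry logic can handle cleanly.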