dwbutler / logstash-logger

Ruby logger that writes logstash events
MIT License

TCP logging breaks rails when connection breaks #126

Closed brendenyule closed 3 years ago

brendenyule commented 7 years ago

If we stop our ELK server while our app is running, the app eventually stops working altogether after about 5 minutes. We are using Passenger Standalone, and it doesn't even seem to notice that anything is wrong; we don't see any helpful errors anywhere.

If we use UDP, an ELK outage is handled gracefully.

Here is our configuration for our QA environment:

  config.logger = LogStashLogger.new(
    type: :multi_delegator,
    outputs: [
      { type: :file, path: "log/#{Rails.env}.log" },
      { type: :tcp, host: ENV["ELK_LOG_HOST"], port: ENV["ELK_LOG_PORT"] }
    ]
  )

  LogStashLogger.configure do |config|
    config.customize_event do |event|
      event["application_name"] = ENV["APPLICATION_TYPE"].present? ? ENV["APPLICATION_TYPE"] : "SentryLinkwebserver"
      event["environment_name"] = Rails.env
    end
  end
dwbutler commented 7 years ago

Hi,

What version of LogStashLogger are you using? More recent versions will drop log messages when the buffer is full, by default. Older versions would block once the buffer got full.
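
For illustration, the drop-versus-block behavior is controlled when the logger is constructed. A minimal sketch, with option names taken from the buffer configuration dumped later in this thread (treat them as assumptions for your exact version):

config.logger = LogStashLogger.new(
  type: :tcp,
  host: ENV["ELK_LOG_HOST"],
  port: ENV["ELK_LOG_PORT"],
  buffer_max_items: 50,                # flush once this many events are buffered
  drop_messages_on_full_buffer: true   # drop new events instead of blocking when the buffer is full
)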

brendenyule commented 7 years ago

We are using one of the brand new versions:

 gem 'logstash-logger', '0.23.0'

We were using an older version, but upgraded when this issue came up.

dwbutler commented 7 years ago

I created a simple test rig to try to reproduce the problem, and could not. You can see my results here: https://gist.github.com/dwbutler/8f1a779c5b29055c9d1c2bb2d6d607fb

I ran the experiment for more than five minutes. I observed that the log file was still being written to, indicating that my Ruby process was not blocked.

The only unusual thing I noticed was memory usage slowly growing over time. Since LogStashLogger tries to reconnect to the TCP server every 5 seconds, it's possible the repeated reconnect attempts are responsible for that growth.

It might be helpful for you to set a custom error_logger and buffer_logger. By default, errors are logged to STDERR. It would also be helpful if you could create a sample app that reproduces the issue.
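
A sketch of pointing those diagnostics somewhere more visible than STDERR, assuming the error_logger and buffer_logger options are accepted in the device configuration as suggested above:

config.logger = LogStashLogger.new(
  type: :tcp,
  host: ENV["ELK_LOG_HOST"],
  port: ENV["ELK_LOG_PORT"],
  error_logger: Logger.new("log/logstash_errors.log"),  # connection errors end up here instead of STDERR
  buffer_logger: Logger.new("log/logstash_buffer.log")  # low-level buffer activity, useful for debugging
)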

dwbutler commented 7 years ago

@brendenyule I ran my test rig again but this time replaced the single-threaded loop with a more realistic simulation of a production web server:

30.times do
  Thread.new do
    loop { logger.info("#{Thread.current.object_id} - #{Time.now}"); sleep(1); }
  end
end

I ran it for quite some time, but did not observe any deadlocking behavior.

It might help to know what version of Ruby, Passenger, and Rails you are running.

dwbutler commented 7 years ago

Since you say things work fine with UDP but not TCP, the only other thing I can possibly think of is that Ruby's implementation of TCP sockets holds onto the TCP file descriptor long after the connection is closed, but not UDP sockets. You can check and see if you're leaking connections / file descriptors by checking lsof on your server. If you do find that TCP connections take a while before the OS actually gets rid of them, your only recourse might be to increase your file descriptor limit.
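
As a rough way to check, here is a sketch that counts TIME_WAIT connections and the descriptors held by the process (it assumes a Linux host with netstat available; lsof works just as well):

# Run inside the app process (e.g. from a Rails console) to get a quick snapshot.
time_wait = `netstat -an`.lines.count { |line| line.include?("TIME_WAIT") }
open_fds  = Dir.glob("/proc/#{Process.pid}/fd/*").size  # Linux-only: descriptors held by this process
puts "TIME_WAIT connections: #{time_wait}, open file descriptors: #{open_fds}"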

LogStashLogger will keep trying to flush the buffer every 5 seconds (or even more frequently if you have a high volume of log messages), and will try to open a new connection on every retry. It can be overly aggressive in this regard, and there is no exponential backoff in place. So it wouldn't surprise me to find out that, if the OS takes (say) 20 minutes to actually get rid of a closed TCP connection, hundreds of zombie connections pile up in a matter of minutes. This could easily bump you over the default limit of 1024 file descriptors.

So it will probably also be helpful to know what OS you're running since this behavior might be OS specific. (I tested on a Macbook Pro.)

dwbutler commented 7 years ago

According to this, even after a TCP client has closed the connection, the OS will keep the connection open for 4 minutes in TIME_WAIT status "just in case" any more packets arrive from the server so they don't disrupt the next client that wants to use the same port.

brendenyule commented 7 years ago

Sorry for the wait.

We are using:

- Red Hat 4.8.3-9
- Phusion Passenger 4.0.59
- Rails 4
- Ruby 2.2.3p173

dwbutler commented 7 years ago

What mixture of processes vs threads are you using? That would tell us how many connections are potentially being opened. And did you check the output of lsof to see if there were a large number of connections in TIME_WAIT status?

cabello commented 6 years ago

I think my app is showing some unexpected behaviour that is related to this issue.

We are seeing "log writing failed." messages in our logs, and they correlate with Rack::Timeout entries in our tracing tool (Datadog).

The logs have the following:

source=rack-timeout id=828cc4688e4051bc81f242bbe0eeb251 timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms
source=rack-timeout id=b61da0d0d8fc1bd52b6fa52b1fc21510 timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms
source=rack-timeout id=35b9e0c9a21b043c561477e27a6fedc5 timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms
source=rack-timeout id=a77929294d3749bf6f0f087a26bd3bab timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms
source=rack-timeout id=47c345b3d51fb2fd6b59ebef9fa91bbb timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms
source=rack-timeout id=3cda3e9dc73c2719c3b1bdd67cf8df80 timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms

We are using TCP logging with Puma set to 2 workers, and this is the message displayed when booting up the server:

Puma starting in cluster mode...
* Version 3.10.0 (ruby 2.3.5-p376), codename: Russell's Teapot
* Min threads: 5, max threads: 5
* Environment: production
* Process workers: 2
* Preloading application
* Listening on tcp://0.0.0.0:3000
! WARNING: Detected 1 Thread(s) started in app boot:
! #<Thread:0x00000000039f2998@/usr/local/bundle/gems/logstash-logger-0.25.0/lib/logstash-logger/buffer.rb:297 sleep> - /usr/local/bundle/gems/logstash-logger-0.25.0/lib/logstash-logger/buffer.rb:299:in `sleep'

That points to https://github.com/dwbutler/logstash-logger/blob/9e8369e2d6a97ac45daf42506e045095e5744ba5/lib/logstash-logger/buffer.rb#L297

Even the simple HealthcheckController is timing out, and we are clueless about what's going on. I just changed to UDP to see if that addresses the issue; the other thing I am considering is decreasing workers to 1 so Puma does not run in clustered mode.

Hopefully this data is useful when trying to find the origin of the issue. :)

dwbutler commented 6 years ago

Thanks for the detailed information. That makes me suspect that the LogStashLogger buffer is being initialized during app boot, before a worker process is forked. This creates a background thread. The app server then forks a worker process but reuses the same thread, which is a Bad Thing. Mixing threading and forking always results in all kinds of misbehavior.

It would probably be best to reinitialize LogStashLogger after a worker process forks. I believe this can be accomplished in the following way (assuming Rails):

Passenger:

::PhusionPassenger.on_event(:starting_worker_process) do |forked|
  Rails.logger.reconnect
end

Puma:

# In config/puma.rb
on_worker_boot do
  Rails.logger.reconnect
end

Let me know if this helps resolve the issue.

cabello commented 6 years ago

@dwbutler thanks for the quick reply, I will try this soon. I just got a stack trace from another service I also maintain:

E, [2017-11-20T15:35:31.255065 #8] ERROR -- : [LogStashLogger::Device::TCP] Errno::ETIMEDOUT - Connection timed out
2017-11-20 15:35:31 +0000: Rack app error handling request { GET / }
#<Rack::Timeout::RequestTimeoutError: Request ran for longer than 15000ms>
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/tcp.rb:32:in `initialize'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/tcp.rb:32:in `new'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/tcp.rb:32:in `non_ssl_connect'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/tcp.rb:23:in `connect'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/connectable.rb:117:in `with_connection'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/connectable.rb:100:in `write_batch'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/connectable.rb:67:in `flush'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/buffer.rb:243:in `block in buffer_flush'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/buffer.rb:240:in `each'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/buffer.rb:240:in `buffer_flush'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/connectable.rb:64:in `flush'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/logger.rb:20:in `flush'
/usr/local/bundle/gems/activesupport-5.0.6/lib/active_support/log_subscriber.rb:70:in `flush_all!'
/usr/local/bundle/gems/lograge-0.4.1/lib/lograge/rails_ext/rack/logger.rb:17:in `call_app'
/usr/local/bundle/gems/railties-5.0.6/lib/rails/rack/logger.rb:24:in `block in call'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/tagged_logging.rb:5:in `block in tagged'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/tagged_logging.rb:16:in `tagged'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/tagged_logging.rb:5:in `tagged'
/usr/local/bundle/gems/railties-5.0.6/lib/rails/rack/logger.rb:24:in `call'
/usr/local/bundle/gems/request_store-1.3.2/lib/request_store/middleware.rb:9:in `call'
/usr/local/bundle/gems/actionpack-5.0.6/lib/action_dispatch/middleware/request_id.rb:24:in `call'
/usr/local/bundle/gems/rack-2.0.3/lib/rack/runtime.rb:22:in `call'
/usr/local/bundle/gems/rack-timeout-0.4.2/lib/rack/timeout/core.rb:122:in `block in call'
/usr/local/bundle/gems/rack-timeout-0.4.2/lib/rack/timeout/support/timeout.rb:19:in `timeout'
/usr/local/bundle/gems/rack-timeout-0.4.2/lib/rack/timeout/core.rb:121:in `call'
/usr/local/bundle/gems/activesupport-5.0.6/lib/active_support/cache/strategy/local_cache_middleware.rb:28:in `call'
/usr/local/bundle/gems/actionpack-5.0.6/lib/action_dispatch/middleware/executor.rb:12:in `call'
/usr/local/bundle/gems/rack-2.0.3/lib/rack/sendfile.rb:111:in `call'
/usr/local/bundle/gems/ddtrace-0.8.2/lib/ddtrace/contrib/rack/middlewares.rb:89:in `call'
/usr/local/bundle/gems/railties-5.0.6/lib/rails/engine.rb:522:in `call'
/usr/local/bundle/gems/puma-3.7.0/lib/puma/configuration.rb:226:in `call'
/usr/local/bundle/gems/puma-3.7.0/lib/puma/server.rb:578:in `handle_request'
/usr/local/bundle/gems/puma-3.7.0/lib/puma/server.rb:415:in `process_client'
/usr/local/bundle/gems/puma-3.7.0/lib/puma/server.rb:275:in `block in run'
/usr/local/bundle/gems/puma-3.7.0/lib/puma/thread_pool.rb:120:in `block in spawn_thread'
E, [2017-11-20T15:50:41.591086 #6] ERROR -- : [LogStashLogger::Device::TCP] Errno::ETIMEDOUT - Connection timed out

In this case I am not using the latest gems, but the problem seems similar: the TCP connection times out, and the request itself times out because the flush is synchronous(?). We are using logz.io as the logging service; I wonder if this is caused by them not being highly available, or by network errors.

cabello commented 6 years ago

I changed from TCP to UDP and the problem went away. Then I ran another experiment: I kept the TCP logging but changed the Puma configuration from clustered mode (multiple workers) to single mode (single worker, multiple threads), and the problem also disappeared.

Because my EC2 instances are all single core, I am happier with the TCP plus single worker combo.
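
For reference, the single-mode change boils down to dropping the workers directive from puma.rb. A sketch (an assumption about this particular app's config): without a workers call Puma never forks, so the preloaded buffer thread is never shared across processes.

# config/puma.rb (sketch)
threads 5, 5
# workers 2      # removed: without workers, Puma runs in single mode and never forks
# preload_app!   # only meaningful in cluster mode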

dwbutler commented 6 years ago

@cabello If you look carefully at the stacktrace above, you'll see that Lograge is involved here. The specific line of code is here: https://github.com/roidrage/lograge/blob/ced03c55419b7f131706b476b26be8c5cd90041b/lib/lograge/rails_ext/rack/logger.rb#L17

After each request, Lograge is forcing a flush. This is done in the foreground (application) thread. Normally, LogStashLogger will do regular flushes from a background thread in order to avoid interfering with the application thread and potentially blocking it.

It doesn't surprise me that UDP doesn't suffer from the same issue, because UDP is designed to not wait for a response.

dwbutler commented 6 years ago

I wasn't aware of this, but apparently Rails::Rack::Logger is actually designed to flush all logs after every request. See https://github.com/rails/rails/blob/010e246756c09f44e901f4fd8e8eab2cb3022e95/railties/lib/rails/rack/logger.rb#L43

In that case, LogstashLogger::Device::Connectable#flush ought to be designed to flush in the background by default.

dwbutler commented 6 years ago

After looking at the code some more, I realized that just calling reconnect will not actually clean up the timer thread on the buffer. I exposed a new reset method in #138 which should actually initialize a new timer thread (as well as cleaning up any other resources allocated by the logger).

As for the issue with flush, my thought is to have a configuration option which disables manual flushing for buffered devices. If this option is turned on, calling flush will have no effect. The flush will happen anyway on a timer.
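
For reference, a minimal sketch of wiring the new reset call into a forking server, assuming Rails.logger is (or delegates to) the LogStashLogger instance:

# config/puma.rb
on_worker_boot do
  Rails.logger.reset   # drops the buffer/timer thread inherited from the preloaded parent and reconnects
end

# Passenger equivalent
::PhusionPassenger.on_event(:starting_worker_process) do |forked|
  Rails.logger.reset if forked
end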

esen commented 6 years ago

I'm having a problem even with UDP. Here are the logs when I start the web server:

DEBUG [939165e8] [6147] Puma starting in cluster mode...
[6147] * Version 3.10.0 (ruby 2.3.1-p112), codename: Russell's Teapot
[6147] * Min threads: 10, max threads: 10
[6147] * Environment: staging
[6147] * Process workers: 1
DEBUG [939165e8] [6147] * Listening on tcp://0.0.0.0:9292
DEBUG [939165e8] [6147] ! WARNING: Detected 3 Thread(s) started in app boot:
DEBUG [939165e8] [6147] ! #<Thread:0x00000006123940@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:299 sleep> - /home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:301:in `sleep'
DEBUG [939165e8] [6147] ! #<Thread:0x00000006f31910@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/mongo-2.4.3/lib/mongo/server/monitor.rb:121 sleep> - /home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/mongo-2.4.3/lib/mongo/server/monitor.rb:123:in `sleep'
DEBUG [939165e8] [6147] ! #<Thread:0x00000006f31690@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/mongo-2.4.3/lib/mongo/cluster/cursor_reaper.rb:166 sleep> - /home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/mongo-2.4.3/lib/mongo/cluster/cursor_reaper.rb:168:in `sleep'
DEBUG [939165e8] [6147] * Daemonizing...

As you can see, there are 2 workers and the application is preloaded. I'm using the 'lograge' gem, which uses logstash-logger to ship the Rails API logs to Logstash. I put a Rails.logger.info('logger is: ' + logger.inspect.to_s) line above the following line:

https://github.com/roidrage/lograge/blob/master/lib/lograge/log_subscriber.rb#L16

What I see is

I, [2018-03-16T05:44:44.502472 #6444] INFO -- : logger is: #<Logger:0x00000006123760 @progname=nil, @level=0, @default_formatter=#<Logger::Formatter:0x00000006123738 @datetime_format=nil>, @formatter=#<LogStashLogger::Formatter::JsonLines:0x000000039a1a20 @customize_event=nil, @datetime_format=nil>, @logdev=#<Logger::LogDevice:0x000000061236e8 @shift_size=nil, @shift_age=nil, @filename=nil, @dev=#<LogStashLogger::Device::UDP:0x00000006123dc8 @sync=nil, @error_logger=#<Logger:0x00000006123d50 @progname=nil, @level=0, @default_formatter=#<Logger::Formatter:0x00000006123d28 @datetime_format=nil>, @formatter=nil, @logdev=#<Logger::LogDevice:0x00000006123cd8 @shift_size=nil, @shift_age=nil, @filename=nil, @dev=#<IO:/home/deploy/apps/grouvi_api/shared/log/puma_error.log>, @mon_owner=nil, @mon_count=0, @mon_mutex=#<Thread::Mutex:0x00000006123cb0>>>, @buffer_group=nil, @buffer_max_items=nil, @buffer_max_interval=nil, @drop_messages_on_flush_error=false, @drop_messages_on_full_buffer=true, @buffer_flush_at_exit=true, @buffer_logger=nil, @buffer_config={:max_items=>50, :max_interval=>5, :logger=>nil, :autoflush=>nil, :has_on_flush_error=>false, :has_on_full_buffer_receive=>true, :drop_messages_on_flush_error=>false, :drop_messages_on_full_buffer=>true, :flush_at_exit=>true}, @pending_mutex=#<Thread::Mutex:0x00000006123a30>, @flush_mutex=#<Thread::Mutex:0x000000061239b8>, @flush_timer_thread=#<Thread:0x00000006123940@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:299 dead>, @buffer_state={:pending_items=>{}, :pending_count=>0, :pending_mutex=>#<Thread::Mutex:0x00000006123a30>, :outgoing_items=>{}, :outgoing_count=>0, :flush_mutex=>#<Thread::Mutex:0x000000061239b8>, :last_flush=>"2018-03-16T05:44:35.045+00:00", :timer=>#<Thread:0x00000006123940@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:299 dead>}, @port="5010", @host="hidden.host.name">, @mon_owner=nil, @mon_count=0, @mon_mutex=#<Thread::Mutex:0x000000061236c0>>, @device=#<LogStashLogger::Device::UDP:0x00000006123dc8 @sync=nil, @error_logger=#<Logger:0x00000006123d50 @progname=nil, @level=0, @default_formatter=#<Logger::Formatter:0x00000006123d28 @datetime_format=nil>, @formatter=nil, @logdev=#<Logger::LogDevice:0x00000006123cd8 @shift_size=nil, @shift_age=nil, @filename=nil, @dev=#<IO:/home/deploy/apps/grouvi_api/shared/log/puma_error.log>, @mon_owner=nil, @mon_count=0, @mon_mutex=#<Thread::Mutex:0x00000006123cb0>>>, @buffer_group=nil, @buffer_max_items=nil, @buffer_max_interval=nil, @drop_messages_on_flush_error=false, @drop_messages_on_full_buffer=true, @buffer_flush_at_exit=true, @buffer_logger=nil, @buffer_config={:max_items=>50, :max_interval=>5, :logger=>nil, :autoflush=>nil, :has_on_flush_error=>false, :has_on_full_buffer_receive=>true, :drop_messages_on_flush_error=>false, :drop_messages_on_full_buffer=>true, :flush_at_exit=>true}, @pending_mutex=#<Thread::Mutex:0x00000006123a30>, @flush_mutex=#<Thread::Mutex:0x000000061239b8>, @flush_timer_thread=#<Thread:0x00000006123940@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:299 dead>, @buffer_state={:pending_items=>{}, :pending_count=>0, :pending_mutex=>#<Thread::Mutex:0x00000006123a30>, :outgoing_items=>{}, :outgoing_count=>0, :flush_mutex=>#<Thread::Mutex:0x000000061239b8>, :last_flush=>"2018-03-16T05:44:35.045+00:00", 
:timer=>#<Thread:0x00000006123940@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:299 dead>}, @port="5010", @host="hidden.host.name">, @silencer=true>

What I discovered is that the logstash-logger object has dead threads by the time lograge tries to send logs, as you can see in the output above. The dead thread is the flush_timer_thread in LogStashLogger::Buffer, which is responsible for periodically flushing the IO buffer.

There are 3 possible workarounds I found:

  1. If you remove "preload_app!" from the puma.rb config file, those threads don't end up dead and logs are sent properly.

  2. In lograge, if you manually flush the buffer by putting logger.device.buffer_flush(force: true) after the following line, it starts sending logs: https://github.com/roidrage/lograge/blob/master/lib/lograge/log_subscriber.rb#L16

  3. In puma.rb, add the following to reset the buffer:

on_worker_boot do
  Lograge.logger.device.reset_buffer
end
dwbutler commented 6 years ago

@esen Do you have the configuration recommended in https://github.com/dwbutler/logstash-logger#cleaning-up-resources-when-forking ? This should be equivalent to your workaround #3.

esen commented 6 years ago

@dwbutler Yes, I tried that configuration, but got the following error:

undefined method 'reset' for #<ActiveSupport::Logger:0x00000006633f48> (NoMethodError)

I'm using Rails 5

dwbutler commented 6 years ago

@esen Is your Rails.logger a LogStashLogger? Are you using the built in Rails configuration DSL to set up your LogStashLogger?
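
For reference, the built-in Rails configuration DSL being referred to looks roughly like this in config/environments/*.rb (a sketch based on the README's Rails integration; exact option names may differ by version):

config.logstash.type = :udp
config.logstash.host = ENV["LOGSTASH_HOST"]  # hypothetical env var, for illustration
config.logstash.port = 5228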

esen commented 6 years ago

@dwbutler No, my logger is the Lograge logger, but it uses LogStashLogger to log to Logstash.

sheldonh commented 5 years ago

Even outside the Rails context, something is up with TCP logging. When the peer resets the connection for inactivity, LogstashLogger receives Errno::EPIPE. The events in the flush that induced the exception are lost instead of being duplicated.

This is really easy to test with logit.io, which has very aggressive client timeouts.

Edit: in the test code disclosed in this issue thread, a 1 second sleep has been used. That pretty much guarantees that the test will never run into disconnects.

Edit: I've followed up on #34, which is a better home for this.

brendenyule commented 4 years ago

> After looking at the code some more, I realized that just calling reconnect will not actually clean up the timer thread on the buffer. I exposed a new reset method in #138 which should actually initialize a new timer thread (as well as cleaning up any other resources allocated by the logger).
>
> As for the issue with flush, my thought is to have a configuration option which disables manual flushing for buffered devices. If this option is turned on, calling flush will have no effect. The flush will happen anyway on a timer.

@dwbutler Sorry for not being available to help with this earlier. Do you believe TCP logging will work correctly for us now? I can try the latest version on one of our servers and try to repro if you want. Thanks for all of the help, we really appreciate it.

dwbutler commented 4 years ago

> Do you believe TCP logging will work correctly for us now? I can try the latest version on one of our servers and try to repro if you want. Thanks for all of the help, we really appreciate it.

@brendenyule Yes, that would be great.

brendenyule commented 4 years ago

Alright, I will get this going when I catch a moment.