Hi,
What version of LogStashLogger are you using? More recent versions will drop log messages when the buffer is full, by default. Older versions would block once the buffer got full.
We are using one of the more recent versions:
gem 'logstash-logger', '0.23.0'
We were using an older version, but upgraded when this issue came up.
I created a simple test rig to try to reproduce the problem, and could not. You can see my results here: https://gist.github.com/dwbutler/8f1a779c5b29055c9d1c2bb2d6d607fb
I ran the experiment for more than five minutes. I observed that the log file was still being written to, indicating that my Ruby process was not blocked.
The only unusual thing I noticed was the memory usage slowly growing over time. Since LogStashLogger will try to reconnect to the TCP server every 5 seconds, it's possible this is leading to some kind of memory growth.
It might be helpful for you to set a custom error_logger and buffer_logger. By default, errors are logged to STDERR. It would also be helpful if you could create a sample app that reproduces the issue.
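For reference, a minimal sketch of what that might look like with the Rails configuration DSL. The host and port are placeholders, and passing error_logger/buffer_logger through config.logstash is my assumption here (they are the option names mentioned above and appear in the device's inspect output later in this thread), so double-check against the README for your version:

# config/environments/production.rb -- sketch only; host/port are placeholders
config.logstash.type = :tcp
config.logstash.host = 'logstash.example.com'
config.logstash.port = 5228
config.logstash.error_logger  = Logger.new(Rails.root.join('log', 'logstash_errors.log'))
config.logstash.buffer_logger = Logger.new(Rails.root.join('log', 'logstash_buffer.log'))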
@brendenyule I ran my test rig again but this time replaced the single-threaded loop with a more realistic simulation of a production web server:
30.times do
  Thread.new do
    loop { logger.info("#{Thread.current.object_id} - #{Time.now}"); sleep(1) }
  end
end
I ran it for quite some time, but did not observe any deadlocking behavior.
It might help to know what version of Ruby, Passenger, and Rails you are running.
Since you say things work fine with UDP but not TCP, the only other thing I can think of is that Ruby's implementation of TCP sockets holds onto the file descriptor long after the connection is closed, whereas UDP sockets do not. You can check whether you're leaking connections / file descriptors by running lsof on your server. If you do find that TCP connections take a while before the OS actually gets rid of them, your only recourse might be to increase your file descriptor limit.
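If lsof isn't convenient, here is a rough, Linux-only sketch for watching descriptor growth from inside the Ruby process itself (just counting entries under /proc/self/fd); it's an illustration, not something LogStashLogger provides:

# Linux only: count this process's open file descriptors as a rough
# proxy for leaked TCP connections, and print the total to STDERR.
fds = Dir.glob("/proc/self/fd/*").size
warn "open file descriptors: #{fds}"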
LogStashLogger will keep trying to flush the buffer every 5 seconds (or even more frequently if you have a high volume of log messages), and it opens a new connection on every retry. It can be overly aggressive in this regard, and there is no exponential backoff in place. So it wouldn't surprise me if, with the OS taking (say) 20 minutes to actually get rid of a closed TCP connection, hundreds of zombie connections piled up in a matter of minutes. This could easily bump you over the default limit of 1024 file descriptors.
So it will probably also be helpful to know what OS you're running since this behavior might be OS specific. (I tested on a Macbook Pro.)
According to this, even after a TCP client has closed the connection, the OS will keep the connection open for 4 minutes in TIME_WAIT status, "just in case" any more packets arrive from the server, so they don't disrupt the next client that wants to use the same port.
Sorry for the wait.
We are using: Red Hat 4.8.3-9, Phusion Passenger 4.0.59, Rails 4, Ruby 2.2.3p173.
What mixture of processes vs. threads are you using? This would tell us potentially how many connections are being opened. And did you check the output of lsof to see if there were a large number of connections in TIME_WAIT status?
I think my app is showing some unexpected behaviour that is related to this issue.
We are seeing "log writing failed." messages in our logs, and they correlate with Rack::Timeout entries that we see in our tracing tool (Datadog).
The logs have the following:
source=rack-timeout id=828cc4688e4051bc81f242bbe0eeb251 timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms
source=rack-timeout id=b61da0d0d8fc1bd52b6fa52b1fc21510 timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms
source=rack-timeout id=35b9e0c9a21b043c561477e27a6fedc5 timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms
source=rack-timeout id=a77929294d3749bf6f0f087a26bd3bab timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms
source=rack-timeout id=47c345b3d51fb2fd6b59ebef9fa91bbb timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms
source=rack-timeout id=3cda3e9dc73c2719c3b1bdd67cf8df80 timeout=5000ms service=5000ms state=timed_out at=error
log writing failed. Request ran for longer than 5000ms
We are using TCP logging with Puma set to 2 workers, and this is the message displayed when booting up the server:
Puma starting in cluster mode...
* Version 3.10.0 (ruby 2.3.5-p376), codename: Russell's Teapot
* Min threads: 5, max threads: 5
* Environment: production
* Process workers: 2
* Preloading application
* Listening on tcp://0.0.0.0:3000
! WARNING: Detected 1 Thread(s) started in app boot:
! #<Thread:0x00000000039f2998@/usr/local/bundle/gems/logstash-logger-0.25.0/lib/logstash-logger/buffer.rb:297 sleep> - /usr/local/bundle/gems/logstash-logger-0.25.0/lib/logstash-logger/buffer.rb:299:in `sleep'
That points to https://github.com/dwbutler/logstash-logger/blob/9e8369e2d6a97ac45daf42506e045095e5744ba5/lib/logstash-logger/buffer.rb#L297
Even the simple HealthcheckController is timing out, and we are clueless about what's going on. I just changed it to UDP to see if that addresses the issue. The other thing I am considering is decreasing workers to 1 so Puma does not run in clustered mode.
Hopefully this data is useful when trying to find the origin of the issue. :)
Thanks for the detailed information. That makes me suspect that the LogStashLogger buffer is being initialized during app boot, but before a worker process is forked. This would create a background thread. Then, the app server is forking a worker process but reusing the same Thread. This would be a Bad Thing. Mixing threading and forking always results in all kinds of misbehavior.
It would probably be best to reinitialize LogStashLogger after a worker process forks. I believe this can be accomplished in the following way (assuming Rails):
Passenger:
::PhusionPassenger.on_event(:starting_worker_process) do |forked|
Rails.logger.reconnect
end
Puma:
# In config/puma.rb
on_worker_boot do
Rails.logger.reconnect
end
Let me know if this helps resolve the issue.
@dwbutler thanks for the quick reply, I will try this soon. I just got a stack trace from another service I also maintain:
E, [2017-11-20T15:35:31.255065 #8] ERROR -- : [LogStashLogger::Device::TCP] Errno::ETIMEDOUT - Connection timed out
2017-11-20 15:35:31 +0000: Rack app error handling request { GET / }
#<Rack::Timeout::RequestTimeoutError: Request ran for longer than 15000ms>
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/tcp.rb:32:in `initialize'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/tcp.rb:32:in `new'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/tcp.rb:32:in `non_ssl_connect'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/tcp.rb:23:in `connect'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/connectable.rb:117:in `with_connection'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/connectable.rb:100:in `write_batch'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/connectable.rb:67:in `flush'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/buffer.rb:243:in `block in buffer_flush'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/buffer.rb:240:in `each'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/buffer.rb:240:in `buffer_flush'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/device/connectable.rb:64:in `flush'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/logger.rb:20:in `flush'
/usr/local/bundle/gems/activesupport-5.0.6/lib/active_support/log_subscriber.rb:70:in `flush_all!'
/usr/local/bundle/gems/lograge-0.4.1/lib/lograge/rails_ext/rack/logger.rb:17:in `call_app'
/usr/local/bundle/gems/railties-5.0.6/lib/rails/rack/logger.rb:24:in `block in call'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/tagged_logging.rb:5:in `block in tagged'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/tagged_logging.rb:16:in `tagged'
/usr/local/bundle/gems/logstash-logger-0.22.1/lib/logstash-logger/tagged_logging.rb:5:in `tagged'
/usr/local/bundle/gems/railties-5.0.6/lib/rails/rack/logger.rb:24:in `call'
/usr/local/bundle/gems/request_store-1.3.2/lib/request_store/middleware.rb:9:in `call'
/usr/local/bundle/gems/actionpack-5.0.6/lib/action_dispatch/middleware/request_id.rb:24:in `call'
/usr/local/bundle/gems/rack-2.0.3/lib/rack/runtime.rb:22:in `call'
/usr/local/bundle/gems/rack-timeout-0.4.2/lib/rack/timeout/core.rb:122:in `block in call'
/usr/local/bundle/gems/rack-timeout-0.4.2/lib/rack/timeout/support/timeout.rb:19:in `timeout'
/usr/local/bundle/gems/rack-timeout-0.4.2/lib/rack/timeout/core.rb:121:in `call'
/usr/local/bundle/gems/activesupport-5.0.6/lib/active_support/cache/strategy/local_cache_middleware.rb:28:in `call'
/usr/local/bundle/gems/actionpack-5.0.6/lib/action_dispatch/middleware/executor.rb:12:in `call'
/usr/local/bundle/gems/rack-2.0.3/lib/rack/sendfile.rb:111:in `call'
/usr/local/bundle/gems/ddtrace-0.8.2/lib/ddtrace/contrib/rack/middlewares.rb:89:in `call'
/usr/local/bundle/gems/railties-5.0.6/lib/rails/engine.rb:522:in `call'
/usr/local/bundle/gems/puma-3.7.0/lib/puma/configuration.rb:226:in `call'
/usr/local/bundle/gems/puma-3.7.0/lib/puma/server.rb:578:in `handle_request'
/usr/local/bundle/gems/puma-3.7.0/lib/puma/server.rb:415:in `process_client'
/usr/local/bundle/gems/puma-3.7.0/lib/puma/server.rb:275:in `block in run'
/usr/local/bundle/gems/puma-3.7.0/lib/puma/thread_pool.rb:120:in `block in spawn_thread'
E, [2017-11-20T15:50:41.591086 #6] ERROR -- : [LogStashLogger::Device::TCP] Errno::ETIMEDOUT - Connection timed out
In this case I am not using the latest gems, but the problem seems similar: the TCP connection times out, and the request itself times out because the flush is synchronous(?). We are using logz.io as the logging service; I wonder if this is caused by them not being highly available, or by network errors.
I changed from TCP to UDP and the problem went away. Then I ran another experiment: I kept the TCP logging but changed the Puma configuration from clustered (multiple workers) to single mode (single worker, multiple threads), and the problem also disappeared.
Because my EC2 instances are all single core, I am happier with the TCP plus single-worker combo.
@cabello If you look carefully at the stacktrace above, you'll see that Lograge is involved here. The specific line of code is here: https://github.com/roidrage/lograge/blob/ced03c55419b7f131706b476b26be8c5cd90041b/lib/lograge/rails_ext/rack/logger.rb#L17
After each request, Lograge is forcing a flush. This is done in the foreground (application) thread. Normally, LogStashLogger will do regular flushes from a background thread in order to avoid interfering with the application thread and potentially blocking it.
It doesn't surprise me that UDP doesn't suffer from the same issue, because UDP is designed to not wait for a response.
I wasn't aware of this, but apparently Rails::Rack::Logger is actually designed to flush all logs after every request. See https://github.com/rails/rails/blob/010e246756c09f44e901f4fd8e8eab2cb3022e95/railties/lib/rails/rack/logger.rb#L43
In that case, LogstashLogger::Device::Connectable#flush ought to be designed to flush in the background by default.
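As a rough illustration of that idea (not the gem's actual implementation), an explicit flush call could be made non-blocking by handing the work off to another thread, along these lines:

# Sketch only: hand explicit flush calls to a background thread so the
# request thread never waits on the TCP connection. Prepending this module
# to the device class is a hypothetical opt-in, not an existing option.
module NonBlockingFlush
  def flush(*args)
    Thread.new { super(*args) }
    nil
  end
end
# LogStashLogger::Device::Connectable.prepend(NonBlockingFlush)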
After looking at the code some more, I realized that just calling reconnect will not actually clean up the timer thread on the buffer. I exposed a new reset method in #138 which should actually initialize a new timer thread (as well as cleaning up any other resources allocated by the logger).
As for the issue with flush, my thought is to have a configuration option which disables manual flushing for buffered devices. If this option is turned on, calling flush will have no effect; the flush will happen anyway on a timer.
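Assuming Rails.logger is the LogStashLogger instance, the earlier Puma hook would then become something like the following (a sketch based on the new reset method, not a confirmed recipe for every setup):

# config/puma.rb -- call reset instead of reconnect so the buffer's
# timer thread is recreated in the forked worker.
on_worker_boot do
  Rails.logger.reset
end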
I'm having a problem even with UDP. Here are logs when I start a web server:
DEBUG [939165e8] [6147] Puma starting in cluster mode...
[6147] * Version 3.10.0 (ruby 2.3.1-p112), codename: Russell's Teapot
[6147] * Min threads: 10, max threads: 10
[6147] * Environment: staging
[6147] * Process workers: 1
DEBUG [939165e8] [6147] * Listening on tcp://0.0.0.0:9292
DEBUG [939165e8] [6147] ! WARNING: Detected 3 Thread(s) started in app boot:
DEBUG [939165e8] [6147] ! #<Thread:0x00000006123940@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:299 sleep> - /home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:301:in `sleep'
DEBUG [939165e8] [6147] ! #<Thread:0x00000006f31910@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/mongo-2.4.3/lib/mongo/server/monitor.rb:121 sleep> - /home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/mongo-2.4.3/lib/mongo/server/monitor.rb:123:in `sleep'
DEBUG [939165e8] [6147] ! #<Thread:0x00000006f31690@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/mongo-2.4.3/lib/mongo/cluster/cursor_reaper.rb:166 sleep> - /home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/mongo-2.4.3/lib/mongo/cluster/cursor_reaper.rb:168:in `sleep'
DEBUG [939165e8] [6147] * Daemonizing...
As you can see, there are 2 workers and the application is preloaded.
I'm using 'lograge' gem, which uses logstash-logger to log rails API logs to logstash.
I put the line Rails.logger.info('logger is: ' + logger.inspect.to_s) above the following line: https://github.com/roidrage/lograge/blob/master/lib/lograge/log_subscriber.rb#L16
What I see is:
I, [2018-03-16T05:44:44.502472 #6444] INFO -- : logger is: #<Logger:0x00000006123760 @progname=nil, @level=0, @default_formatter=#<Logger::Formatter:0x00000006123738 @datetime_format=nil>, @formatter=#<LogStashLogger::Formatter::JsonLines:0x000000039a1a20 @customize_event=nil, @datetime_format=nil>, @logdev=#<Logger::LogDevice:0x000000061236e8 @shift_size=nil, @shift_age=nil, @filename=nil, @dev=#<LogStashLogger::Device::UDP:0x00000006123dc8 @sync=nil, @error_logger=#<Logger:0x00000006123d50 @progname=nil, @level=0, @default_formatter=#<Logger::Formatter:0x00000006123d28 @datetime_format=nil>, @formatter=nil, @logdev=#<Logger::LogDevice:0x00000006123cd8 @shift_size=nil, @shift_age=nil, @filename=nil, @dev=#<IO:/home/deploy/apps/grouvi_api/shared/log/puma_error.log>, @mon_owner=nil, @mon_count=0, @mon_mutex=#<Thread::Mutex:0x00000006123cb0>>>, @buffer_group=nil, @buffer_max_items=nil, @buffer_max_interval=nil, @drop_messages_on_flush_error=false, @drop_messages_on_full_buffer=true, @buffer_flush_at_exit=true, @buffer_logger=nil, @buffer_config={:max_items=>50, :max_interval=>5, :logger=>nil, :autoflush=>nil, :has_on_flush_error=>false, :has_on_full_buffer_receive=>true, :drop_messages_on_flush_error=>false, :drop_messages_on_full_buffer=>true, :flush_at_exit=>true}, @pending_mutex=#<Thread::Mutex:0x00000006123a30>, @flush_mutex=#<Thread::Mutex:0x000000061239b8>, @flush_timer_thread=#<Thread:0x00000006123940@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:299 dead>, @buffer_state={:pending_items=>{}, :pending_count=>0, :pending_mutex=>#<Thread::Mutex:0x00000006123a30>, :outgoing_items=>{}, :outgoing_count=>0, :flush_mutex=>#<Thread::Mutex:0x000000061239b8>, :last_flush=>"2018-03-16T05:44:35.045+00:00", :timer=>#<Thread:0x00000006123940@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:299 dead>}, @port="5010", @host="hidden.host.name">, @mon_owner=nil, @mon_count=0, @mon_mutex=#<Thread::Mutex:0x000000061236c0>>, @device=#<LogStashLogger::Device::UDP:0x00000006123dc8 @sync=nil, @error_logger=#<Logger:0x00000006123d50 @progname=nil, @level=0, @default_formatter=#<Logger::Formatter:0x00000006123d28 @datetime_format=nil>, @formatter=nil, @logdev=#<Logger::LogDevice:0x00000006123cd8 @shift_size=nil, @shift_age=nil, @filename=nil, @dev=#<IO:/home/deploy/apps/grouvi_api/shared/log/puma_error.log>, @mon_owner=nil, @mon_count=0, @mon_mutex=#<Thread::Mutex:0x00000006123cb0>>>, @buffer_group=nil, @buffer_max_items=nil, @buffer_max_interval=nil, @drop_messages_on_flush_error=false, @drop_messages_on_full_buffer=true, @buffer_flush_at_exit=true, @buffer_logger=nil, @buffer_config={:max_items=>50, :max_interval=>5, :logger=>nil, :autoflush=>nil, :has_on_flush_error=>false, :has_on_full_buffer_receive=>true, :drop_messages_on_flush_error=>false, :drop_messages_on_full_buffer=>true, :flush_at_exit=>true}, @pending_mutex=#<Thread::Mutex:0x00000006123a30>, @flush_mutex=#<Thread::Mutex:0x000000061239b8>, @flush_timer_thread=#<Thread:0x00000006123940@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:299 dead>, @buffer_state={:pending_items=>{}, :pending_count=>0, :pending_mutex=>#<Thread::Mutex:0x00000006123a30>, :outgoing_items=>{}, :outgoing_count=>0, :flush_mutex=>#<Thread::Mutex:0x000000061239b8>, :last_flush=>"2018-03-16T05:44:35.045+00:00", 
:timer=>#<Thread:0x00000006123940@/home/deploy/apps/grouvi_api/shared/bundle/ruby/2.3.0/gems/logstash-logger-0.26.1/lib/logstash-logger/buffer.rb:299 dead>}, @port="5010", @host="hidden.host.name">, @silencer=true>
What I discovered is that the logstash-logger object has dead threads by the time lograge tries to send logs, as you can see in the output above. That dead thread is the flush_timer_thread in the LogStashLogger::Buffer module, which is responsible for periodically flushing the IO buffer.
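For anyone wanting to confirm the same thing, a small diagnostic sketch (assuming, as in the inspect output above, that the Lograge logger's device is the buffered LogStashLogger device):

# Check whether the buffer's timer thread survived the fork.
timer = Lograge.logger.device.instance_variable_get(:@flush_timer_thread)
Rails.logger.info("flush timer thread alive? #{timer && timer.alive?}")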
There are 3 possible workarounds I found:
1. If you remove preload_app! from the puma.rb config file, those threads don't die and logs are sent properly.
2. In lograge, if you manually flush the buffer by putting logger.device.buffer_flush(force: true) after the following line, it starts sending logs (see the sketch after this list): https://github.com/roidrage/lograge/blob/master/lib/lograge/log_subscriber.rb#L16
3. In puma.rb, add the following to reset the buffer:
on_worker_boot do
Lograge.logger.device.reset_buffer
end
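A sketch of workaround 2 above, assuming it runs right after lograge writes its event (the line linked in item 2) and that logger is the LogStashLogger instance:

logger = Lograge.logger                       # assumed to be a LogStashLogger
logger.info("lograge event (illustrative)")   # stands in for lograge's own write
logger.device.buffer_flush(force: true)       # force the buffered device to flush now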
@esen Do you have the configuration recommended in https://github.com/dwbutler/logstash-logger#cleaning-up-resources-when-forking ? This should be equivalent to your workaround #3.
@dwbutler Yes, I tried that configuration, but got the following error:
undefined method 'reset' for #<ActiveSupport::Logger:0x00000006633f48> (NoMethodError)
I'm using Rails 5
@esen Is your Rails.logger a LogStashLogger? Are you using the built in Rails configuration DSL to set up your LogStashLogger?
@dwbutler No, my logger is the Lograge logger, but it uses LogStashLogger to log to logstash.
Even outside the Rails context, something is up with TCP logging. When the peer resets the connection for inactivity, LogstashLogger receives Errno::EPIPE. The events in the flush that induced the exception are lost instead of being duplicated.
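For what it's worth, the buffer options visible in the inspect output earlier in this thread look like the relevant knobs; a sketch of pinning them explicitly (assuming the Rails DSL passes them through to the device, which I haven't verified):

# drop_messages_on_flush_error = false is supposed to keep a failed batch
# in the buffer for a later retry rather than discarding it.
config.logstash.type = :tcp
config.logstash.drop_messages_on_flush_error = false
config.logstash.drop_messages_on_full_buffer = false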
This is really easy to test with logit.io, which has very aggressive client timeouts.
Edit: the test code disclosed earlier in this issue thread uses a 1 second sleep, which pretty much guarantees that the test will never run into disconnects.
Edit: I've followed up on #34, which is a better home for this.
@dwbutler sorry for not being available to help with this earlier. Do you believe TCP logging will work correctly for us now? I can try the latest version on one of our servers and try to reproduce the issue if you want. Thanks for all of the help, we really appreciate it.
@brendenyule Yes, that would be great.
Alright, I will get this going when I catch a moment.
If we stop our ELK server while our app is running, the app eventually stops working altogether after about 5 minutes. We are using Passenger Standalone, and it doesn't even seem to notice that anything is wrong; we don't see any helpful errors anywhere.
If we use UDP, an ELK outage is handled gracefully.
Here is our configuration for our QA environment: