Erroring with "Too many open files" when good_job tries reconnecting to database

adibsaad commented 2 years ago

Problem

Database goes down indefinitely
GoodJob tries reconnecting periodically
Eventually it can't attempt reconnections because of this error

E, [2022-02-17T16:56:48.696603 #28524] ERROR -- : [GoodJob] Notifier errored: connection to server at "127.0.0.1", port 5432 failed: could not create socket: Too many open files

My guess is it's not closing the fail descriptor after failing to reconnect to the db.

Stack trace

/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_adapters/postgresql_adapter.rb:83:in `rescue in new_client'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_adapters/postgresql_adapter.rb:77:in `new_client'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_adapters/postgresql_adapter.rb:37:in `postgresql_connection'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:882:in `public_send'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:882:in `new_connection'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:926:in `checkout_new_connection'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:905:in `try_to_checkout_new_connection'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:866:in `acquire_connection'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:588:in `checkout'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:428:in `connection'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:1128:in `retrieve_connection'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_handling.rb:327:in `retrieve_connection'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/connection_handling.rb:283:in `connection'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation/delegation.rb:93:in `connection'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation/query_methods.rb:1394:in `preprocess_order_args'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation/query_methods.rb:360:in `order!'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation/query_methods.rb:355:in `order'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/good_job-2.9.6/lib/good_job/execution.rb:89:in `block in <class:Execution>'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation.rb:411:in `instance_exec'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation.rb:411:in `block in _exec_scope'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation.rb:804:in `_scoping'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation.rb:411:in `_exec_scope'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/scoping/named.rb:176:in `block in scope'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation/delegation.rb:67:in `block in priority_ordered'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation.rb:406:in `block in scoping'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation.rb:804:in `_scoping'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation.rb:406:in `scoping'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation/delegation.rb:67:in `priority_ordered'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/good_job-2.9.6/lib/good_job/execution.rb:157:in `perform_with_advisory_lock'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation/delegation.rb:67:in `block in perform_with_advisory_lock'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation.rb:406:in `block in scoping'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation.rb:804:in `_scoping'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation.rb:406:in `scoping'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activerecord-6.1.4.4/lib/active_record/relation/delegation.rb:67:in `perform_with_advisory_lock'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/good_job-2.9.6/lib/good_job/job_performer.rb:27:in `next'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/good_job-2.9.6/lib/good_job/scheduler.rb:270:in `block (2 levels) in create_task'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activesupport-6.1.4.4/lib/active_support/execution_wrapper.rb:88:in `wrap'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activesupport-6.1.4.4/lib/active_support/reloader.rb:72:in `block in wrap'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activesupport-6.1.4.4/lib/active_support/execution_wrapper.rb:88:in `wrap'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/activesupport-6.1.4.4/lib/active_support/reloader.rb:71:in `wrap'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/good_job-2.9.6/lib/good_job/scheduler.rb:269:in `block in create_task'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/executor/safe_task_executor.rb:24:in `block in execute'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:47:in `block in synchronize'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:47:in `synchronize'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:47:in `synchronize'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/executor/safe_task_executor.rb:19:in `execute'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/ivar.rb:169:in `safe_execute'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/scheduled_task.rb:285:in `process_task'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/executor/timer_set.rb:98:in `block in ns_post_task'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:363:in `run_task'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:352:in `block (3 levels) in create_worker'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:335:in `loop'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:335:in `block (2 levels) in create_worker'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:334:in `catch'
/Users/adib/.rbenv/versions/2.7.4/lib/ruby/gems/2.7.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:334:in `block in create_worker'

bensheldon commented 2 years ago

@adibsaad thanks for reporting this! that's not expected. I'll see if I can figure out what's happening here.

Briefly poking at it, I expected there to be a problem with the Notifier holding open file descriptors, but it also appears that execution threads are as well. This is unexpected because the execution thread database connections are wholly managed by Rails' execution wrapper, which afaik are supposed to manage connection lifecycle.

bensheldon commented 2 years ago

Here is what I'm seeing running lsof -p [PID] with good_job executable (I'd expect similar when running async too):

ruby    50923 bensheldon   12u   IPv6 0x935f30b3386e4bab         0t0                 TCP localhost:53047->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   13u   IPv6 0x935f30b333d2596b         0t0                 TCP localhost:53048->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   14u   IPv6 0x935f30b333d2294b         0t0                 TCP localhost:53049->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   15u   IPv6 0x935f30b333d2302b         0t0                 TCP localhost:53051->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   16u   IPv6 0x935f30b333d2370b         0t0                 TCP localhost:53052->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   17u   IPv6 0x935f30b333d23deb         0t0                 TCP localhost:53053->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   18u   IPv6 0x935f30b3386c604b         0t0                 TCP localhost:53054->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   19u   IPv6 0x935f30b3386c3deb         0t0                 TCP localhost:53055->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   20u   IPv6 0x935f30b3386c294b         0t0                 TCP localhost:53056->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   21u   IPv6 0x935f30b3386c302b         0t0                 TCP localhost:53057->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   22u   IPv6 0x935f30b333d2b02b         0t0                 TCP localhost:53059->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   23u   IPv6 0x935f30b333d2b70b         0t0                 TCP localhost:53060->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   24u   IPv6 0x935f30b333d2bdeb         0t0                 TCP localhost:53061->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   25u   IPv6 0x935f30b333d2cbab         0t0                 TCP localhost:53066->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   26u   IPv6 0x935f30b333d2d28b         0t0                 TCP localhost:53067->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   27u   IPv6 0x935f30b333d2e04b         0t0                 TCP localhost:53069->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   28u   IPv6 0x935f30b333d2a94b         0t0                 TCP localhost:53070->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   29u   IPv6 0x935f30b3386dcbab         0t0                 TCP localhost:53071->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   30u   IPv6 0x935f30b3386dd96b         0t0                 TCP localhost:53072->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   31u   IPv6 0x935f30b3386da94b         0t0                 TCP localhost:53073->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   32u   IPv6 0x935f30b3386d84cb         0t0                 TCP localhost:53074->localhost:postgresql (CLOSED)
ruby    50923 bensheldon   33u   IPv6 0x935f30b3386c128b         0t0                 TCP localhost:53079->localhost:postgresql (CLOSED)

I'm a bit stumped because when looking at a normal rails s and trying to hit the webserver, I'm seeing the same Postgres error and similar stacktrace but it's not leaving behind a file descriptor.

bensheldon commented 2 years ago

Doing some more research and experimentation. It looks like the file descriptors are cleaned up when the socket objects are garbage collected:

❯ bin/rails c
Loading development environment (Rails 6.1.4.6)
[1] pry(main)> ObjectSpace.each_object(PG::Connection).count
=> 0
[2] pry(main)> ObjectSpace.each_object(BasicSocket).count
=> 0
[3] pry(main)> 10.times { ActiveRecord::Base.connection rescue nil }
=> 10
[4] pry(main)> ObjectSpace.each_object(PG::Connection).count
=> 10
[5] pry(main)> ObjectSpace.each_object(BasicSocket).count
=> 10
[6] pry(main)> GC.start
=> nil
[7] pry(main)> ObjectSpace.each_object(PG::Connection).count
=> 0
[8] pry(main)> ObjectSpace.each_object(BasicSocket).count
=> 0

I can't find an obvious way to clean up the file descriptors without manually triggering GC. When debugging this, I was frequently interrupted by the GC running (even in the middle of a binding.pry). There's a very hacky way to release the file descriptors, but it relies on deprecated methods and seems very OS/implementation specific:

[1] pry(main)> ActiveRecord::Base.connection
ActiveRecord::ConnectionNotEstablished: connection to server at "::1", port 5432 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?

from /Users/ben@codeforamerica.org/.rbenv/versions/3.0.3/lib/ruby/gems/3.0.0/gems/activerecord-6.1.4.6/lib/active_record/connection_adapters/postgresql_adapter.rb:83:in `rescue in new_client'
Caused by PG::ConnectionBad: connection to server at "::1", port 5432 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?

from /Users/ben@codeforamerica.org/.rbenv/versions/3.0.3/lib/ruby/gems/3.0.0/gems/pg-1.3.1/lib/pg/connection.rb:637:in `async_connect_or_reset'
[2] pry(main)> conn = ObjectSpace.each_object(PG::Connection).first
=> #<PG::Connection:0x000000013791bdc8>

# The `#socket` method is deprecated
[3] pry(main)> fd = conn.socket
=> 13
[4] pry(main)> io = IO.for_fd(fd)
=> #<IO:fd 13>
[5] pry(main)> io.close
=> nil
[6] pry(main)> io
=> #<IO:(closed)>

@adibsaad are you doing anything custom with the Ruby Garbage Collector? e.g. GC.disable or changing any of the GC constants?

adibsaad commented 2 years ago

Nope, not doing any GC magic. Forgot to mention, I'm on OSX Monterey.

bensheldon commented 2 years ago

It looks like the root cause here is pg. When pg fails to connect, it does not clean up the resulting connection. I will open a ticket upstream.

bensheldon commented 2 years ago

Tracking https://github.com/ged/ruby-pg/issues/439

bensheldon commented 2 years ago

Fix was released in pg v1.3.4.

bensheldon / good_job

Erroring with "Too many open files" when good_job tries reconnecting to database #530