bensheldon / good_job

Multithreaded, Postgres-based, Active Job backend for Ruby on Rails.
https://goodjob-demo.herokuapp.com/
MIT License

Notifier keepalives #1417

Open bensheldon opened 1 month ago

bensheldon commented 1 month ago

tldr: maybe GoodJob's Notifier should emit a no-op SQL query every N seconds to keep the connection alive. Also, here's the report that prompted this:

Does anyone have experience with good_job deployed on Fly.io? Here's my problem: I have an app deployed there with good_job running on it. After 30 or so minutes, the process crashes with this error:

```
PQsocket() can't get socket descriptor
...
PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly (PG::ConnectionBad)
```

I read that other people have had the same issue, and it seems Fly's Postgres proxy has a ~30min timeout or something like that and kills idle connections (e.g. when no job ran in the past 30min). I worked around it by enqueuing a cron job that collects some pg metrics every 10min. Now, after a couple of days with clear skies, this error popped back up and the process crashed... except it didn't? The Processes tab shows no processes at all, but all my jobs are up and running, being enqueued, and everything is working as expected.

Related: https://community.fly.io/t/postgres-dropping-connection-after-30s-when-using-listen-notify/12188
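For context, a rough sketch of that workaround using GoodJob's built-in cron. The job class name comes from the logs below; the 10-minute schedule mirrors the report and is otherwise arbitrary, so treat this as illustrative rather than the reporter's exact configuration:

```ruby
# config/initializers/good_job.rb
# Sketch of the workaround: schedule a cheap job through GoodJob's cron so the
# Postgres connection never sits idle long enough for a proxy to drop it.
Rails.application.configure do
  config.good_job.enable_cron = true
  config.good_job.cron = {
    query_stats: {
      cron: "*/10 * * * *",          # every 10 minutes (assumed interval)
      class: "CaptureQueryStatsJob"  # any job that touches Postgres will do
    }
  }
end
```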

I noticed the sequence of operations in the Notifier is roughly: check out a dedicated connection and LISTEN, loop on wait_for_notify, then UNLISTEN and return the connection (in an ensure block).

So if the connection is dead, GoodJob may still be trying to run those later operations on it, which makes things worse. If the connection raises ConnectionBad, GoodJob should probably just throw the connection away rather than try to clean it up.

But the underlying issue here is...

When the Postgres connection is in LISTEN mode and just looping with wait_for_notify(1.second), the connection doesn't look active to the database/proxy, so it closes the connection after 60 seconds. So maybe GoodJob should do something dumb like run SELECT 1 as one every 15 seconds or so, which would be enough to keep the connection from being cut. Does that seem valid?
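A minimal sketch of that idea, written against the pg gem's API rather than GoodJob's actual Notifier internals. The channel name, the intervals, and the discard-on-ConnectionBad handling are all assumptions for illustration; it also assumes a booted Rails app with Active Record connected to Postgres:

```ruby
# Illustrative keepalive loop, not GoodJob's implementation.
KEEPALIVE_INTERVAL = 15 # seconds between no-op queries (assumed value)
WAIT_INTERVAL      = 1  # seconds to block in wait_for_notify

connection = ActiveRecord::Base.connection_pool.checkout
raw = connection.raw_connection # underlying PG::Connection

begin
  raw.async_exec("LISTEN good_job") # channel name is illustrative
  last_keepalive = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  loop do
    # Block briefly waiting for NOTIFY payloads.
    raw.wait_for_notify(WAIT_INTERVAL) do |channel, _pid, payload|
      puts "received #{payload} on #{channel}"
    end

    # Periodically run a no-op query so the connection doesn't look idle
    # to the database or an intermediate proxy.
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if now - last_keepalive >= KEEPALIVE_INTERVAL
      raw.async_exec("SELECT 1 AS one")
      last_keepalive = now
    end
  end
rescue PG::ConnectionBad
  # The socket is gone; don't try to UNLISTEN on it, just drop the connection.
  ActiveRecord::Base.connection_pool.remove(connection)
  connection = nil
ensure
  if connection
    raw.async_exec("UNLISTEN *")
    ActiveRecord::Base.connection_pool.checkin(connection)
  end
end
```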

luizkowalski commented 1 month ago

After 4 days of running without issues on fly.io, it happened again (and it is happening every hour)

Opened a thread on Fly's community forum

luizkowalski commented 1 month ago

I have some more findings that I think are worth mentioning.

So, I have a cron task that runs every 30min and collects some data. Now, back to square one: the processes "died", disappearing from the list but still running. What I did then was stop the machine on Fly and restart it, which brought the process back up on the list, but strangely enough I'm now seeing these errors too:

2024-07-16T11:30:00.009 app[28674d2c9214e8] ams [info] E, [2024-07-16T11:30:00.009660 #314] ERROR -- : [ActiveJob] Failed enqueuing CaptureQueryStatsJob to GoodJob(default): ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_good_jobs_on_cron_key_and_cron_at_cond"

It's almost like there's a phantom process still picking up the cron task... I can't say for sure, because Fly always uses PID 314, so that's what's always printed in the logs.

Now, this is interesting: I mentioned the job runs every 30min, and looking at the timestamps of things:

2024-07-16T13:30:00.059 app[28674d2c9214e8] ams [info] I, [2024-07-16T13:30:00.059861 #314] INFO -- : GoodJob destroyed 0 preserved job execution records finished before 2024-07-02 15:30:00 +0200.

2024-07-16T13:59:59.975 app[28674d2c9214e8] ams [info] I, [2024-07-16T13:59:59.975112 #314] INFO -- : Notifier unsubscribed with UNLISTEN

2024-07-16T13:59:59.979 app[28674d2c9214e8] ams [info] I, [2024-07-16T13:59:59.979226 #314] INFO -- : [ActiveJob] Enqueued CaptureQueryStatsJob (Job ID: b3701d5e-ef07-4041-b907-7b470971c67b) to GoodJob(default)

2024-07-16T13:59:59.986 app[28674d2c9214e8] ams [info] E, [2024-07-16T13:59:59.981809 #314] ERROR -- : Notifier errored: ActiveRecord::ConnectionNotEstablished: PQsocket() can't get socket descriptor

2024-07-16T13:59:59.986 app[28674d2c9214e8] ams [info] ["/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/postgresql/database_statements.rb:56:in `exec'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/postgresql/database_statements.rb:56:in `block (2 levels) in raw_execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb:1004:in `block in with_raw_connection'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/concurrency/null_lock.rb:9:in `synchronize'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb:976:in `with_raw_connection'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/postgresql/database_statements.rb:55:in `block in raw_execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/notifications/instrumenter.rb:58:in `instrument'", "/rails/vendor/ruby/3.3.0/gems/sentry-rails-5.18.1/lib/sentry/rails/tracing.rb:54:in `instrument'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb:1119:in `log'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/postgresql/database_statements.rb:54:in `raw_execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract/database_statements.rb:538:in `internal_execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract/database_statements.rb:137:in `execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/abstract/query_cache.rb:26:in `execute'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activerecord/lib/active_record/connection_adapters/postgresql/database_statements.rb:48:in `execute'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:239:in `block (5 levels) in create_listen_task'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/notifications.rb:210:in `block in instrument'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/notifications/instrumenter.rb:58:in `instrument'", "/rails/vendor/ruby/3.3.0/gems/sentry-rails-5.18.1/lib/sentry/rails/tracing.rb:54:in `instrument'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/notifications.rb:210:in `instrument'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:237:in `block (4 levels) in create_listen_task'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/callbacks.rb:110:in `run_callbacks'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:235:in `block (3 levels) in create_listen_task'", "/rails/vendor/ruby/3.3.0/bundler/gems/rails-f74dc0d8aacf/activesupport/lib/active_support/execution_wrapper.rb:91:in `wrap'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:234:in `ensure in block (2 levels) in create_listen_task'", 
"/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:234:in `block (2 levels) in create_listen_task'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:260:in `with_connection'", "/rails/vendor/ruby/3.3.0/gems/good_job-4.0.3/lib/good_job/notifier.rb:201:in `block in create_listen_task'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/safe_task_executor.rb:24:in `block in execute'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `block in synchronize'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `synchronize'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/synchronization/mutex_lockable_object.rb:48:in `synchronize'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/safe_task_executor.rb:22:in `execute'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/ivar.rb:170:in `safe_execute'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/scheduled_task.rb:298:in `process_task'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/timer_set.rb:98:in `block in ns_post_task'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:359:in `run_task'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:350:in `block (3 levels) in create_worker'", "<internal:kernel>:187:in `loop'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:341:in `block (2 levels) in create_worker'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:340:in `catch'", "/rails/vendor/ruby/3.3.0/gems/concurrent-ruby-1.3.3/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:340:in `block in create_worker'"]

2024-07-16T14:00:00.009 app[28674d2c9214e8] ams [info] E, [2024-07-16T14:00:00.009459 #314] ERROR -- : [ActiveJob] Failed enqueuing CaptureQueryStatsJob to GoodJob(default): ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_good_jobs_on_cron_key_and_cron_at_cond"

2024-07-16T14:00:00.009 app[28674d2c9214e8] ams [info] DETAIL: Key (cron_key, cron_at)=(query_stats, 2024-07-16 14:00:00) already exists.

2024-07-16T14:00:00.009 app[28674d2c9214e8] ams [info] )

2024-07-16T14:00:04.993 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:00:04.993913 #314] INFO -- : Notifier subscribed with LISTEN

2024-07-16T14:00:08.528 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:00:08.528055 #314] INFO -- : [ActiveJob] [CaptureQueryStatsJob] [b3701d5e-ef07-4041-b907-7b470971c67b] Performing CaptureQueryStatsJob (Job ID: b3701d5e-ef07-4041-b907-7b470971c67b) from GoodJob(default) enqueued at 2024-07-16T13:59:59.966956766Z

2024-07-16T14:00:08.575 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:00:08.575681 #314] INFO -- : [ActiveJob] [CaptureQueryStatsJob] [b3701d5e-ef07-4041-b907-7b470971c67b] Performed CaptureQueryStatsJob (Job ID: b3701d5e-ef07-4041-b907-7b470971c67b) from GoodJob(default) in 47.68ms

2024-07-16T14:00:08.580 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:00:08.576431 #314] INFO -- : Executed GoodJob b3701d5e-ef07-4041-b907-7b470971c67b

2024-07-16T14:00:08.591 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:00:08.591800 #314] INFO -- : GoodJob destroyed 0 preserved job execution records finished before 2024-07-02 16:00:08 +0200.

2024-07-16T14:29:59.973 app[28674d2c9214e8] ams [info] D, [2024-07-16T14:29:59.973645 #314] DEBUG -- : Notifier received payload: {"queue_name":"default","scheduled_at":"2024-07-16T16:29:59.967+02:00"}

2024-07-16T14:29:59.975 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:29:59.975119 #314] INFO -- : [ActiveJob] Enqueued CaptureQueryStatsJob (Job ID: c1a9c689-f2b8-4576-aeba-7202d3072e8e) to GoodJob(default)

2024-07-16T14:29:59.988 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:29:59.987955 #314] INFO -- : [ActiveJob] [CaptureQueryStatsJob] [c1a9c689-f2b8-4576-aeba-7202d3072e8e] Performing CaptureQueryStatsJob (Job ID: c1a9c689-f2b8-4576-aeba-7202d3072e8e) from GoodJob(default) enqueued at 2024-07-16T14:29:59.967247969Z

2024-07-16T14:30:00.012 app[28674d2c9214e8] ams [info] E, [2024-07-16T14:30:00.012721 #314] ERROR -- : [ActiveJob] Failed enqueuing CaptureQueryStatsJob to GoodJob(default): ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_good_jobs_on_cron_key_and_cron_at_cond"

2024-07-16T14:30:00.012 app[28674d2c9214e8] ams [info] DETAIL: Key (cron_key, cron_at)=(query_stats, 2024-07-16 14:30:00) already exists.

2024-07-16T14:30:00.012 app[28674d2c9214e8] ams [info] )

2024-07-16T14:30:00.033 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:30:00.033113 #314] INFO -- : [ActiveJob] [CaptureQueryStatsJob] [c1a9c689-f2b8-4576-aeba-7202d3072e8e] Performed CaptureQueryStatsJob (Job ID: c1a9c689-f2b8-4576-aeba-7202d3072e8e) from GoodJob(default) in 45.22ms

2024-07-16T14:30:00.033 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:30:00.033415 #314] INFO -- : Executed GoodJob c1a9c689-f2b8-4576-aeba-7202d3072e8e

2024-07-16T14:30:00.049 app[28674d2c9214e8] ams [info] I, [2024-07-16T14:30:00.049093 #314] INFO -- : GoodJob destroyed 0 preserved job execution records finished before 2024-07-02 16:30:00 +0200.

Putting the timestamps together:

- 13:59:59.975112: UNLISTEN, milliseconds prior to the cron task being enqueued/processed
- 13:59:59.979226: cron task tries to enqueue
- 13:59:59.986: Notifier error
- 14:00:00.009459: failed to enqueue the job: unique index violation
- 14:30:00.033113: job picked up (by God knows what) and processed
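For context on why the race shows up as RecordNotUnique rather than duplicate jobs: the constraint firing here is GoodJob's unique index on (cron_key, cron_at), whose name appears in the logs above. A hedged sketch of what such an index looks like as a Rails migration; GoodJob's real migration may differ in its exact condition:

```ruby
# Sketch only; the partial-index condition is an assumption.
class AddCronUniqueIndexSketch < ActiveRecord::Migration[7.1]
  def change
    # With a unique index on (cron_key, cron_at), when two processes race to
    # enqueue the same cron occurrence, the second INSERT raises
    # PG::UniqueViolation instead of creating a duplicate job.
    add_index :good_jobs, [:cron_key, :cron_at],
              unique: true,
              where: "cron_key IS NOT NULL",
              name: :index_good_jobs_on_cron_key_and_cron_at_cond
  end
end
```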

luizkowalski commented 1 month ago

Testing GoodJob 4.1 now; I will let it run for some time and see whether it fails again.

luizkowalski commented 1 month ago

Running without issues for 2 days now. Seems like the keepalive fixed it!

bensheldon commented 1 month ago

🙌 Yay! Thank you so much for reporting this and the follow-up!

luizkowalski commented 1 month ago

thank you as always for the quick help and fix!