bensheldon / good_job

Multithreaded, Postgres-based, Active Job backend for Ruby on Rails.
https://goodjob-demo.herokuapp.com/
MIT License
2.59k stars 193 forks source link

Some jobs don't start immediately after scheduled - LISTEN/NOTIFY issue? #1419

Open amitkaz opened 1 month ago

amitkaz commented 1 month ago

I have been testing GJ for a queue of infrequent jobs, each job taking ~20 seconds to complete.

Most jobs are starting immediately, but sometimes there's a few seconds delay - which I assume means that the LISTEN/NOTIFY failed to notify a worker and the job was picked with the polling interval.

Queue has at most 5 jobs running concurrently (most of the time just 1 job running concurrently)

Attached below is an example of 9 jobs in a span of 13 minutes:

js2 = GoodJob::Job.where(scheduled_at: DateTime.new(2024, 7, 14, 15, 40)..DateTime.new(2024, 7, 14, 15, 53)).to_a
js2.sort_by(&:scheduled_at).each { |j| puts [j.scheduled_at, j.performed_at, j.finished_at].join(" | ") }}

Scheduled  | Performed| Finished
---------------------------------
1 15:33:55 | 15:33:55 | 15:34:10
2 15:36:41 | 15:36:41 | 15:36:53
3 15:36:48 | 15:36:48 | 15:37:19
4 15:41:50 | 15:41:50 | 15:42:03
5 15:44:29 | 15:44:34 | 15:44:47
6 15:46:29 | 15:46:29 | 15:46:47
7 15:49:07 | 15:49:07 | 15:49:26
8 15:51:44 | 15:51:44 | 15:51:54
9 15:52:38 | 15:52:43 | 15:53:00

Jobs 1-4, 6-8 started immediately Jobs 5,9 started ~5 seconds after scheduled, and there are no other jobs running at these times

Other info

GoodJob is configured as async with poll_interval of 10 (so ~5 seconds delay is suspiciously in the middle!), every other parameter is default.

Running on heroku performance dyno and PUMA with WEB_CONCURRENCY=4 I've changed puma.rb according to https://github.com/bensheldon/good_job?tab=readme-ov-file#execute-jobs-async--in-process

There wasn't any meaningful load on the web server at the time.

Any idea what can be the issue causing the jobs to not start immediately?

Thank you ❤️

bensheldon commented 1 month ago

That's interesting/unexpected! Could you share also the created_at of the jobs too in your table? I wanted to see if these were scheduled to run in the future or not. Also, could you query GoodJob::Execution just to doublecheck that nothing else is truly running at the same time?

Just for some background: when in async, the process that enqueues the job will try to execute it locally, if there isn't local capacity (any available execution threads), it will then emit a NOTIFY to other LISTENing processes. And then, if the job is scheduled in the future, those processes will put the future-scheduled time in a cache (this is a Concurrent Ruby thing) and the process will awaken at (or around) that time and try to execute a job (it checks for the next available job rather than specifically that job)

amitkaz commented 1 month ago

Thank you for the quick response!

Yup all jobs where not scheduled in the future and I've confirmed that the created_at is the same as scheduled_at (first column)

15:33:55 | 15:33:55 | 15:33:55 | 15:34:10
15:36:41 | 15:36:41 | 15:36:41 | 15:36:53
15:36:48 | 15:36:48 | 15:36:48 | 15:37:19
15:41:50 | 15:41:50 | 15:41:50 | 15:42:03
15:44:29 | 15:44:29 | 15:44:34 | 15:44:47
15:46:29 | 15:46:29 | 15:46:29 | 15:46:47
15:49:07 | 15:49:07 | 15:49:07 | 15:49:26
15:51:44 | 15:51:44 | 15:51:44 | 15:51:54
15:52:38 | 15:52:38 | 15:52:43 | 15:53:00
15:52:43 | 15:52:43 | 15:52:44 | 15:53:00

This is the result for GoodJob::Execution (looks exactly the same)

 15:33:55 | 15:33:55 | 15:33:55 | 15:34:10
15:36:41 | 15:36:41 | 15:36:41 | 15:36:53
15:36:48 | 15:36:48 | 15:36:48 | 15:37:19
15:41:50 | 15:41:50 | 15:41:50 | 15:42:03
15:44:29 | 15:44:29 | 15:44:34 | 15:44:47
15:46:29 | 15:46:29 | 15:46:29 | 15:46:47
15:49:07 | 15:49:07 | 15:49:07 | 15:49:26
15:51:44 | 15:51:44 | 15:51:44 | 15:51:54
15:52:38 | 15:52:38 | 15:52:43 | 15:53:00
15:52:43 | 15:52:43 | 15:52:44 | 15:53:00

So if I'm understanding to your explanation there shouldn't actually be any NOTIFY given that no other job is running, and there should be capacity to run it locally?

Is there anything else that I can check to see if there should be capacity or is it just the GoodJob::Execution table? also I'm on GJ 3.30 (I know you made some changes recently!)

if it did start we'd see executions_count be greater then 1 right? (it is 1, see below) - so does it mean it didn't execute at all locally then was picked up by the poll interval?

Also confirmed that I dont see any errors during that time on bugsnag (unless its something that is so bad that it doesn't even notify)

#<GoodJob::Execution:0x00007f97394af358
 id: "f89100eb-acc5-4dcd-9200-f575099632f8",
 queue_name: "default",
 priority: 0,
 serialized_params:
  {"job_id"=>"f89100eb-acc5-4dcd-9200-f575099632f8",
   "locale"=>"en",
   "priority"=>nil,
   "timezone"=>"UTC",
   "arguments"=>[1299, "d51fbba6-0529-4f50-b9a8-070d18e2dac2"],
   "job_class"=>"AdvSearchJob",
   "executions"=>0,
   "queue_name"=>"default",
   "enqueued_at"=>"2024-07-14T15:44:29.960717929Z",
   "scheduled_at"=>nil,
   "provider_job_id"=>nil,
   "exception_executions"=>{}},
 scheduled_at: Sun, 14 Jul 2024 15:44:29.960949000 UTC +00:00,
 performed_at: Sun, 14 Jul 2024 15:44:34.537569000 UTC +00:00,
 finished_at: Sun, 14 Jul 2024 15:44:47.520852000 UTC +00:00,
 error: nil,
 created_at: Sun, 14 Jul 2024 15:44:29.960949000 UTC +00:00,
 updated_at: Sun, 14 Jul 2024 15:44:47.522813000 UTC +00:00,
 active_job_id: "f89100eb-acc5-4dcd-9200-f575099632f8",
 concurrency_key: nil,
 cron_key: nil,
 retried_good_job_id: nil,
 cron_at: nil,
 batch_id: nil,
 batch_callback_id: nil,
 is_discrete: true,
 executions_count: 1,
 job_class: "AdvSearchJob",
 error_event: nil,
 labels: nil,
 locked_by_id: nil,
 locked_at: nil>

Also if I understand correctly, a workaround would be to try to run it as "external" to use NOTIFY instead of running locally on the web server, and hope it works in case the issue is the "running locally" part?

bensheldon commented 1 month ago

Thanks for sharing those. That implies that yes, for some unknown reason jobs are not being run in a timely manner.

I would say the quickest option for you to reduce queue latency would be to try reducing your poll interval to 5 seconds, and see if that changes the timing you see.

For GoodJob generally: I'm... not sure. I'm assuming:

Assuming you've applied all the migrations, GoodJob v3.30 is pretty equivalent to 4.0.0; For 3.30 you want to be using GoodJob::DiscreteExecutions which is renamed to Executions in 4.0. You could go back to the executions table and see if the process_id column on GoodJob::DiscreteExecutions is all the same (which would imply only 1 process is executing all the jobs instead of them being distributed).

But I'm stumped tbh