Delayed job stops, the jobs are stuck in postgres

collectiveidea / delayed_job_active_record

ActiveRecord backend integration for DelayedJob 3.0+

MIT License


Delayed job stops, the jobs are stuck in postgres #195

Open Chandananimmu opened 3 years ago

Chandananimmu commented 3 years ago

@albus522 I'm using delayed_job 4.1.9 and delayed_job_active_record 4.1.6. When I try to send 10,000 emails via delayed_job, it gets stuck and the worker stops processing; the jobs remain in Postgres. Could you please help me resolve this?

kaylareopelle commented 2 years ago

I think I may be having a similar issue! I'm interested to learn about any solutions folks have found.

davidkrider commented 2 years ago

I get these in my log:

I, [2022-02-10T21:29:22.975757 #3950] INFO -- : 2022-02-10T21:29:22-0500: [Worker(delayed_job host:miner pid:3950)] Error while reserving job: PG::UnableToSend: SSL SYSCALL error: EOF detected
I, [2022-02-10T21:29:27.977649 #3950] INFO -- : 2022-02-10T21:29:27-0500: [Worker(delayed_job host:miner pid:3950)] Error while reserving job: PG::UnableToSend: no connection to the server
... x9
F, [2022-02-10T21:30:07.993164 #3950] FATAL -- : PG::UnableToSend: no connection to the server

I get 9 copies of that second message, then it gives up, and the delayed_job daemon falls over.
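The pattern in that log (repeated reserve errors, then a fatal exit) can be modelled as a consecutive-failure counter. What follows is only a rough sketch of that behaviour, not delayed_job's actual code: `ReserveLoop`, `ConnectionLost`, and `FatalBackendError` are hypothetical names, and the threshold of 10 is inferred from the log above (one EOF error plus nine "no connection" messages), not taken from the library.

```ruby
# Hypothetical model of a worker's reserve loop; not delayed_job's own code.
class ConnectionLost < StandardError; end
class FatalBackendError < StandardError; end

class ReserveLoop
  attr_reader :log

  # max_failures is an assumption inferred from the log in this thread.
  def initialize(max_failures: 10, &reserve)
    @max_failures = max_failures
    @reserve = reserve
    @failed_count = 0
    @log = []
  end

  # Returns the reserved job, or nil if this attempt failed.
  # Raises FatalBackendError once max_failures attempts fail in a row.
  def reserve_once
    job = @reserve.call
    @failed_count = 0 # a successful reserve resets the counter
    job
  rescue ConnectionLost => e
    @failed_count += 1
    @log << "Error while reserving job: #{e.message}"
    raise FatalBackendError, e.message if @failed_count >= @max_failures
    nil
  end
end
```

If this model is right, the daemon exits because the database stays unreachable across every retry, so a process supervisor that restarts the worker may be needed to recover without manual intervention.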

brijeshs-atharvasystem commented 1 year ago

I am also facing the same issue, but for one specific queue only.

delayed_job (4.1.11)
delayed_job_active_record (4.1.7)
Postgres 15

In my case, it does not happen with every job. I have a few queues, and the issue occurs in only one of them. Even in that queue, jobs are performed and deleted most of the time, but sometimes a job is not deleted. When that happens, the pending jobs get stuck and are not processed, and I receive this error:

Error: execution expired (Delayed::Worker.max_run_time is only 14400 seconds) (Delayed::WorkerTimeout)

Note: I have been facing this issue since I upgraded the Postgres version from 11 to 15.

Please let me know if anyone has found a solution to this issue. Thanks.

davidkrider commented 1 year ago

My app and database run in Azure. In my case, I finally figured out that Microsoft was updating my Postgres instance and restarting it without any notification. (This seems crazy to me.) This was killing my long-running jobs, and leaving me with this error message. I thought the messages were confusing, because they were telling me that the database had dropped, but it seemed perfectly fine. However, they were actually correct. I finally found a setting on the Postgres service to limit updates and restarts to only critical fixes, and this has stopped happening for me.

brijeshs-atharvasystem commented 1 year ago

@davidkrider Thanks for the reply. In my case, I upgraded Postgres to version 15, but I am not sure that is the cause of this issue. If the Postgres version were the problem, the other queues would be affected too, which is not the case here.

I observed that the job is performed but is never unlocked (locked_at is not updated), so it is not deleted.
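One way to confirm this symptom is to look for rows whose lock is older than the worker's max_run_time. Here is a minimal sketch of that check, run against in-memory stand-ins rather than the real table: `JobRow` and `stale_locked_jobs` are hypothetical names, the 14400-second limit comes from the error message quoted earlier, and the `locked_by` and `failed_at` fields are assumed to mirror columns in the delayed_jobs table.

```ruby
# Hypothetical in-memory stand-in for rows of the delayed_jobs table.
JobRow = Struct.new(:id, :queue, :locked_at, :locked_by, :failed_at, keyword_init: true)

# Seconds; taken from the WorkerTimeout message quoted in this thread.
MAX_RUN_TIME = 14_400

# Returns rows that are still locked, have not been marked failed,
# and have held their lock for longer than max_run_time.
def stale_locked_jobs(rows, now: Time.now, max_run_time: MAX_RUN_TIME)
  rows.select do |row|
    row.locked_at && row.failed_at.nil? && (now - row.locked_at) > max_run_time
  end
end
```

Running the same condition against the real table (for example as a SQL query on locked_at) would show whether the stuck jobs are all held by one worker or one queue.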

brijeshs-atharvasystem commented 1 year ago

It seems like there was some issue with Ruby 2.7.3. I downgraded the version to 2.6.5 and all queues are working fine.