collectiveidea / delayed_job

Database based asynchronous priority queue system -- Extracted from Shopify
http://groups.google.com/group/delayed_job
MIT License
4.82k stars 954 forks source link

Race condition but how to debug? #1179

Open 2called-chaos opened 2 years ago

2called-chaos commented 2 years ago

I'm not quite sure what exactly is at fault here but let me explain.

We use delayed_job(_active_record) for almost a decade without any real issues. We have DJs on multiple servers started with --monitor --prefix tomatic --number_of_workers 5. Everything runs on Ubuntu (currently 20.04) and MariaDB (via mysql2).

We upgraded servers once and one in thousand restarts an issue occurs:

One DJ master seemingly bugs out and all its workers (and all jobs) fail like this

I, [2022-11-02T14:43:25.302511 #2521391]  INFO -- : 2022-11-02T14:43:25+0100: [Worker(delayed_job.4 host:www6 pid:2521391)] Job Order.save_ipinfo (id=25491141) RUNNING
E, [2022-11-02T14:43:25.323137 #2521391] ERROR -- : 2022-11-02T14:43:25+0100: [Worker(delayed_job.4 host:www6 pid:2521391)] Job Order.save_ipinfo (id=25491141) FAILED (0 prior attempts) with ActiveRecord::StatementInvalid: Mysql2::Error::ConnectionError: Got packets out of order

This kills the worker but due to --monitor they eventually restart (still rogue). The job record however ends up like this

2022-11-14 14 11 16

It is locked but does not appear to have failed (last_error, failed_at). So this rogue DJ will just lock tasks when it gets the chance which will effectively delay these tasks for max_runtime at a minimum.

We barely every noticed this as it only happened so rarely and we deployed rather often.


A few years go down and we started to use coverband to get some more insights on our old codebase. We reported a bug which resulted in log spam and as soon as it was fixed DJ went rogue a lot of the time. So we are currently version pinned.

Also consider this:

When we include a logger in the delayed_job startup script the job succeeds. When we do not include a logger in the delayed_job startup script we get the error (full stack attached): Mysql2::Error: Got packets out of order

From: https://groups.google.com/g/sequel-talk/c/domXZyPmvLw/m/NWCJRrbaAAAJ


Now we tried upgrading to Ruby 3.x and suddenly we started to run into this issue again. Different server, virtually the same server environment (same os, memcache, mariadb although not replicated) I cannot reproduce this race condition on our staging server/env except for one time.

Somewhere is a race condition and I don't know what to do. I'm not sure if it has something to do with daemons or mysql or if we do something abhorrently wrong (which worked for like 10 years) but it seems to happen before forking as workers keep getting respawned in a rogue state until we actually restart DJ. Any ideas?

albus522 commented 3 months ago

That looks like some kind of weird connection sharing incident that isn't suppose to happen. DJ automatically disconnects before fork and reconnects after. Maybe you have a thread sneaking in hitting the connection after we disconnect but before we fork.