Open 2called-chaos opened 2 years ago
That looks like some kind of weird connection sharing incident that isn't suppose to happen. DJ automatically disconnects before fork and reconnects after. Maybe you have a thread sneaking in hitting the connection after we disconnect but before we fork.
I'm not quite sure what exactly is at fault here but let me explain.
We use delayed_job(_active_record) for almost a decade without any real issues. We have DJs on multiple servers started with
--monitor --prefix tomatic --number_of_workers 5
. Everything runs on Ubuntu (currently 20.04) and MariaDB (via mysql2).We upgraded servers once and one in thousand restarts an issue occurs:
One DJ master seemingly bugs out and all its workers (and all jobs) fail like this
This kills the worker but due to
--monitor
they eventually restart (still rogue). The job record however ends up like thisIt is locked but does not appear to have failed (last_error, failed_at). So this rogue DJ will just lock tasks when it gets the chance which will effectively delay these tasks for max_runtime at a minimum.
We barely every noticed this as it only happened so rarely and we deployed rather often.
A few years go down and we started to use coverband to get some more insights on our old codebase. We reported a bug which resulted in log spam and as soon as it was fixed DJ went rogue a lot of the time. So we are currently version pinned.
Also consider this:
From: https://groups.google.com/g/sequel-talk/c/domXZyPmvLw/m/NWCJRrbaAAAJ
Now we tried upgrading to Ruby 3.x and suddenly we started to run into this issue again. Different server, virtually the same server environment (same os, memcache, mariadb although not replicated) I cannot reproduce this race condition on our staging server/env except for one time.
Somewhere is a race condition and I don't know what to do. I'm not sure if it has something to do with daemons or mysql or if we do something abhorrently wrong (which worked for like 10 years) but it seems to happen before forking as workers keep getting respawned in a rogue state until we actually restart DJ. Any ideas?