jonathan-golorry opened 4 years ago
Is there anyone who knows what might be causing this? Increasing the worker count isn't helping much, and it's causing me problems in production. I'm trying to go through the source code, but I'm not very familiar with Python multiprocessing.
It looks like the design of using a guard to reincarnate workers after a crash is fundamentally flawed. If the worker that died was holding the lock on the shared Queue, the Queue will stay locked forever. This is expected: https://bugs.python.org/issue20527
The only way I can think of to recover from this situation is to trigger a Sentinel restart if we detect a permanently locked Queue. A locked Queue can be detected by putting timeouts on the `task_queue.get()` calls and then checking whether `task_queue.empty()` still returns `False`. This also lets us detect workers that are ready to stop (and not in the middle of a task).
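A minimal sketch of that detection, assuming a plain `multiprocessing.Queue` and a hypothetical `process()` task handler (this is not django-q's actual worker code):

```python
import queue  # multiprocessing.Queue.get raises queue.Empty on timeout

def worker_loop(task_queue, get_timeout=10, stuck_threshold=5):
    stuck_count = 0
    while True:
        try:
            task = task_queue.get(timeout=get_timeout)
        except queue.Empty:
            if not task_queue.empty():
                # Items are waiting but we can't read them, so the queue's
                # internal reader lock is probably held by a dead worker.
                stuck_count += 1
                if stuck_count >= stuck_threshold:
                    # Here we'd signal the Sentinel to restart; raising is a placeholder.
                    raise RuntimeError("task queue appears permanently locked")
            else:
                # Genuinely idle: a safe point to stop this worker if asked to.
                stuck_count = 0
            continue
        stuck_count = 0
        process(task)  # hypothetical task handler
```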
An alternative would be to add a signal handler to workers that releases any locks they hold.
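A rough sketch of that idea, with hypothetical names, and with the caveat that it only helps for catchable signals (a worker killed with SIGKILL, e.g. by the OOM killer, still leaves its locks held):

```python
import signal

def install_lock_release_handler(held_locks):
    """Register a SIGTERM handler that releases any locks this worker holds.

    held_locks is a hypothetical list the worker appends to while it owns a
    lock. The Queue's own reader lock is a private attribute, so reaching it
    would mean touching multiprocessing internals.
    """
    def handler(signum, frame):
        for lock in held_locks:
            try:
                lock.release()
            except (ValueError, AssertionError):
                pass  # this process doesn't actually hold it
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, handler)
```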
I seem to be facing the same issue, is there any news on improvements?
Did you manage to come up with a fix for this? I see this issue is rare and goes back to 2016 :sweat_smile: I'm currently facing the same thing and not looking forward to spending much time figuring out what's causing it.
We've been having the same problem: the cluster randomly stops processing new items once a worker dies. Has anyone been able to solve it?
@Shwetabhk this likely requires a full rewrite. Django-q is built entirely around shared queues, and once a queue gets locked (because one of the workers failed while holding its lock), the cluster can't recover. Using pipes instead of queues is one potential solution; I have been working on that in this PR: https://github.com/django-q2/django-q2/pull/78. It's not fully baked yet, but the PoC is working.
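For illustration only (this is not the code from that PR), a per-worker `multiprocessing.Pipe` layout avoids the shared lock entirely, so a dead worker only breaks its own channel:

```python
import multiprocessing as mp

def worker(conn):
    while True:
        task = conn.recv()          # only this worker reads this end
        if task is None:            # sentinel value: shut down cleanly
            break
        conn.send(("done", task))   # hypothetical result reporting

def dispatch(tasks, n_workers=2):
    workers = []
    for _ in range(n_workers):
        parent_end, child_end = mp.Pipe()
        proc = mp.Process(target=worker, args=(child_end,))
        proc.start()
        workers.append((proc, parent_end))

    # Round-robin dispatch: a dead worker shows up as is_alive() == False
    # (or a BrokenPipeError) instead of silently wedging a shared queue lock.
    for i, task in enumerate(tasks):
        proc, conn = workers[i % n_workers]
        if proc.is_alive():
            conn.send(task)
        # else: a real implementation would reincarnate the worker here

    for proc, conn in workers:
        if proc.is_alive():
            conn.send(None)         # ask workers to exit (results left unread for brevity)
        proc.join()
```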
Another potential solution would be to restart the entire cluster once a queue gets locked (i.e. can't be read within x amount of time), but that's not bulletproof and could result in data loss for tasks that recently finished.
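A minimal sketch of that fallback, using hypothetical cluster methods (django-q doesn't expose this API):

```python
import time

def watchdog(cluster, task_queue, read_timeout=60, poll_interval=5):
    """Restart the whole cluster if queued items stop being read."""
    last_progress = time.monotonic()
    while True:
        if cluster.made_progress_recently():         # hypothetical progress check
            last_progress = time.monotonic()
        elif (not task_queue.empty()
              and time.monotonic() - last_progress > read_timeout):
            # Items are queued but nothing has been read for too long:
            # assume the queue lock is wedged and rebuild everything.
            cluster.stop()    # hypothetical: tear down workers and the old queue
            cluster.start()   # hypothetical: fresh queue and fresh workers
            last_progress = time.monotonic()
        time.sleep(poll_interval)
```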
I think I've found the cause of https://github.com/Koed00/django-q/issues/218 and https://github.com/Koed00/django-q/issues/200. Django Q generally cycles through which worker gets the next task. If the next worker in that cycle gets killed, the cluster will fail to give out any future tasks, even if the worker got reincarnated. This will also prevent the cluster from shutting down properly (I've been killing the guard task and then killing the remaining tasks).
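For anyone who wants to see the underlying failure outside of django-q, here's a hedged, POSIX-only reproduction of the locked-queue behavior from bpo-20527 (the sleep makes it timing-dependent, so treat it as a sketch rather than a test):

```python
import multiprocessing as mp
import os
import signal
import time

def consumer(q, name):
    print(name, "waiting")
    print(name, "got", q.get())      # blocks while holding the queue's reader lock

if __name__ == "__main__":
    q = mp.Queue()

    first = mp.Process(target=consumer, args=(q, "first"))
    first.start()
    time.sleep(2)                        # let it block inside get()
    os.kill(first.pid, signal.SIGKILL)   # dies still holding the reader lock

    second = mp.Process(target=consumer, args=(q, "second"))
    second.start()
    q.put("hello")                       # never delivered: "second" is stuck
                                         # waiting for the lock "first" held
    second.join(5)
    print("second still blocked:", second.is_alive())  # expected: True
    second.terminate()                   # clean up the stuck process
    second.join()
```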
This is a major problem in 1-worker clusters, where the worker dying always causes the cluster to stop functioning. I suspect 2-worker clusters are much more than twice as reliable, because workers are most likely to die while processing a task rather than while waiting as the next reader. A good stopgap would be to put a minimum on the default number of workers (which currently defaults to `multiprocessing.cpu_count()`); a sketch of that is below.
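For example, a user-side version of that stopgap in Django settings, assuming the standard `Q_CLUSTER` configuration dict:

```python
import multiprocessing

Q_CLUSTER = {
    "name": "default",
    "workers": max(2, multiprocessing.cpu_count()),  # never run a 1-worker cluster
    # ... other cluster options unchanged
}
```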
Here's a log of a 2-worker cluster:
Cluster initialization:
Alternating workers:
1:2 is next, so killing 1:1 doesn't cause an issue. Both 1:2 and 1:5 process tasks.
1:2 is next, so killing 1:5 is safe.
Again, 1:2 is next, so killing 1:7 is safe.
Now 1:7 is next, so killing 1:7 causes the cluster to stop processing tasks.
Here's proof that you can kill all the original workers and still have things working:
Here's a 10-worker cluster that I brought down with the same pattern:
When tasks are received at the same time, the pattern is less clear. Both of these examples broke the cluster: