jonathan-golorry opened 4 years ago
Is there anyone who knows what might be causing this? Increasing the worker count isn't helping much, and it's causing me problems in production. I'm trying to go through the source code, but I'm not very familiar with Python multiprocessing.
It looks like the design of using a guard to reincarnate workers after a crash is fundamentally flawed. If the worker that died was holding the lock on the shared Queue, the Queue will stay locked forever. This is expected: https://bugs.python.org/issue20527
The only way I can think of to recover from this situation is to trigger a Sentinel restart if we detect a permanently locked Queue. A locked Queue can be detected by putting timeouts on the `task_queue.get()` calls and then checking whether `task_queue.empty()` still returns `False`. This also lets us detect workers that are ready to stop (and not in the middle of a task).
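A minimal sketch of that detection, assuming a plain `multiprocessing.Queue` and a hypothetical `process()` task handler (this is not django-q's actual worker code):

```python
import queue  # multiprocessing.Queue.get raises queue.Empty on timeout

def worker_loop(task_queue, get_timeout=10, stuck_threshold=5):
    stuck_count = 0
    while True:
        try:
            task = task_queue.get(timeout=get_timeout)
        except queue.Empty:
            if not task_queue.empty():
                # Items are waiting but we can't read them, so the queue's
                # internal reader lock is probably held by a dead worker.
                stuck_count += 1
                if stuck_count >= stuck_threshold:
                    # Here we'd signal the Sentinel to restart; raising is a placeholder.
                    raise RuntimeError("task queue appears permanently locked")
            else:
                # Genuinely idle: a safe point to stop this worker if asked to.
                stuck_count = 0
            continue
        stuck_count = 0
        process(task)  # hypothetical task handler
```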
An alternative would be to add a signal handler to workers that releases any locks they hold.
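A rough sketch of that idea, with hypothetical names, and with the caveat that it only helps for catchable signals (a worker killed with SIGKILL, e.g. by the OOM killer, still leaves its locks held):

```python
import signal

def install_lock_release_handler(held_locks):
    """Register a SIGTERM handler that releases any locks this worker holds.

    held_locks is a hypothetical list the worker appends to while it owns a
    lock. The Queue's own reader lock is a private attribute, so reaching it
    would mean touching multiprocessing internals.
    """
    def handler(signum, frame):
        for lock in held_locks:
            try:
                lock.release()
            except (ValueError, AssertionError):
                pass  # this process doesn't actually hold it
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, handler)
```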
I seem to be facing the same issue, is there any news on improvements?
Did you manage to come up with a fix for this? I see this issue is rare and goes back to 2016 :sweat_smile: I'm currently facing the same thing and not looking forward to spending much time figuring out what's causing it.
We've been having the same problem: the cluster randomly stops processing new items once a worker dies. Has anyone been able to solve it?
@Shwetabhk this likely requires a full rewrite. Django-q is built entirely around shared queues, and once a queue gets locked (because one of the workers failed while holding its lock), the cluster can't recover. Using pipes instead of queues is one potential solution; I have been working on that in this PR: https://github.com/django-q2/django-q2/pull/78. It's not fully baked yet, but the PoC is working.
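For illustration only (this is not the code from that PR), a per-worker `multiprocessing.Pipe` layout avoids the shared lock entirely, so a dead worker only breaks its own channel:

```python
import multiprocessing as mp

def worker(conn):
    while True:
        task = conn.recv()          # only this worker reads this end
        if task is None:            # sentinel value: shut down cleanly
            break
        conn.send(("done", task))   # hypothetical result reporting

def dispatch(tasks, n_workers=2):
    workers = []
    for _ in range(n_workers):
        parent_end, child_end = mp.Pipe()
        proc = mp.Process(target=worker, args=(child_end,))
        proc.start()
        workers.append((proc, parent_end))

    # Round-robin dispatch: a dead worker shows up as is_alive() == False
    # (or a BrokenPipeError) instead of silently wedging a shared queue lock.
    for i, task in enumerate(tasks):
        proc, conn = workers[i % n_workers]
        if proc.is_alive():
            conn.send(task)
        # else: a real implementation would reincarnate the worker here

    for proc, conn in workers:
        if proc.is_alive():
            conn.send(None)         # ask workers to exit (results left unread for brevity)
        proc.join()
```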
Another potential solution would be to restart the entire cluster once a queue gets locked (i.e. can't be read within x amount of time), but that's not bulletproof and could result in data loss for tasks that recently finished.
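A minimal sketch of that fallback, using hypothetical cluster methods (django-q doesn't expose this API):

```python
import time

def watchdog(cluster, task_queue, read_timeout=60, poll_interval=5):
    """Restart the whole cluster if queued items stop being read."""
    last_progress = time.monotonic()
    while True:
        if cluster.made_progress_recently():         # hypothetical progress check
            last_progress = time.monotonic()
        elif (not task_queue.empty()
              and time.monotonic() - last_progress > read_timeout):
            # Items are queued but nothing has been read for too long:
            # assume the queue lock is wedged and rebuild everything.
            cluster.stop()    # hypothetical: tear down workers and the old queue
            cluster.start()   # hypothetical: fresh queue and fresh workers
            last_progress = time.monotonic()
        time.sleep(poll_interval)
```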
I think I've found the cause of https://github.com/Koed00/django-q/issues/218 and https://github.com/Koed00/django-q/issues/200. Django Q generally cycles through which worker gets the next task. If the next worker in that cycle gets killed, the cluster will fail to give out any future tasks, even if the worker got reincarnated. This will also prevent the cluster from shutting down properly (I've been killing the guard task and then killing the remaining tasks).
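For anyone who wants to see the underlying failure outside of django-q, here's a hedged, POSIX-only reproduction of the locked-queue behavior from bpo-20527 (the sleep makes it timing-dependent, so treat it as a sketch rather than a test):

```python
import multiprocessing as mp
import os
import signal
import time

def consumer(q, name):
    print(name, "waiting")
    print(name, "got", q.get())      # blocks while holding the queue's reader lock

if __name__ == "__main__":
    q = mp.Queue()

    first = mp.Process(target=consumer, args=(q, "first"))
    first.start()
    time.sleep(2)                        # let it block inside get()
    os.kill(first.pid, signal.SIGKILL)   # dies still holding the reader lock

    second = mp.Process(target=consumer, args=(q, "second"))
    second.start()
    q.put("hello")                       # never delivered: "second" is stuck
                                         # waiting for the lock "first" held
    second.join(5)
    print("second still blocked:", second.is_alive())  # expected: True
    second.terminate()                   # clean up the stuck process
    second.join()
```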
This is a major problem in 1-worker clusters, where the worker dying always causes the cluster to stop functioning. I suspect 2-worker clusters are much more than twice as reliable, because workers are most likely to die while processing a task rather than while waiting as the next reader. A good stopgap would be to put a minimum on the default number of workers (which currently defaults to `multiprocessing.cpu_count()`); a sketch of that is below.
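For example, a user-side version of that stopgap in Django settings, assuming the standard `Q_CLUSTER` configuration dict:

```python
import multiprocessing

Q_CLUSTER = {
    "name": "default",
    "workers": max(2, multiprocessing.cpu_count()),  # never run a 1-worker cluster
    # ... other cluster options unchanged
}
```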
Here's a log of a 2-worker cluster:
Cluster initialization:
Alternating workers:
1:2 is next, so killing 1:1 doesn't cause an issue. Both 1:2 and 1:5 process tasks.
1:2 is next, so killing 1:5 is safe.
Again, 1:2 is next, so killing 1:7 is safe.
Now 1:7 is next, so killing 1:7 causes the cluster to stop processing tasks.
Here's proof that you can kill all the original workers and still have things working:
Here's a 10-worker cluster that I brought down with the same pattern:
When tasks are received at the same time, the pattern is less clear. Both of these examples broke the cluster: