Sometimes, jobs from dead workers are not recovered

juledwar commented 2 years ago

I'm filing this as a placeholder and will fill in more info as and when I can as the data I have is quite sparse and I'm not sure where else to look at the moment.

However - we're seeing jobs that don't get recovered in some cases where a worker thread dies. Since most of our jobs are I/O bound, we have many threads inside one process, and after a worker dies, we get messages like this (in docker-compose logs):

helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 1 jobs were re-enqueued 
helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 0 jobs were re-enqueued 
helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 0 jobs were re-enqueued 
helios_1        | Worker 2c2db37f-f554-4a60-85c3-b24e2c2c2dc6 on 846e083b0422 marked as dead, 0 jobs were re-enqueued

Note that the same worker is mentioned more than once, and the subsequent log lines don't recover anything. This makes me think a loop is going wrong somewhere and iterating over the same worker object.

I can reproduce this at will in my application, however I don't yet know how to boil it down to a simple test case.

juledwar commented 2 years ago

I also note that the failure handler is not called in this scenario. I think there needs to be some notification that the job died and was re-enqueued.

juledwar commented 2 years ago

I can reproduce this at will with a simple Task that does something like this:

@tasks.task(name='display')
def display(arg):
    logger.debug(f"DOING: {arg}")
    import time
    time.sleep(125)
    logger.debug(f"DONE: {arg}")

I wait until I see the DOING log, then kill the broker and restart it. After the dead broker threshold expires, I see a message as above saying the worker was marked as dead and 0 jobs are re-enqueued.

juledwar commented 2 years ago

@NicolasLM Do you have any thoughts on this before I spend time trying to trace it?

juledwar commented 2 years ago

I have got to the bottom of this. In my example, there's no "max_retries" defined, so it never gets moved to a new broker. This got me looking more closely at the enqueue_jobs_from_dead_broker.lua script and I can see that it also doesn't re-enqueue jobs if the max_retries was exceeded. This is very surprising behaviour for me, and because the failure handler is not called it means jobs can be left in an inconsistent state.

The easiest and most obvious change to make is to remove the max_retries check. It makes some sense as I don't think the fact that a broker died should be considered an inherent job failure.

NicolasLM commented 2 years ago

Hi, as you figured it is actually the intended behavior. With Spinach max_retries == 0 means the task is not idempotent and should never be retried in case of failure. Since Spinach cannot know which user's code is safe to rerun, it defaults to assuming that all tasks are not safe to be retried, hence why tasks needs to opt-in to be rerun in case of a dead broker.

NicolasLM / spinach

Sometimes, jobs from dead workers are not recovered #14