Closed by juledwar 2 years ago
I also note that the failure handler is not called in this scenario. I think there needs to be some notification that the job died and was re-enqueued.
I can reproduce this at will with a simple Task that does something like this:
import time

@tasks.task(name='display')
def display(arg):
    logger.debug(f"DOING: {arg}")
    # Long-running, so the job is still in flight when the broker is killed.
    time.sleep(125)
    logger.debug(f"DONE: {arg}")
I wait until I see the DOING log, then kill the broker and restart it. After the dead broker threshold expires, I see a message as above saying the worker was marked as dead and 0 jobs are re-enqueued.
@NicolasLM Do you have any thoughts on this before I spend time trying to trace it?
I have got to the bottom of this. In my example, no max_retries is defined, so the job never gets moved to a new broker. This got me looking more closely at the enqueue_jobs_from_dead_broker.lua script, and I can see that it also doesn't re-enqueue jobs whose max_retries has been exceeded. This is very surprising behaviour to me, and because the failure handler is not called, jobs can be left in an inconsistent state.
The easiest and most obvious change to make is to remove the max_retries check. It makes some sense as I don't think the fact that a broker died should be considered an inherent job failure.
Hi, as you figured out, this is actually the intended behavior. In Spinach, max_retries == 0 means the task is not idempotent and should never be retried in case of failure. Since Spinach cannot know which user code is safe to rerun, it defaults to assuming that no task is safe to retry, which is why tasks need to opt in to being rerun in case of a dead broker.
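To make the rule discussed above concrete, here is a minimal sketch in plain Python of how a dead broker's running jobs would be split. It is a simplified model built only from the description in this thread, not a translation of enqueue_jobs_from_dead_broker.lua. The Job class, the split_dead_broker_jobs function, and the exact comparison used for "retries exceeded" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a job record, not Spinach's real Job class."""
    task_name: str
    max_retries: int = 0  # 0: task is treated as not idempotent, never rerun
    retries: int = 0


def split_dead_broker_jobs(jobs):
    """Partition a dead broker's running jobs into (re-enqueued, left behind)."""
    reenqueued, left_behind = [], []
    for job in jobs:
        # Assumed rule: the task opted in via max_retries and has retries left.
        if job.max_retries > 0 and job.retries < job.max_retries:
            reenqueued.append(job)
        else:
            left_behind.append(job)
    return reenqueued, left_behind


jobs = [
    Job("display"),                              # no max_retries: the repro case
    Job("idempotent", max_retries=3),            # opted in to being rerun
    Job("exhausted", max_retries=3, retries=3),  # retries already used up
]
moved, stuck = split_dead_broker_jobs(jobs)
print("re-enqueued:", [j.task_name for j in moved])
print("left behind:", [j.task_name for j in stuck])
```

Under this model, the display task from the reproduction lands in the "left behind" group, which would match the "0 jobs are re-enqueued" message. Giving the task a non-zero max_retries is what opts it in.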
I'm filing this as a placeholder and will fill in more info as and when I can. The data I have is quite sparse, and I'm not sure where else to look at the moment.
However - we're seeing jobs that don't get recovered in some cases where a worker thread dies. Since most of our jobs are I/O bound, we have many threads inside one process, and after a worker dies, we get messages like this (in docker-compose logs):
Note that the same worker is mentioned more than once, and the subsequent log lines don't recover anything. This makes me think a loop is going wrong somewhere and iterating over the same worker object.
I can reproduce this at will in my application, however I don't yet know how to boil it down to a simple test case.