NicolasLM / spinach

Modern Redis task queue for Python 3
https://spinach.readthedocs.io
BSD 2-Clause "Simplified" License
63 stars 4 forks source link

Signal job failures when recovering dead jobs #16

Closed bigjools closed 2 years ago

bigjools commented 2 years ago

The enqueue_jobs_from_dead_broker.lua script doesn't re-enqueue jobs if the max_retries was exceeded. This is very surprising behaviour for most people, and because the failure handler is not called it means jobs can be left in an inconsistent state.

This change will make sure that the failure handler is called and the job moved to the FAILED state.

Drive-by: Add functional tests for concurrency that I forgot to git add previously.

Drive-by: Add tags file to .gitignore

Drive-by: add flake8 to tox venv dependencies so that vscode works better in that venv

Fixes: https://github.com/NicolasLM/spinach/issues/14

juledwar commented 2 years ago

Ok so this version catches failed jobs as a cause of the dead worker, and sends appropriate signals.

bigjools commented 2 years ago
helios_1        | Worker bbed04e7-113a-4153-9760-dfa6235fa5b9 on 2a57c732afed marked as dead, 0 jobs were re-enqueued
helios_1        | Error during execution 3/3 of Job <display FAILED cad958c4-4d54-4f6e-a59e-142997505dc2> after 0 ms
helios_1        | Exception: Worker 2a57c732afed died and max_retries exceeded

Tested in anger on an actual deployment and my job was eventually marked failed correctly.

bigjools commented 2 years ago

@NicolasLM Would you care to review and merge this please? Would you also consider giving us write access to your repo so we can manage some of this?

NicolasLM commented 2 years ago

Sorry for the delay, I've been quite busy and this fell off my radar.

I am all for making changes so that failed jobs always call the failure handler, however it seems to remove an inconsistency by introducing another one:

The second option seems even more surprising to be than the current behavior. Please let me know your thoughts.

NicolasLM commented 2 years ago

@bigjools I gave you write access to the repository if you need it.

bigjools commented 2 years ago

Thanks for replying Nicolas and also thanks for giving me write access, I promise to be careful :) I'm not sure if you want to give me write access to pypi as well so I can make releases?

You raise a good point about the non-idempotent jobs not raising a signal. Given that we're desperate to get this change released and into production, and the existing behaviour has not changed for non-idempotent jobs (no new inconsistency has been introduced!), I'm inclined to file a ticket and leave it to a follow-up branch. We don't make use of these types of jobs at all and consider them fair fodder for unacknowledged death, but I can see that at least it should send a signal like it would do under normal circumstances.