closeio / tasktiger

Python task queue using Redis
MIT License
1.42k stars 80 forks source link

Only error tasks if the failure is within the task implementation #300

Open nsaje opened 1 year ago

nsaje commented 1 year ago

Errors like TaskImportError or errors stemming from a broken child context manager shouldn't move the task to the ERROR state - it should remain in QUEUED.

This way when we somehow break TaskTiger outside of the task implementation we don't need to worry about which tasks should be retried and make recovery far easier.

To implement this we'd need a way for the child process to signal an "abort". Currently we return exit code 0 in case of success and 1 in case of error. We could add exit code 2 to signify an abortion of task processing, in which case the parent process could move the task back into the QUEUED state.

WDYT @thomasst ?

tsx commented 1 year ago

Not all TaskImportErrors are equal. There are two kinds:

Unless we have a reliable way to distinguish between the two kinds of failures (honestly I don't see how a worker can know whether the task is expected to come in the future or not - we don't have the crystal balls necessary for that), I wouldn't recommend leaving those tasks as Queued.

That said, keeping a list of tasks that have failed with an import error, and retrying them on next worker restart (redeployment presumably) might make recovery (in both scenarios) easier by reducing the need for manually tracking down tasks and removing them.

As for the signalling mechanism - I don't think using exit codes is the best idea. I think the child process can just catch the appropriate ImportError exceptions and write some richer state/report to TT redis instead of trying to communicate through a 1-bit channel. It already stores the execution data from the forked child, so why not add it there.

nsaje commented 1 year ago

I think both of those cases would be better served by leaving the task in Queued state but still loudly logging an error (rollbar etc.). This way the manual intervention needed would be reduced to just fixing the issue in the code, whereas right now you also have to figure out exactly which tasks failed because of this issue and retry them manually.

The parent process handles the state/queue transitions so we do need to get that info back from the child somehow. Passing it via a Redis entry sounds fine as well, I don't have a strong opinion on it.

thomasst commented 1 year ago

Do you have an example where a broken child context manager caused a TaskImportError?

nsaje commented 1 year ago

No, there's an "or" in there :-) e.g. either TaskImportError or any error caused by the context managers.