NicolasLM / spinach

Modern Redis task queue for Python 3
https://spinach.readthedocs.io
BSD 2-Clause "Simplified" License

Scheduling jobs with retries and invalid parameters runs the full retry count #4

Closed · 0xDEC0DE closed this 5 years ago

0xDEC0DE commented 5 years ago

Version

Spinach 0.0.10

Steps to Reproduce

A toy test case; note the missing required params in the schedule call:

from spinach import Engine, MemoryBroker

spin = Engine(MemoryBroker())


@spin.task(name='compute', max_retries=3)
def compute(a, b):
    print('Computed {} + {} = {}'.format(a, b, a + b))


spin.schedule('compute')  # required arguments 'a' and 'b' are omitted
spin.start_workers()

Expected Result

Ideally, the job is aborted, since it's been scheduled incorrectly and can never start, much less finish successfully. Failing that, log messages stating what went wrong between retries would be useful.

Actual Result

The worker quietly swallows the exception, reschedules the task, and only logs the last error:

Error during execution 1/4 of Job <compute NOT_SET f33d7b4a-edda-46eb-8a40-e4414876f61e> after 0 ms, retry in 4 s
Error during execution 2/4 of Job <compute NOT_SET f33d7b4a-edda-46eb-8a40-e4414876f61e> after 0 ms, retry in 9 s
Error during execution 3/4 of Job <compute NOT_SET f33d7b4a-edda-46eb-8a40-e4414876f61e> after 0 ms, retry in 20 s
Error during execution 4/4 of Job <compute FAILED f33d7b4a-edda-46eb-8a40-e4414876f61e> after 0 ms
Traceback (most recent call last):
  File "/tmp/venv/lib/python3.7/site-packages/spinach/worker.py", line 74, in _worker_func
    job.task_func(*job.task_args, **job.task_kwargs)
TypeError: compute() missing 2 required positional arguments: 'a' and 'b'

This takes a LONG time, provides no useful context until the retry limit has been hit, clogs up the task queue with garbage, and only gets worse the higher max_retries is.

Workaround

Redefining all task functions with kwargs instead of args, and having each task check for a bad invocation and raise spinach.task.AbortException, largely dodges this, at the cost of a lot of boilerplate checking code in the tasks. A sketch of this is shown below.
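
For illustration, a minimal sketch of that workaround, assuming the same spin engine as above and that AbortException fails the job without consuming the remaining retries, might look like this:

from spinach.task import AbortException

@spin.task(name='compute', max_retries=3)
def compute(**kwargs):
    # The signature no longer enforces required arguments, so validate by hand
    missing = {'a', 'b'} - kwargs.keys()
    if missing:
        # Fail the job outright instead of burning through the retry budget
        raise AbortException('compute() is missing arguments: {}'.format(sorted(missing)))
    a, b = kwargs['a'], kwargs['b']
    print('Computed {} + {} = {}'.format(a, b, a + b))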

NicolasLM commented 5 years ago

Hi, thank you for the detailed bug report, it's super helpful.

The worker quietly swallows the exception, reschedules the task, and only logs the last error.

Definitely, not providing a helpful message until the last failure is wrong. This will be easy to fix.

Ideally, the job is aborted, since it's been scheduled incorrectly and can never start.

There is an edge case to keep in mind: two different versions of an application may run simultaneously on the same queue.

Retrying a job instead of canceling it gives it a chance to land on a worker that can process it successfully (if tasks are not compatible across versions, the developer of the app should instead use different tasks or different queues).

A similar issue exists when a worker receives a task that it does not know about. I decided to retry it for the same reason.

https://github.com/NicolasLM/spinach/blob/3c4d3acc472bec871c6c3945d9f5c8bd47d91b6b/spinach/engine.py#L122-L130

I will check whether it's possible in Python to verify that a set of (*args, **kwargs) is compatible with a function without actually executing it. This check could be added on the client before scheduling the task and would solve some of the problems.

0xDEC0DE commented 5 years ago

I'm left with the impression this is what one is supposed to use:

https://docs.python.org/3/library/inspect.html#introspecting-callables-with-the-signature-object

So something like this in spinach.worker.Workers._worker_func:

from inspect import signature
try:
    signature(job.task_func).bind(*job.task_args, **job.task_kwargs)
except TypeError:
    # arguments can never satisfy the task signature; retrying is pointless
    do_something_clever()
except ValueError:
    # can't inspect the callable, good luck to you!
    pass

...might do the trick.

NicolasLM commented 5 years ago

@0xDEC0DE I pushed a branch that should solve your problem. It does so by checking that arguments are compatible with a task before scheduling, not in the worker like you suggested.

If you still want to be able to cancel an incompatible job in the workers, I am thinking of adding a signal that could be used to override the default retrying behavior. Let me know if that's necessary or if this commit is enough.
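
As a rough illustration of what validating at schedule time means for the original reproduction (the exact exception raised by the branch is an assumption here, not confirmed behavior):

# With client-side validation, the bad call should fail immediately
# instead of retrying in the worker:
spin.schedule('compute')        # expected to raise right away (e.g. a TypeError)
spin.schedule('compute', 2, 3)  # arguments bind to compute(a, b); enqueued as before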

0xDEC0DE commented 5 years ago

This makes a terrific amount of sense, and it seems like it should meet our needs almost exactly, but we'll put it through its paces and see for sure.

0xDEC0DE commented 5 years ago

Update: indeed, this branch behaves exactly like we would expect it to when given bad inputs. Excellent work!

The sooner this makes it into a release, the happier we'll be.

NicolasLM commented 5 years ago

Great, version 0.0.11 has been released.