botify-labs / simpleflow

Python library for dataflow programming.
https://botify-labs.github.com/simpleflow/
MIT License

Custom logic on retries #242

Open jbbarth opened 7 years ago

jbbarth commented 7 years ago

When a task fails, as of today we have a simple "retry" counter that allows it to be retried.

The first shortcoming in a cloud environment is that we (relatively) often lose machines, so we'd like to have more retries when a "heartbeat timeout" happens. This could be handled with a second retry counter for heartbeat timeouts, distinct from the first one. Note that this is not perfect at all, because heartbeat timeouts happen 1/ when we lose a process/machine, but also 2/ when an OOM occurs. In most cases we actually lose the process because the Linux oom-killer kills the faulty process and Python doesn't get a chance to raise an OSError we could catch. There's #239 for that fwiw.
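To make the "second retry counter" idea concrete, here's a minimal sketch of per-failure-kind retry budgets. The class name `RetryPolicy` and the `"heartbeat_timeout"` reason string are hypothetical, not simpleflow's actual API:

```python
# Hypothetical sketch, NOT simpleflow's real API: keep separate retry
# budgets for ordinary failures vs heartbeat timeouts, since a lost
# machine deserves more retries than a deterministic error.
class RetryPolicy:
    def __init__(self, max_failures=1, max_heartbeat_timeouts=5):
        self.max_failures = max_failures
        self.max_heartbeat_timeouts = max_heartbeat_timeouts
        self.failures = 0
        self.heartbeat_timeouts = 0

    def should_retry(self, reason):
        # Heartbeat timeouts consume their own (larger) budget.
        if reason == "heartbeat_timeout":
            self.heartbeat_timeouts += 1
            return self.heartbeat_timeouts <= self.max_heartbeat_timeouts
        # Everything else consumes the plain failure budget.
        self.failures += 1
        return self.failures <= self.max_failures
```

The point is only that the two counters are independent: exhausting the failure budget doesn't stop retries on heartbeat timeouts, and vice versa.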

A second shortcoming is that not all errors are the same:

In order to solve all this, I propose that either workflows or activity tasks provide a method/callable that is invoked when an error occurs, so that we get more control over the retry behaviour. I'm not sure yet whether it should live at the workflow or the activity task level; @ybastide @ampelmann your input is welcome on that.

If I stick with the workflow for now, the API could typically be something like:

class MyWorkflow(Workflow):
    def run(self, *args, **kwargs):
        # interesting things here
        pass

    def should_retry_activity(self, task, previous_tasks=None):
        # we could also pass the future but then it would be nice to have
        # access to the event directly in the future (?)
        if task["state"] == "timed_out":
            # ...
            pass
        elif task["state"] == "failed":
            if task["exception_class"] == "SocketError" and "reset by peer" in task["exception_message"]:
                return True
            if task["exception_class"] == "OperationalError" and "could not connect to server" in task["exception_message"]:
                # find a way to delay the execution? for now retry
                return True
        return False

You see the idea. A few notes:

After writing all this I realize there are a lot of ideas and weird corner cases we could think about. So maybe we should stay minimal in a first step.
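For illustration, here's roughly how the decider side could consult such a hook. The function name `handle_task_completion`, the task dict shape, and the fallback counter are all hypothetical, not simpleflow's actual internals:

```python
# Hypothetical decider-side sketch, NOT simpleflow's real internals.
def handle_task_completion(workflow, task, retries_left):
    # Only failed or timed-out tasks are candidates for a retry.
    if task["state"] in ("failed", "timed_out"):
        hook = getattr(workflow, "should_retry_activity", None)
        if hook is not None:
            # Custom logic wins if the workflow provides the hook.
            return bool(hook(task))
        # Otherwise fall back to the plain retry counter.
        return retries_left > 0
    # Task completed fine: nothing to retry.
    return False
```

One nice property of this shape is backward compatibility: workflows that don't define the hook keep today's counter-based behaviour unchanged.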

How do you see this, @ybastide @ampelmann? Maybe we can discuss it in front of a whiteboard later?

ybastide commented 7 years ago

First, a generic :+1: on this: we need a fine-grained retrying framework.

Poring over the details: