botify-labs / simpleflow

Python library for dataflow programming.
https://botify-labs.github.com/simpleflow/
MIT License

Better handling of OPEN_ACTIVITIES_LIMIT_EXCEEDED #68

Open jbbarth opened 8 years ago

jbbarth commented 8 years ago

SWF limits a workflow execution to 1000 open activity tasks (open == scheduled or started). simpleflow already has a guard against scheduling too many tasks, which lives here: https://github.com/botify-labs/simpleflow/blob/master/simpleflow/swf/executor.py#L264-L267, but it doesn't seem to work very well.

Because the guard doesn't work, activity tasks are constantly rescheduled even when SWF says "no". In the latest workflow where botify hit this limit, things went bad around event 9100, and simpleflow kept sending ScheduleActivityTask decisions. At some point the workflow reached 25k events and was terminated with:

    {
        "eventId": 25044,
        "eventTimestamp": 1446958847.47,
        "eventType": "WorkflowExecutionTerminated",
        "workflowExecutionTerminatedEventAttributes": {
            "cause": "EVENT_LIMIT_EXCEEDED",
            "childPolicy": "TERMINATE"
        }
    }
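To confirm this pattern in a history dump, one option is to count the rejected scheduling decisions by cause. This is only a sketch: it assumes the events are plain dicts shaped like the JSON above, and that rejections show up as ScheduleActivityTaskFailed events with a "cause" field under "scheduleActivityTaskFailedEventAttributes". The helper name count_schedule_failures is hypothetical and not part of simpleflow.

```python
from collections import Counter


def count_schedule_failures(events):
    """Count ScheduleActivityTaskFailed events in a history, grouped by cause.

    `events` is assumed to be a list of dicts shaped like SWF history JSON.
    """
    causes = Counter()
    for event in events:
        if event.get("eventType") != "ScheduleActivityTaskFailed":
            continue
        attrs = event.get("scheduleActivityTaskFailedEventAttributes", {})
        causes[attrs.get("cause", "UNKNOWN")] += 1
    return causes
```

A large count for OPEN_ACTIVITIES_LIMIT_EXCEEDED would suggest the decider keeps retrying decisions that SWF has already refused.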

Multiple options to solve this:

Options 1 and 2 are normal investigations.

Option 4 (child workflows) will be explored in the next few weeks or months.

We may also explore option 6 immediately because it should be easy to implement and would avoid problems down the road. Constantly flirting with the limits is not a good idea in practice.

Option 5 may be discussed internally; it is not relevant to simpleflow itself.

jbbarth commented 8 years ago

OK, here's what happened in our workflow (simplified):

    from simpleflow import Workflow

    # prepare_expensive_activity, foo and bar are activities defined elsewhere


    class OurWorkflow(Workflow):
        def run(self, **context):
            # a first activity that needs preparation (partitioning for instance)
            preparation = self.submit(prepare_expensive_activity)
            if preparation.finished:
                for i in range(400):
                    self.submit(foo)

            # many other activities
            for i in range(800):
                self.submit(bar)

            # wait for everybody to finish..
            # ...

Now here is what happens when replaying this workflow from the decider's point of view:

Moral of the story: we cannot rely on the open activity tasks counter unless we parse the whole workflow. We'd have to defer task submission until we know exactly all tasks and their statuses.

Even then, some constructs could make simpleflow fail very easily, for instance if a replay skips a conditional branch that was entered before and in which one or more tasks were submitted.

=> option 1 looks a bit complex and we won't be able to implement a fully reliable solution if people mess with future.finished conditionals.
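One way to picture the failure is a small simulation of a second decision pass. It assumes the counter is only incremented as the replay reaches each submit call, and that the 800 bar tasks are already open when the 400 foo tasks become eligible. Both are assumptions made for illustration, not a description of simpleflow's actual code. naive_replay and the task ids are hypothetical.

```python
HARD_LIMIT = 1000  # SWF cap on open activity tasks, as described in this issue


def naive_replay(program, already_open, limit=HARD_LIMIT):
    """Replay `program` (task ids in submission order) with a naive counter.

    The counter only learns about an open task when the replay reaches it.
    Returns (newly scheduled ids, total open tasks the decider ends up asking for).
    """
    seen_open = 0
    scheduled = []
    for task_id in program:
        if task_id in already_open:
            seen_open += 1  # discovered too late to affect earlier submits
        elif seen_open < limit:
            scheduled.append(task_id)
            seen_open += 1
    return scheduled, len(already_open) + len(scheduled)


foos = ["foo-%d" % i for i in range(400)]
bars = ["bar-%d" % i for i in range(800)]

# Second pass: the conditional is now entered, so the foos come first in the replay.
scheduled, requested_open = naive_replay(foos + bars, already_open=set(bars))
print(len(scheduled), requested_open)  # prints "400 1200"
```

Under these assumptions the counter reads 0 when the first foo is submitted, so all 400 are requested on top of the 800 open bars, which is 200 over the limit.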

jbbarth commented 8 years ago

Instead of relying on a naive counter, I think we could rely on an array of open activities (probably just their "activityId"), so we don't start from zero but from this list. When replaying the workflow:

This will protect against cases like the one described above (if done correctly). The next step is to plumb that together, unless somebody has a better idea.
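A minimal sketch of that idea, assuming the history is available as a list of dicts in SWF's JSON shape, where closing events point back to their ActivityTaskScheduled event through "scheduledEventId". The function names open_activity_ids and schedulable are hypothetical, and this is not the code that ended up in simpleflow.

```python
# Event types that close an activity task, mapped to their attribute keys.
CLOSING_EVENTS = {
    "ActivityTaskCompleted": "activityTaskCompletedEventAttributes",
    "ActivityTaskFailed": "activityTaskFailedEventAttributes",
    "ActivityTaskTimedOut": "activityTaskTimedOutEventAttributes",
    "ActivityTaskCanceled": "activityTaskCanceledEventAttributes",
}


def open_activity_ids(events):
    """Return the set of activityIds that are scheduled or started but not closed."""
    scheduled = {}  # eventId of ActivityTaskScheduled -> activityId
    for event in events:
        event_type = event["eventType"]
        if event_type == "ActivityTaskScheduled":
            attrs = event["activityTaskScheduledEventAttributes"]
            scheduled[event["eventId"]] = attrs["activityId"]
        elif event_type in CLOSING_EVENTS:
            attrs = event[CLOSING_EVENTS[event_type]]
            scheduled.pop(attrs["scheduledEventId"], None)
    return set(scheduled.values())


def schedulable(candidates, open_ids, limit=1000):
    """Pick the candidate activityIds that still fit under `limit`.

    The budget starts from the known open set instead of from zero.
    """
    budget = max(0, limit - len(open_ids))
    new_ids = [c for c in candidates if c not in open_ids]
    return new_ids[:budget]
```

Because the open set is computed from the whole history before any submit call is replayed, the budget no longer depends on the order in which the workflow code reaches its tasks.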

jbbarth commented 8 years ago

Option "6" (add a soft limit option) was added in https://github.com/botify-labs/simpleflow/commit/fd22d8d1b1e630e0251eab89a606fb730c6c25f6 and is available for simpleflow >= 0.10.2.