Open jbbarth opened 8 years ago
OK, here's what happened in our workflow (simplified):
from simpleflow import Workflow


class OurWorkflow(Workflow):
    def run(self, **context):
        # a first activity that needs preparation (partitioning for instance)
        preparation = self.submit(prepare_expensive_activity)
        if preparation.finished:
            for i in range(0, 400):
                self.submit(foo)
        # many other activities
        for i in range(0, 800):
            self.submit(bar)
        # wait for everybody to finish..
        # ...
Now what happens when playing this workflow from the decider point of view:
ScheduleActivityTask
(in fact multiple decisions because of the 100-decision-tasks limit, but anyway)

Moral of the story: we cannot rely on the open activity tasks counter unless we parse the whole workflow. We'd have to defer task submission until we know exactly all tasks and their status.
Even then, some constructs could make simpleflow fail very easily, for instance if a replay bypasses a conditional that was entered on an earlier pass and in which one or more tasks were submitted.
=> option 1 looks a bit complex and we won't be able to implement a fully reliable solution if people mess with future.finished conditionals.
Instead of relying on a naive counter, I think we could rely on an array of open activities (probably just their "activityId"), so we don't start from zero but from this list. When replaying the workflow:
This will protect against cases like the one described above (if done correctly). Now to plumb that together unless somebody has a better idea.
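To make the idea of tracking open activities by ID more concrete, here is a minimal sketch, not simpleflow's actual implementation. The `(event_type, activity_id)` tuple format and the names `open_activity_ids` and `can_schedule` are assumptions made for illustration. As far as I know, closing events in a real SWF history reference the scheduling event by event ID rather than carrying the activityId, so a real version would need that extra lookup.

```python
# Hypothetical, simplified history: a list of (event_type, activity_id) tuples.
# The event type names mirror SWF's, but this flat format is an assumption.
OPENING = {"ActivityTaskScheduled"}
CLOSING = {
    "ActivityTaskCompleted",
    "ActivityTaskFailed",
    "ActivityTaskTimedOut",
    "ActivityTaskCanceled",
}


def open_activity_ids(history):
    """Return the IDs of activities that were scheduled but have not closed yet."""
    open_ids = set()
    for event_type, activity_id in history:
        if event_type in OPENING:
            open_ids.add(activity_id)
        elif event_type in CLOSING:
            open_ids.discard(activity_id)
        # Other events (e.g. ActivityTaskStarted) leave the task open.
    return open_ids


def can_schedule(activity_id, open_ids, limit=1000):
    """Allow scheduling only if the task is not already open and there is room."""
    return activity_id not in open_ids and len(open_ids) < limit
```

Because the set is built from the history before the workflow code is replayed, the check does not depend on the order in which `submit` calls happen during a given replay.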
Option "6" (add a soft limit option) was added in https://github.com/botify-labs/simpleflow/commit/fd22d8d1b1e630e0251eab89a606fb730c6c25f6 and is available for simpleflow >= 0.10.2.
There's a limit on SWF: you cannot have more than 1000 tasks open (== scheduled or started). simpleflow already has a protection against scheduling too many tasks, it lives here: https://github.com/botify-labs/simpleflow/blob/master/simpleflow/swf/executor.py#L264-L267, but it seems it doesn't work very well.
As it doesn't work, activity tasks are constantly rescheduled even if SWF says "no". In the latest workflow where botify had this limit reached, things became bad around event 9100, and simpleflow continued to send ScheduleActivityTask decisions. At some point the workflow reached 25k events and it broke with:

Multiple options to solve this:
ScheduleActivityTaskFailed event with cause = OPEN_ACTIVITIES_LIMIT_EXCEEDED
ScheduleToStart and ScheduleToClose clocks start running earlier, and you know that with your current platform you cannot honor those tasks.

Options 1 and 2 are normal investigations.
Option 4 (child workflows) will eventually be explored in the next few weeks/months.
We may also explore option 6 immediately because it should be easy to implement and would avoid problems down the road. Constantly flirting with the limits is not a good idea in practice.
Option 5 may be discussed internally, as it is not relevant to simpleflow's interests.
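For option 6 (a soft limit), the general idea could look like the sketch below. This is an illustration only, not the option that simpleflow eventually shipped: the function name `schedulable_batch`, its parameters, and the default of 800 are all assumptions. Only the hard limit of 1000 comes from the description above.

```python
HARD_LIMIT = 1000  # SWF limit on open activity tasks, as stated in this issue


def schedulable_batch(pending, open_count, soft_limit=800):
    """Split pending task IDs into (schedule now, defer to a later decision).

    soft_limit is a hypothetical, user-chosen cap kept below HARD_LIMIT so
    that the workflow never flirts with the SWF limit itself.
    """
    if not 0 < soft_limit <= HARD_LIMIT:
        raise ValueError("soft_limit must be in (0, HARD_LIMIT]")
    room = max(0, soft_limit - open_count)
    return pending[:room], pending[room:]
```

Deferred tasks are simply not scheduled in the current decision; they would be picked up on a later replay once some open tasks have closed.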