elsa-workflows / elsa-core

A .NET workflows library
https://v3.elsaworkflows.io/
MIT License
6.35k stars 1.17k forks source link

Proposal: Parallel Processing of Branches #2872

Closed sfmskywalker closed 1 year ago

sfmskywalker commented 2 years ago

TLDR;

Add a "multi-track" stack implementation so that the activity processor can process multiple branches in an alternating fashion. This enables scenarios where a forked workflow can have multiple branches execute in an alternating fashion without running the risk

Activity processor: the looping construct that processes the scheduled activities stack on the workflow execution context.

Background

The current implementation of the activity processor can only process one branch at a time. For example, if you have a Fork activity with 3 branches, it processed the first branch until a blocking activity is encountered. Then it processes the next branch, and so on.

For most scenarios this is no problem, but there are scenarios where a given branch might continuously push new activities to the stack, never giving other branches a chance to execute.

To fix this, we could implement a new structure for the stack where the stack contains multiple "tracks" - each track representing a branch.

The activity processor would then process one activity per track at a time. I.e. once an activity executed, the processer selects the next track and pops the next activity from there.

This way, multiple branches always get a chance to execute, even when one or more branches are in a tight loop or some other construct causing that branch to "hog" the activity processor.

FransVanEk commented 2 years ago

This comment is also related to the discussion regarding the Join implementation.

As mentioned in this issue, having an alternating branch activity is a good starting point. The current branch execution implementation would continue until it reaches a blocking activity or the end of a branch. It is skipping waiting for merging branches.

The issue of parallel execution is evident. The execution order for the activities is unclear.

Context

So, one activity can connect to multiple activities. This multiplicity is called a fork. In the case of a fork, the expectation is to execute both activities. But, when the two branches merge, the question will be how the execution continues. For example, will the engine run the activity that joins the branches once or multiple times?

Using an explicit 'fork' and 'join' will make it more straightforward for the execution, but it will clutter the process with technical details. However, this is best practice in BPMN, and it makes the intent and the working of the process more straightforward. Furthermore, it will demand more configuration when implementing a process.

The Fork is not a real issue. On a fork, it schedules all related activities for execution. The execution of activities should be like a list of activities due to being executed. The implementation abandons the branch-by-branch execution.

Join Cases

Continue always

Whenever a related activity is finalized, the Join will continue. This behavior could be a default algorithm when the activities are connected directly to an activity.

Wait for any once

When two or more activities merge, the Join will continue when the first related activity is finalized. When the second activity completes, the Join will not continue. Each related activity needs to count the number of times it executes. The Join will only continue when the activity has a higher execution count than any related activity.

Wait for All

When two or more activities merge on a Join, the Join will only continue when all activities have at least the same execution counts. Some could have a higher count. This count is valid on the next iteration.

Complex

This Join has a formula definition to determine when to execute. A formula is not the best practice in process modeling since it doesn't provide visual guidance on the execution logic of the process. Besides, the same result is achieved with multiple Joins or If statements.

Scenario analysis:

A process can hold a signal event. This signal event opens the possibility of having multiple activations in a workflow. When this activation leads to a Fork, the Fork activates numerous activities. To have the process execute these activities in the same context, the process should start a subprocess that acts within that context. Forking and merging with context make more sense in the context of the subprocess instead of trying it to work in just one central process. It avoids the need to keep track of the initiative of the Fork execution to relate activities back together on a join.

FransVanEk commented 2 years ago

as an addition and a slight change to what was proposed above:

In the proposal, counting the number of executions is stated as the responsibility of the activity itself. By doing so, the activity has more than one responsibility. It would make more sense to have the join count the number of activations per connecting activity.

this is in line with the responsibility of the join and limits the coupling within the system.

The activation of the next activity should be handing over something like a token. A token is a carrier of data that can be used to know which activity is handing over the data. Each activity can add additional info to the token by making the token extensible with custom properties. This allows the implementation of a complex Join later on much easier.

the extensibility of the token is not required from the start but making it possible will make the implementation easy later on

mohdali commented 2 years ago

I think the counter approach will face a peoblem when we have loops, because the loop can schedule the joined activity an unknown number of times.

I was thinking that instead of a counter we can keep a flag that monitors the activity execution. With the condition that any activity that gets scheduled needs to clear the flag for all dependent activities. When the dependent activity gets executed it will check again for the flag and skip if already marked as executed.

We might be able to avoid an explicit join by making implicit join always as wait any. While an explicit join forces waiting for all branches.

We also need to consider that each time activity is executed it needs to have a unique id for proper tracing of loop execution. This might not be directly related but could make it clear to differnetiate between the executuon flag and the instance id.

tomy2105 commented 2 years ago

Hi everyone, @sfmskywalker, @mohdali, @FransVanEk, sorry for being late to the party, was listening to weekly recording and heard you discussing about this so decided to speak my 2 cents. If the decisions has already been made and I'm late feel free to disregard my rant :).

In addition I think, in weekly video, I briefly saw a Discord channel I don't have access to with some discussion. If I repeat something from that again, disregard.....

Explicit/Implicit join

First, I'm all for explicit Join and Fork activities (not implicit ones with just linking one activity outcome to multiple ones) for several reasons (and avoiding clutter is the only reason I have for implicit ones):

How many times are activities after join executed?

Second, I've heard discussion about whether an activity after (implicit) join would execute once or multiple times (maybe I understood wrong) and I am strongly opinionated it should be only once (Join logic determines when, but no how many times). As a programmer I can understand some deeply programming mind can expect multiple times but ordinary workflow users don't. So, for a workflow in "figure" below (A - any activity, F - Fork, J - Join of fork with same number)

            | -- A2 -- A3 -- |
A1 -- F1 -- | -- A4 -- A5 -- | -- J1 -- A8
            | -- A6 -- A7 -- |

A8 is executed only once. I've never seen (not saying it doesn't exist) any workflow engine that would do it multiple times. And I can positively say that users of workflow engine(s) my company works with would be freaked out if it happened multiple times. If J1is Join all then it is after all of A2 to A7 are executed. If J1 is Join any then it can be executed as soon as any pair of A2,A3 or A4,A5 or A6, A7 is executed.

What is expected behavior for branch activities after Join any?

Third, the interesting question here for Join any is what happens with the rest of not executed activities in branches when join is executed. Are they allowed to execute (up to Join and in "parallel" to A8 and what is behind it) or not? For example if A4 and A6 were blocking, and A2, A3 finished and J1 finished and A8 continues and after that A4 and A6 unblock, what happens with A5 and A7? Are they executed or not? @sfmskywalker What is the current Elsa 2 behavior for this?

Branched activities execution order

Branches activities execution order is, kind of, "treeish" traversal. It can be Depth first or Breadth first. Depth first is what I understand Elsa 2 is doing, so (with Join all and no blocking activities) for workflow above it would be: A1, F1, A2, A3, A4, A5, A6, A7, J1, A8. It suffers from branch exhaustion possibility if a loop of non blocking activities is on the first branch. Would probably be better if it was breadth first so for workflow above, A1, F1, A2, A4, A6, A3, A5, A7, J1, A8. Or if that is too complex to implement, maybe a simple counter (system wide option to change its max value) that determines maximum number of non blocking activities a workflow can run before Faulting might help (and would help with preventing user error rogue endless loop workflows eating CPU time too).

True parallel execution

Current Elsa 2 implements executing one Activity of a single workflow instance at a time. Although there is a Fork and a Join the branches are not truly parallel (they could be breath first inter-weaved but not truly parallel). I can see many use cases for wanting true parallelism having multiple activities executed at the same time (especially if one has long running activities). Unfortunately doing it opens a Pandora's of box parallel programming/race condition problems like:

To cover all use cases would probably need some kind of lock/mutex activities, workflow instance state (variables and all) divided into parts so each can be stored individually to database, and God know what else :). I think we could all safely say it would very complex undertaking and I think should be postponed for now.

A possible, in the middle solution, for tackling long running activities problem (battle tested in production for many years) would be (simplified) to divide activity execution in 3 parts:

  1. Evaluate input parameters (including all the Javascript with setvariable, getvariable)
  2. Execute actual activity logic
  3. Set output parameters, determine outcome, schedule next activity, store state etc....

At any given time only one of steps 1. and 3. of activity an activity inside a workflow instance is allowed to execute (so saving us the problem of parallelism for setting getting variables, output parameters, db storing, etc...... Steps 2. can be done in own thread(s) in true parallel (if activity is long running and written to do so, many of them are too simple to bother) My 2 cents on true parallelism, hope it helps....

Switch(es) with multiple cases executed

I can fully agree that for Elsa 3 Switch (which has embedded activities for each case) could be useful to enable execution of all cases where condition evaluates to true (with an input property which enables/disables this behavior). Just, maybe, it should be called something else.... Like Conditional Sequence, I know, too long but makes clearer the intent (execute branches whose condition is true in sequence).

However I would leave Flowchart Switch to go to only the first outcome that had true. Otherwise this activity would become a hidden Fork or something and that is probably not the way to go. If you want to continue executing other branches after this one connect explicitly by hand directly, preceded with and Flow Decision with (sigh repeated condition) if needed . To draw a parallel with C# (as someone else did in weekly video) you need an explicit goto in C# switch to go other branch (unless your current branch is empty)..

Ok, if anyone has read this far I thank you very much for your time and patience...... :)

FransVanEk commented 2 years ago

@tomy2105 Thank you so much for the elaborate response on the subject of parallel execution.

Elsa in general:

Elsa provides a workflow execution platform to enable developers to run (long-running) processes. Elsa is an open-source suite of .NET Standard libraries and tools that enable developers to implement long-running workflows. Elsa's core philosophy is to connect small executable units, allowing you to orchestrate business processes such as document approval, customer onboarding, and order fulfillment. Elsa implements a workflow engine within your application's process or as an external workflow service with which your applications interact. Furthermore, workflows support strongly typed classes, JSON/YAML files, or using an HTML5 web-based designer. The construction of Elsa Core supports extensibility by using an interface and a modular approach.

Business processes.

Processes and the process notation vary a lot with standards and best practices. Elsa is not choosing a notation over another or enforcing a best practice. The primary responsibility for Elsa is to provide a consistent, reliable execution of the processes and to support the preference of the implementation of business process analysts. With that in mind, the discussion of the parallel execution is sometimes a bit abstract but mainly focuses on discovering the possible perspectives on the expected behavior of the process execution. Therefore your comment is much appreciated and helps us a lot.

Explicit/Implicit Join

Please take a look at a PoC regarding the execution. It is similar to your suggestions but implemented with a flow context approach. FransVanEk/ElsaParallel (github.com)

How many times are activities after joining executed?

After a Join, execution should be once. For a lot of process implementations, that is a valid use case. Other implementations need different behavior. Again, it is not up to Elsa to decide this; Elsa should be able to support both of them. Having a configurable Join with explicit and clear behavior is mandatory. We have encountered different behaviors, and probably the developers that implemented them would argue differently. For example, having an explicit join for Join all, Join any, Join always lets the process designer decide instead of Elsa Enforcing it.

Branched activities execution order

Indeed this is the core of the discussion. We are trying to steer away from having it as a depth-first branch. Again In the PoC, you can see the proposal of the execution. FransVanEk/ElsaParallel (github.com)

True parallel execution

You described it perfectly. Actual parallel execution comes with a lot of challenges. Many workflow engines don't support it and state that a workflow instance can only be executed on one thread to avoid timing and locking issues.

Switch(es) with multiple cases executed

Fair point, but again it should not be dictated by Elsa. If the process designer wants to have multiple conditions that are valid and, by doing so, create a hidden fork, Elsa should allow it. The concerns you are addressing are valid, and they should be addressed. With Elsa's intention not to require a particular way of process modeling, the necessity to document the behavior is obvious, and making it evident within the designer tool is mandatory. But also, to be able to test the flow early and quickly during the design process is a great way to help people design their processes the way they want them to behave.

tomy2105 commented 2 years ago

@FransVanEk to cut the story short...... My examples were pointed in direction on what I think would be reasonable choice to choose if Elsa cannot afford (due to complexity, developer time, maintainance time, etc...) to have configurable support for all use cases.

And @sfmskywalker has heard enough from me already that he know this. I'd rather see Elsa 3 released sooner with new designer and some other features it already has but without support for fork/join at all than have Elsa 3 later but with "allmighty" fork/join :) :) :).... Let alone my fears that "allmighty" fork/join and all the possibilities will be a nightmare to maintain and will break often with changes in new versions :).

That being said, if we are up to the challenge and Elsa can be made to support all use case, there will be nobody happier than me but I "insist" on extremely verbose configuration support in this case so that both system admin and workflow designer (if system admin allows it) can configure behavior. Just to illustrate. If you have support for both explicit and implicit join I would need configuration that will allow me on system level to disable implicit join, or disable explicit, or disable both (so workflow designer cannot even see one, other or both possibilities). If you introduce so rich pletora of possible behavior on Join again I would need configuration that will allow me on system level to disable each of the behavior types (so workflow designer sees only the ones which are enabled).

Again to cut "my wishes" short as a developer I'm all in for the Elsa can do all train. However this introduces complexity that might not be suitable for workflow designers who are not so developer oriented. That is why I'd need system configuration that would allow me to bring down complexity workflow designer sees by limiting/hiding parts of the functionality Elsa provides.

mohdali commented 2 years ago

For me the main point was that currently Elsa 3 designer allows a single activity to branch to multiple subsequent activities, which I thought didn't have a clear interpretation of the execution sequence, also it wasn't clear what happens when the activities are joined together later.

This indeed sparked the discussion about explicit and implicit joins and scenarios where parallel execution maybe desired (and I wish we have true parallel execution at some point). I think @FransVanEk proposal and the demo provided is really good and helps understand the possible scenarios. I think the context token could resolve many problems.

I too don't want Elsa 3 to be held or delayed because of trying to implement parallel execution. However, I wish we have some roadmap of how it is going to be implemented, probably as Elsa 3.1 or 3.2. Probably not as a full true parallel execution but as async jobs like the background tasks idea that @sfmskywalker presented.

Also, we may need to put some restrictions in the designer regarding what is allowed behavior and what could have unexpected results. Visual indicators and validation is something we need to think about I believe.

sfmskywalker commented 2 years ago

@tomy2105

Third, the interesting question here for Join any is what happens with the rest of not executed activities in branches when join is executed. Are they allowed to execute (up to Join and in "parallel" to A8 and what is behind it) or not? For example if A4 and A6 were blocking, and A2, A3 finished and J1 finished and A8 continues and after that A4 and A6 unblock, what happens with A5 and A7? Are they executed or not? @sfmskywalker What is the current Elsa 2 behavior for this?

In Elsa 2, the Join activity (when it determined it should proceed on to the next activity) will clear any & all blocking activities between the Fork and the Join. This enables scenarios where you have e.g. two branches: one that times out after 5 minutes and one that waits for a user action. When one or the other event occurs, the other event should be canceled.

sfmskywalker commented 2 years ago

@tomy2105

True parallel execution

What you described here makes perfect sense and is in line with what I'm thinking to go with. Right now, we can implement "jobs" that run in the background and can be put on a workflow as activities, but the next step would be to basically configure activities to run asynchronously (i.e. in the background) or synchronously (the default)

This way, we have no issues with concurrent updates or race conditions, while at the same time allowing activities to truly execute in parallel.

sfmskywalker commented 2 years ago

Thanks everyone for the wonderful input provided here!

Everything considered, I decided to implement support for both implicit as well as explicit forks and joins (although I still have an explicit FlowFork activity on my list to add).

The beauty of implicit forks and joins is clear: cleaner workflows. Implicit joins employ the "wait all" strategy. If you need a "wait any" join, one simply has to use the new FlowJoin activity and configure it as such.

I think this brings the best of all worlds together, indeed leaving it up to the process implementor as @FransVanEk described. @FransVanEk I took a lot of your thinking and your awesome experiment to implement the joining logic based on counters - many thanks for this!

@tomy2105 Although I haven't implemented allowing turning off implicit forking, but this should be relatively straightforward to do. Probably something to do for a rainy day.

The result can be found in this PR.

Although this effort doesn't change the implementation from depth-first to breadth-first, this is something I'll look into next. In my opening post of this thread prop, I proposed a "multi-track" stack, but I'm gonna try and see if we could do something even simpler by replacing the stack with a queue.

The stack is what causes the "depth-first" effect, because of its last-in-first-out nature. A queue being a first-in-first-out structure changes this into a breadth-first situation.

I need to do some experiments to see if there are any unexpected effects changing this. In Elsa 2, the queue doesn't work because the engine relies on the depth-first pattern. Elsa 3 on the other hand relies on activities signaling when they are done before the next one gets scheduled, so here a queue might just work straight away. We will see :)

tomy2105 commented 2 years ago

Thank you again for great job :)!

@tomy2105 Although I haven't implemented allowing turning off implicit forking, but this should be relatively straightforward to do. Probably something to do for a rainy day.

:( :( :( :(