Workflow Execution Recovery

sfmskywalker commented 9 months ago

Overview: In the event of an application crash, currently running workflow executions are lost. There's a critical need for a reliable method to restart these abruptly interrupted workflows. This feature introduces a robust solution to address this challenge.

Key Mechanism: The cornerstone of this feature is the utilisation of the workflow instance's Status field. Under normal operations, a workflow instance transitions through various states like "Finished", "Suspended", or "Faulted". However, if an instance is unexpectedly terminated due to an application crash, its Status remains as "Running". This state indicates that the workflow was active at the time of the crash and did not conclude naturally.
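
To make the idea concrete, here is a minimal sketch of the detection step. It assumes a single application node and a simplified status model; `WorkflowStatus`, `WorkflowInstance` and `InterruptedWorkflowDetector` are hypothetical names used for illustration only, not the actual Elsa types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified model for illustration; not the actual Elsa types.
public enum WorkflowStatus { Running, Suspended, Finished, Faulted }

public record WorkflowInstance(string Id, WorkflowStatus Status);

public static class InterruptedWorkflowDetector
{
    // Intended to be called at startup, before any new workflow is started.
    // Assuming a single node, nothing can legitimately be executing at that
    // point, so an instance still marked Running was cut off by the crash.
    public static IReadOnlyList<WorkflowInstance> FindInterrupted(IEnumerable<WorkflowInstance> instances) =>
        instances.Where(i => i.Status == WorkflowStatus.Running).ToList();
}
```

With more than one node, a "Running" status alone would not be enough to tell a crashed instance from one that is executing elsewhere, so this sketch would need an extra signal in that setup.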

Terminology: To ensure clarity and avoid confusion with existing processes, we introduce the term "Restart" for this feature. This term is distinct from "Resume", which is already used for continuing suspended workflows. "Restart" specifically refers to the process of restarting workflows that were actively running and got interrupted due to an application crash.

Restarting Methods: This feature encompasses two primary methods for restarting interrupted workflows:

Alteration-Based Recovery:
- This method allows for the manual initiation of the recovery process.
- It involves altering specific parameters or settings to trigger the restart of the interrupted workflow, for example supplying additional input or choosing whether to run the workflow synchronously or asynchronously.
- This option provides control and flexibility, particularly useful in scenarios where selective recovery is needed.
Automatic Recovery During Application Startup:
- This is an automated approach designed to streamline the recovery process.
- Upon application restart, the system automatically scans for workflows that still have a "Running" status but were actually halted by the crash.
- These identified workflows are then automatically recovered, ensuring minimal disruption and swift continuation of business processes.
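
The two methods could be sketched roughly as follows. This is only an illustration under stated assumptions: `Status`, `Instance`, `RestartOptions` and `RecoveryService` are hypothetical names, the store is an in-memory list, and the actual restart logic is passed in as a delegate because its real shape has not been designed yet.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// All types below are hypothetical stand-ins for illustration; not the Elsa API.
public enum Status { Running, Suspended, Finished, Faulted }

public record Instance(string Id, Status Status);

// Parameters an operator may alter when restarting manually.
public class RestartOptions
{
    public Dictionary<string, object> Input { get; init; } = new();
    public bool RunAsynchronously { get; init; }
}

public class RecoveryService
{
    private readonly List<Instance> _store;
    private readonly Func<Instance, RestartOptions, Task> _restart;

    public RecoveryService(List<Instance> store, Func<Instance, RestartOptions, Task> restart)
    {
        _store = store;
        _restart = restart;
    }

    // Method 1: manual restart of one selected instance with altered options.
    // Returns false when the instance is unknown or is not marked Running.
    public async Task<bool> RestartAsync(string instanceId, RestartOptions options)
    {
        var instance = _store.FirstOrDefault(i => i.Id == instanceId && i.Status == Status.Running);
        if (instance is null) return false;
        await _restart(instance, options);
        return true;
    }

    // Method 2: automatic recovery, intended to run once during application startup.
    // Restarts every instance still marked Running and returns how many were found.
    public async Task<int> RecoverAllAsync()
    {
        var interrupted = _store.Where(i => i.Status == Status.Running).ToList();
        foreach (var instance in interrupted)
            await _restart(instance, new RestartOptions());
        return interrupted.Count;
    }
}
```

In this sketch both paths share the same restart delegate, so the manual path differs only in selecting a single instance and accepting caller-supplied options.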

Conclusion: This feature is a significant step towards enhancing the resilience and reliability of our workflow management. By accurately identifying and efficiently restarting interrupted workflows, we ensure continuity and reduce the impact of unexpected application crashes.

Tasks

[ ] #4831
[ ] #4832
[ ] #4835

bbenameur commented 9 months ago

Hello @sfmskywalker, I agree with what you are proposing. In fact, from my tests, I think that resuming workflows that were suspended by a service restart or a service crash, for example, is not possible with what exists on the main branch and v3, because nothing is saved to the database before the workflow or orchestration ends (I tested with MongoDB and SQL Server). I mean that the payload and all the workflow data sent only become available in the database after the workflow has finished (finished or failed...).

Example, based on this sample: https://github.com/elsa-workflows/elsa-core/blob/main/src/bundles/Elsa.Server.Web/Endpoints/DynamicWorkflows/Post/Endpoint.cs


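    // Enclosing endpoint class omitted; workflowRegistry and workflowRuntime are assumed to be injected services.
    public override async Task HandleAsync(CancellationToken ct)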
    {
        var workflow = new Workflow
        {
            Identity = new WorkflowIdentity("DynamicWorkflow1", 1, "DynamicWorkflow1:v1"),
            Root = new Sequence
            {
                Activities =
                {
                    new OneActivity
                    {
                        ServiceName = new Input<string>("ServiceOne"),
                    },
                    new OneActivity
                    {
                        ServiceName = new Input<string>("ServiceTwo"),
                    }
                }
            }
        };

        await workflowRegistry.RegisterAsync(workflow, ct);
        await workflowRuntime.StartWorkflowAsync("DynamicWorkflow1", new StartWorkflowRuntimeOptions());
    }
}

public class OneActivity : CodeActivity
{
    public required Input<string> ServiceName { get; set; }

    protected override async ValueTask ExecuteAsync(ActivityExecutionContext context)
    {
        // Breakpoint target: stop the debugger here when the second activity executes.
        var serviceName = ServiceName.Get(context);
    }
}

Just put a breakpoint in OneActivity and stop debugging at the second activity call. When checking the database collection, there is no data related to our orchestration, no payload, etc. I tested this with many implementations.

sfmskywalker commented 9 months ago

Thanks for the input @bbenameur, we can use your test case to verify the feature proposed here 👍🏻

hsnsalhi commented 8 months ago

Hello @sfmskywalker, I noticed that this feature, which is quite critical for our requirements, has been omitted from Elsa 3.1. Could you please let me know if there is an anticipated timeline for its reintroduction? Thank you

sfmskywalker commented 8 months ago

Hi @hsnsalhi, indeed, unfortunately we will not be able to include this capability in time for the 3.1 release, which is slated for this month. It will be picked up shortly thereafter, which means it will be included with 3.2, which will be released in June. And, as always, the feature will be part of the normal preview builds once it's available. Sorry for the delay on this one.

rosca-sabina commented 4 months ago

Hi! Is there any ETA for this feature? I see it's been moved to the 3.3 milestone.

sfmskywalker commented 4 months ago

Hi @rosca-sabina, unfortunately this feature has been pushed back again due to other priorities. It's unknown at this point when it can be picked up.

edward-yuen-tfs commented 3 months ago

If you do it on 3.4, will it be done by end of year?

sfmskywalker commented 1 month ago

It depends on the situation. Features driven by customer requests are typically implemented quickly, while other features are developed more organically, making them harder to plan for.

elsa-workflows / elsa-core

Workflow Execution Recovery #4833