camunda / camunda-bpm-platform

Flexible framework for workflow and decision automation with BPMN and DMN. Integration with Quarkus, Spring, Spring Boot, CDI.
https://camunda.com/
Apache License 2.0

Camunda 7 scripting - race conditions across process-instances while setting a variable #4449

Open DumboJetEngine opened 1 week ago

DumboJetEngine commented 1 week ago

Environment (Required on creation)

(I have customized pretty much nothing on Camunda. As far as I understand, it uses the default H2 database. Isn't that database a valid production candidate?)

Description (Required on creation; please attach any relevant screenshots, stacktraces, log files, etc. to the ticket)

I have a Camunda 7 workflow with a sub-process that accesses variables of the parent process. Specifically, the parent does this in a script task (Groovy code):

class Result {
    Boolean canProceed
    String errorMessage
}

Init();

def Init()
{
    def result = execution.hasVariable("result");
    if(result == false)
    {
        result = new Result();
//        execution.removeVariable("result");
        execution.setVariable("result", result);
    }
}

def setError(errorMessage)
{
    Init();
    def result = execution.getVariable("result");
    result.canProceed = false;
    result.errorMessage = errorMessage;
}

def setSuccess()
{
    Init();
    def result = execution.getVariable("result");
    result.canProceed = true;
    result.errorMessage = null;
}

And the sub-process calls the setSuccess() and setError("sth") functions.

The workflow does not contain any user tasks, so once I start a process instance, it runs to completion and is destroyed after I get the result variables back.

Everything works fine when I call the workflow one call at a time. But when I bombard it with parallel calls (each one creating a new process instance), I get weird errors revolving around variables.

This is the C# code that calls the workflow in parallel (using the latest Camunda.Api.Client NuGet package):

var camunda = CamundaClient.Create("http://localhost:8080/engine-rest");
var pd = camunda.ProcessDefinitions.ByKey("name");

var actionNames = new string[] {
    "delete",
    "delete",
    ...
    ...
};

var repetition = 0;
var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = 10 };
await Parallel.ForEachAsync(actionNames, parallelOptions, async (actionName, ct) => {
    var result = await PerformAction(actionName: actionName);
});

async Task<Result> PerformAction(string actionName)
{
    var businessKey = $"temp-id:{Guid.NewGuid()}";

    var camundaResult = await pd.StartProcessInstance(new Camunda.Api.Client.ProcessDefinition.StartProcessInstance
    {
        BusinessKey = businessKey,
        Variables =
        {
            { "action", VariableValue.FromObject(new { name = actionName }) },
        },
        WithVariablesInReturn = true,
    });

    var processResultJson = camundaResult.Variables["result"]?.Value as string;
    var result = JsonSerializer.Deserialize<Result>(processResultJson, new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
    return result;
}

When MaxDegreeOfParallelism is bigger than 1, I get all kinds of unexpected errors, like:

Here are some stack traces (they were too long to post here): https://mega.nz/file/cR0VSLDA#W1f_a4Xxs6OI0hREfAc1SBwx0nzvDrQb3Jn1qPDoGpE

And here is the workflow file: https://mega.nz/file/NM1kDDxK#zsiqoi-7meHYLV4cYl7Qa9CAmQdtyFAW3aMf4cNAKA0

Steps to reproduce (Required on creation)

Observed Behavior (Required on creation)

With MaxDegreeOfParallelism = 10, various errors related to setting or getting variables are thrown by the execution engine.

Expected behavior (Required on creation)

No errors, regardless of the degree of parallelism, since a process instance is supposed to be isolated from other process instances.

yanavasileva commented 3 days ago

Hi @DumboJetEngine,

Thank you for your interest in our product.

add the Camunda.Api.Client nuget package to the project

  1. Is this Camunda client generated from our OpenAPI specification? I found https://www.nuget.org/packages/Camunda.Api.Client, which doesn't seem to have been updated since 2020, and I don't think it is compatible with Camunda 7.21. Further, we don't provide support for clients created by third-party tools.

  2. Would it be possible to simplify the project so that it does not use the NuGet client, and share an end-to-end minimal example that reproduces the issue? For that, you could use the https://github.com/camunda/camunda-engine-unittest template.

  3. Could you try to run your scenario with asynchronous continuation enabled on the Activity_Initialize task and check whether the errors persist?

In case you need to upload more data relevant to the investigation of the issue, please create a simple repository or upload files as gist in GitHub. Thank you in advance for that.

Best, Yana

DumboJetEngine commented 3 days ago

Hello.

  1. The client supposedly supports Camunda 7 (see here). I am not sure if any breaking changes were added in version 7.2* of Camunda. I have tried looking into your API to see if any fields the API accepts were missing from the calls (mainly here), but I saw no field relevant enough to cause a race condition. And everything works fine when there is no concurrency.
  2. It's been a while since I last touched Java, so I don't think it will be very feasible for me to use this unit test template. :( Downloading all the tools to build this template project and replicating the logic might be doable after some struggle, but I have no idea how to execute things in parallel in Java.
  3. When using asynchronous continuations on the "Initialize" block, nothing changes. It still works with one thread, and fails the same way with 10 threads. However, I get no result variable back with this enabled. I am not sure why this is. I am new to Camunda, and I had the impression that this option only persists the current state and that it should not affect the workflow result-variables in any way if no errors occur.
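
On the "in parallel in Java" part of point 2: a minimal sketch, assuming only the JDK, of how a bounded degree of parallelism (the rough equivalent of MaxDegreeOfParallelism in the C# code) could look. `ParallelRunner` and `runAll` are hypothetical names chosen for illustration and are not part of the unit test template or of any Camunda API. The lambda in `main` is only a stand-in for whatever actually starts a process instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of Camunda or the unit test template.
final class ParallelRunner {

    /**
     * Runs the action once per input on a fixed pool of maxParallelism threads
     * and returns the results in the same order as the inputs.
     */
    public static <T, R> List<R> runAll(List<T> inputs, int maxParallelism,
                                        Function<T, R> action) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallelism);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (T input : inputs) {
                tasks.add(() -> action.apply(input));
            }
            List<R> results = new ArrayList<>();
            // invokeAll blocks until every task has finished and keeps task order.
            for (Future<R> future : pool.invokeAll(tasks)) {
                // get() rethrows a failure inside a task as an ExecutionException.
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> actionNames = List.of("delete", "delete", "delete");
        // Stand-in action; a real test would start a process instance here.
        List<String> results = runAll(actionNames, 10, name -> "started:" + name);
        System.out.println(results);
    }
}
```

Calling `runAll` once with a parallelism of 1 and once with 10 would mirror the two settings compared in this issue.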

I might try to hit your API without this client, to see if that changes anything, but I honestly don't think it will.

yanavasileva commented 3 days ago

I might try to hit your API without this client, to see if that changes anything, but I honestly don't think it will.

That might be the case; I just wanted to lay out all of the options. If you manage to create a standalone reproducible example, it will speed up reproducing the bug and analyzing it.

DumboJetEngine commented 6 hours ago

Hello again. I've used a simple HTTP client to reproduce the problem this time: https://gist.github.com/DumboJetEngine/7bcdeccc222d4339fe70bc008f56f652 Test with MaxDegreeOfParallelism = 1 and MaxDegreeOfParallelism = 10 to see the difference.
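
For readers who prefer the JVM, the request that a plain HTTP client has to send can be sketched in Java as well. This is not the code from the gist. It only builds a POST to the engine's `/process-definition/key/{key}/start` endpoint with `withVariablesInReturn` set, mirroring the C# call earlier in this issue. `StartRequestSketch`, `startBody` and `startRequest` are hypothetical names, and sending `action` as a `Json`-typed variable is an assumption; the gist and the original client may encode it differently.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.UUID;

// Hypothetical sketch; it builds the request but does not send it.
final class StartRequestSketch {

    /** Builds the JSON body for starting one process instance. */
    public static String startBody(String businessKey, String actionName) {
        // Assumption: "action" is passed as a Json-typed variable whose value is
        // a JSON string. actionName and businessKey are assumed to contain no
        // characters that would need JSON escaping.
        String actionJson = "{\\\"name\\\":\\\"" + actionName + "\\\"}";
        return "{"
            + "\"businessKey\":\"" + businessKey + "\","
            + "\"variables\":{\"action\":{\"value\":\"" + actionJson + "\",\"type\":\"Json\"}},"
            + "\"withVariablesInReturn\":true"
            + "}";
    }

    /** Builds the POST request against the start-by-key endpoint. */
    public static HttpRequest startRequest(String baseUrl, String processKey, String body) {
        return HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/process-definition/key/" + processKey + "/start"))
            .timeout(Duration.ofSeconds(30))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        String body = startBody("temp-id:" + UUID.randomUUID(), "delete");
        HttpRequest request = startRequest("http://localhost:8080/engine-rest", "name", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
        // Sending it requires a running engine, for example:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Each call gets its own business key, so every request starts a separate process instance, as in the C# version.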

Here is the bpmn file I've used: https://gist.github.com/DumboJetEngine/4fd2efb3462a879f210afc6636916069

I don't see asynchronous continuations affecting anything (when it comes to errors at least).