apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0

[Bug] [master] dynamic task may cause duplicated subprocess execution #16437

Open ChaoquanTao opened 3 months ago

ChaoquanTao commented 3 months ago

What happened

Every 10 seconds, the dynamic task polls subprocess status, selects the subprocesses in wait_to_run state, wraps them as commands, and inserts them into the database. At the same time, MasterSchedulerBootstrap periodically selects these commands to start workflows. In a single-master deployment, if the master is overloaded for more than 10 seconds, it stops consuming commands while the dynamic task keeps inserting them. The pending commands accumulate, and the same subprocess is executed repeatedly once the master recovers.

if (isOverload) {
    log.warn("The current server is overload, cannot consumes commands.");
    MasterServerMetrics.incMasterOverload();
    Thread.sleep(Constants.SLEEP_TIME_MILLIS);
    continue;
}
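The accumulation described above can be illustrated with a small standalone simulation. This is only a sketch under simplifying assumptions: the class `DuplicateCommandSimulation`, the method `queuedCommands`, and the in-memory list standing in for the command table are all hypothetical and are not DolphinScheduler code. It assumes a single subprocess that stays in wait_to_run for the whole overload window because the master consumes nothing.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateCommandSimulation {

    /**
     * Hypothetical model: returns the commands queued for one subprocess
     * while the master is overloaded and consumes no commands.
     */
    public static List<Integer> queuedCommands(int subprocessId,
                                               int overloadSeconds,
                                               int pollIntervalSeconds) {
        // Stand-in for the command table in the database
        List<Integer> commandTable = new ArrayList<>();
        for (int t = 0; t < overloadSeconds; t += pollIntervalSeconds) {
            // No command has been consumed, so the subprocess is still
            // wait_to_run and this poll inserts one more command for it
            commandTable.add(subprocessId);
        }
        return commandTable;
    }

    public static void main(String[] args) {
        // Overload shorter than one poll interval: a single command
        System.out.println(queuedCommands(42, 5, 10).size());
        // Overload spanning several poll intervals: duplicates pile up
        System.out.println(queuedCommands(42, 25, 10).size());
    }
}
```

In this model, a 25-second overload with a 10-second poll interval leaves three commands queued for the same subprocess, so the master would start it three times once it recovers.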

Maybe in MasterSchedulerBootstrap, if processInstanceExecCacheManager.contains(processInstance.getId()) is true, we should just return?

if (processInstanceExecCacheManager.contains(processInstance.getId())) {
    log.error(
            "The workflow instance is already been cached, this case shouldn't be happened");
}

Or, to minimize the impact of the change, we could return early only for DYNAMIC_GENERATION commands, like this?

if (processInstanceExecCacheManager.contains(processInstance.getId())
        && CommandType.DYNAMIC_GENERATION.equals(command.getCommandType())) {
    log.error(
            "The workflow instance is already been cached, this case shouldn't be happened, id: {}",
            processInstance.getId());
    return;
}
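To show the intended effect of this guard in isolation, here is a minimal standalone sketch. Everything in it is an assumption made for illustration: `DuplicateCommandGuard`, its `handle` and `startCount` methods, the local `CommandType` enum, and the `HashSet` standing in for processInstanceExecCacheManager are hypothetical and do not reflect the real DolphinScheduler classes. It also assumes that duplicated commands resolve to the same process instance id.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DuplicateCommandGuard {

    // Hypothetical stand-in for the scheduler's command types
    public enum CommandType { START_PROCESS, DYNAMIC_GENERATION }

    // Stand-in for processInstanceExecCacheManager
    private final Set<Integer> cachedInstanceIds = new HashSet<>();
    // Counts how many times each instance was actually started
    private final Map<Integer, Integer> startCounts = new HashMap<>();

    /** Returns true if the workflow was started, false if the command was skipped. */
    public boolean handle(int processInstanceId, CommandType commandType) {
        if (cachedInstanceIds.contains(processInstanceId)
                && CommandType.DYNAMIC_GENERATION.equals(commandType)) {
            // Already cached: skip the duplicated dynamic-generation command
            return false;
        }
        cachedInstanceIds.add(processInstanceId);
        startCounts.merge(processInstanceId, 1, Integer::sum);
        return true;
    }

    public int startCount(int processInstanceId) {
        return startCounts.getOrDefault(processInstanceId, 0);
    }

    public static void main(String[] args) {
        DuplicateCommandGuard guard = new DuplicateCommandGuard();
        // Three accumulated commands for the same instance
        for (int i = 0; i < 3; i++) {
            guard.handle(7, CommandType.DYNAMIC_GENERATION);
        }
        System.out.println(guard.startCount(7));
    }
}
```

Restricting the early return to DYNAMIC_GENERATION keeps the behavior for every other command type unchanged, which is the point of the narrower variant.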

What you expected to happen

The dynamic task should not execute the same subprocess more than once.

How to reproduce

Start a dynamic task, keep the master overloaded for more than 10 seconds, and observe the commands waiting to be executed.

Anything else

No response

Version

3.2.x
