apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0
12.89k stars 4.63k forks

[Bug] [TaskExecutionRunnable] Sub workflow task always repeat run in master-server failover #16767

Open reele opened 2 weeks ago

reele commented 2 weeks ago

What happened

Normally, FailoverCoordinator.getFailoverWorkflowsForMaster() finds all workflows that need failover. When the sub-workflow task's TaskExecutionRunnable.failover() method is called, takeOverTaskFromExecutor() now returns false if the task is a logic task. This results in the creation of a new sub-workflow task instance and the publication of its TaskStartLifecycleEvent, causing the sub-workflow to run again during the failover process.

https://github.com/apache/dolphinscheduler/blob/071994933b05850e7dd5f7bccbf45d867640e244/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/runnable/TaskExecutionRunnable.java#L120-L135

https://github.com/apache/dolphinscheduler/blob/071994933b05850e7dd5f7bccbf45d867640e244/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/runnable/TaskExecutionRunnable.java#L160-L164

So I think it would be better to check the state of the sub-workflow instance (via the DAO or by communicating with the server) and take it over, instead of creating a whole new task instance.
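
A minimal sketch of the idea above, not DolphinScheduler's actual code. Every type and method name here (`SubWorkflowFailoverSketch`, `State`, `Action`, `currentBehaviour`, `proposedBehaviour`) is hypothetical, and treating RUNNING or SUCCESS as a "good" state is an assumption made only for illustration. It contrasts the behaviour described in this report (always recreate) with the suggested one (take over when the sub-workflow instance is still healthy).

```java
import java.util.Optional;

// Hypothetical sketch: none of these types exist in DolphinScheduler.
final class SubWorkflowFailoverSketch {

    // Simplified states of a sub-workflow instance, e.g. as read from the database.
    enum State { RUNNING, SUCCESS, FAILURE }

    // What failover of the parent's sub-workflow task ends up doing.
    enum Action { TAKE_OVER, RECREATE }

    // Behaviour described in the report: a logic task is never taken over,
    // so a new task instance is created regardless of the sub-workflow state.
    static Action currentBehaviour(Optional<State> subWorkflowState) {
        return Action.RECREATE;
    }

    // Suggested behaviour: reuse the sub-workflow instance when it exists and
    // is in a good state (assumed here to mean RUNNING or SUCCESS).
    static Action proposedBehaviour(Optional<State> subWorkflowState) {
        return subWorkflowState
                .filter(s -> s == State.RUNNING || s == State.SUCCESS)
                .map(s -> Action.TAKE_OVER)
                .orElse(Action.RECREATE);
    }

    public static void main(String[] args) {
        Optional<State> running = Optional.of(State.RUNNING);
        System.out.println("current:  " + currentBehaviour(running));
        System.out.println("proposed: " + proposedBehaviour(running));
    }
}
```

When no sub-workflow instance can be found, or it has failed, the sketch falls back to recreating the task, which matches the existing behaviour.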

What you expected to happen

Take over the sub-workflow task if the sub-workflow instance is in a healthy state.

How to reproduce

Execute a workflow that contains a sub-workflow, restart the master-server, then query the database.

Anything else

No response

Version

dev

SbloodyS commented 2 weeks ago

The sub-workflow task is a special task type. During the fault tolerance period, all task types will generate new task instances, which is a unified operation. So this is not a bug.

reele commented 2 weeks ago

Agreed. Perhaps we can make some improvements in the future to avoid tasks being executed repeatedly.

SbloodyS commented 2 weeks ago

Task instances usually run as Linux processes, and the same process id may be occupied by other programs after downtime. So it's difficult to achieve.

reele commented 2 weeks ago

Yes, but what I mean is that the sub-workflow has already been taken over during the master's failover process; it is just not visible and keeps running in the background. Then, when the parent workflow goes through failover, it reruns the sub-workflow (via the sub-workflow task), so the original sub-workflow instance and the new one may both be running at the same time.
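
A toy model of the duplication described in this comment, as a sketch under stated assumptions rather than real master-server code. The class `FailoverDuplicationSketch` and its methods are hypothetical. It assumes that each failover keeps the already running sub-workflow instances alive and also re-creates the parent's sub-workflow task, which starts one more instance.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the master's failover; not DolphinScheduler code.
final class FailoverDuplicationSketch {

    // Ids of sub-workflow instances currently running under the master.
    private final List<Integer> runningSubWorkflows = new ArrayList<>();
    private int nextId = 1;

    // The parent's sub-workflow task starts: one sub-workflow instance is created.
    void startParentTask() {
        runningSubWorkflows.add(nextId++);
    }

    // One master restart, as described in the thread.
    void failover() {
        // Step 1: running sub-workflow instances are failed over and keep
        // running (nothing to do here: the list survives the restart).
        // Step 2: the parent's sub-workflow task is re-created and started
        // again, which adds another sub-workflow instance.
        startParentTask();
    }

    int runningCount() {
        return runningSubWorkflows.size();
    }

    public static void main(String[] args) {
        FailoverDuplicationSketch master = new FailoverDuplicationSketch();
        master.startParentTask(); // initial run of the parent workflow
        master.failover();        // first master restart
        master.failover();        // second master restart
        System.out.println(master.runningCount() + " sub-workflow instances running");
    }
}
```

With one initial start and two restarts, the model ends with three sub-workflow instances running at once.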

reele commented 1 week ago

@SbloodyS I'm sorry if my previous description was unclear. This issue is mainly about fault tolerance in sub-workflow.

Below is a detailed reproduction.

The schedule tree is:

MAIN_WORKFLOW
    SUB_WORKFLOW
        STEP1->STEP2->STEP3->STEP4->STEP5

[image 1] [image 2]

  1. Execute MAIN_WORKFLOW.
  2. Stop and start the master-server 2 times:

[image 3]

After that, SUB_WORKFLOW has been executed 3 times, and all three instances are running and controlled by the master:

[image 4]

Finally, they all finish:

[image 5]

I created a branch that tries to fix it: https://github.com/reele/dolphinscheduler/compare/dev-usable-fix-all...reele:dolphinscheduler:fix-takeover-wf?expand=1

It works well.

ruanwenjun commented 3 days ago

@reele Hi, the sub-workflow take-over logic lives in SubWorkflowLogicTask. Once we fail over a sub-workflow task, a new logic task instance is generated; the new task instance contains the original SubWorkflowLogicTaskRuntimeContext and can therefore track the original sub-workflow instance.

I will test this.
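
A rough sketch of the mechanism described in this comment, assuming the runtime context simply records the id of the sub-workflow instance the task started. The real SubWorkflowLogicTask and SubWorkflowLogicTaskRuntimeContext may look quite different; `RuntimeContextSketch`, `Context`, `LogicTask`, `start`, and `failover` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real SubWorkflowLogicTask may differ.
final class RuntimeContextSketch {

    // Stand-in for the runtime context: remembers which sub-workflow
    // instance this task originally started.
    static final class Context {
        final int subWorkflowInstanceId;

        Context(int subWorkflowInstanceId) {
            this.subWorkflowInstanceId = subWorkflowInstanceId;
        }
    }

    // Stand-in for a logic task instance; context is null before first start.
    static final class LogicTask {
        final Context context;

        LogicTask(Context context) {
            this.context = context;
        }
    }

    // Ids of sub-workflow instances that were actually triggered.
    private final List<Integer> triggered = new ArrayList<>();
    private int nextInstanceId = 1;

    // Start a task: trigger a new sub-workflow instance only when no
    // context was carried over; otherwise keep tracking the original one.
    LogicTask start(LogicTask task) {
        if (task.context != null) {
            return task;
        }
        int id = nextInstanceId++;
        triggered.add(id);
        return new LogicTask(new Context(id));
    }

    // Failover: a new task instance is generated that carries the
    // original context, then it is started like any other task.
    LogicTask failover(LogicTask old) {
        return start(new LogicTask(old.context));
    }

    int triggeredCount() {
        return triggered.size();
    }
}
```

In this model the failed-over task is a new object, yet it still points at the original sub-workflow instance, so no second instance is triggered.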