Open reele opened 2 weeks ago
The sub-workflow task is a special task type. During the fault tolerance period, all task types will generate new task instances, which is a unified operation. So this is not a bug.
Agreed. Perhaps we can make some improvement in the future to avoid tasks being executed repeatedly.
Task instances usually run as Linux processes, and the same process id may be occupied by other programs after downtime. So it's difficult to achieve.
Yes, but what I mean is that the sub-workflow has already been taken over in the master's failover process; it is just invisible and still running in the background. Then, when the parent workflow goes through failover, it reruns the sub-workflow (through the sub-workflow task), so the original sub-workflow instance and the new one may both be running at the same time.
@SbloodyS I'm sorry if my previous description was unclear. This issue is mainly about fault tolerance in sub-workflow.
Below is a detailed reproduction.
Here is the schedule tree:
MAIN_WORKFLOW
  SUB_WORKFLOW
    STEP1->STEP2->STEP3->STEP4->STEP5
Then the SUB_WORKFLOW was executed 3 times, and all three instances were running and controlled by the master.
Finally, they all finished.
I created a branch to try to fix it: https://github.com/reele/dolphinscheduler/compare/dev-usable-fix-all...reele:dolphinscheduler:fix-takeover-wf?expand=1
and that works well.
@reele Hi, the sub-workflow take-over logic is under SubWorkflowLogicTask. Once we fail over a sub-workflow task, a new logic task instance is generated. The new task instance contains the original SubWorkflowLogicTaskRuntimeContext and can then track the original sub-workflow instance.
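As a rough illustration of the behaviour described in this comment, here is a minimal sketch. It is not DolphinScheduler code: `RuntimeContextSketch`, `SubWorkflowContext`, `LogicTaskInstance` and `failover` are hypothetical, simplified stand-ins, and the real SubWorkflowLogicTask and SubWorkflowLogicTaskRuntimeContext may be shaped quite differently. The only point is that a new task instance can reuse the original context and therefore still point at the original sub-workflow instance.

```java
import java.util.Objects;

// Hypothetical sketch: simplified stand-ins, not the real DolphinScheduler classes.
final class RuntimeContextSketch {

    // Stand-in for a runtime context that remembers which sub-workflow
    // instance the task originally started.
    static final class SubWorkflowContext {
        final long subWorkflowInstanceId;

        SubWorkflowContext(long subWorkflowInstanceId) {
            this.subWorkflowInstanceId = subWorkflowInstanceId;
        }
    }

    // Stand-in for a logic task instance that carries such a context.
    static final class LogicTaskInstance {
        final long taskInstanceId;
        final SubWorkflowContext context;

        LogicTaskInstance(long taskInstanceId, SubWorkflowContext context) {
            this.taskInstanceId = taskInstanceId;
            this.context = Objects.requireNonNull(context, "context");
        }
    }

    // Failover creates a NEW task instance (new id) but reuses the ORIGINAL
    // context, so the new instance tracks the same sub-workflow instance
    // instead of starting another one.
    static LogicTaskInstance failover(LogicTaskInstance original, long newTaskInstanceId) {
        Objects.requireNonNull(original, "original");
        return new LogicTaskInstance(newTaskInstanceId, original.context);
    }
}
```

Under these assumptions, a duplicate run would only happen if the context were lost or ignored when the new task instance is started.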
I will test this.
What happened
Normally, FailoverCoordinator.getFailoverWorkflowsForMaster() finds all workflows that need failover. When the sub-workflow task's TaskExecutionRunnable.failover() method is called, takeOverTaskFromExecutor() now returns false if the task is a logic task. This results in the creation of a new sub-workflow task instance and the publication of its TaskStartLifecycleEvent, causing the sub-workflow to run again during the failover process.
https://github.com/apache/dolphinscheduler/blob/071994933b05850e7dd5f7bccbf45d867640e244/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/runnable/TaskExecutionRunnable.java#L120-L135
https://github.com/apache/dolphinscheduler/blob/071994933b05850e7dd5f7bccbf45d867640e244/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/runnable/TaskExecutionRunnable.java#L160-L164
So I think it would be better to check the sub-workflow instance properly (through the DAO or by communicating with the server) and take it over, instead of creating a whole new task instance.
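The idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not a patch: `SubWorkflowFailoverSketch`, `SubWorkflowState`, `FailoverAction` and `decide` are hypothetical names, and which states count as safe to take over is an assumption that would need to be checked against the real workflow state model.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: none of these types exist in DolphinScheduler.
final class SubWorkflowFailoverSketch {

    // Simplified stand-in for the persisted state of a sub-workflow instance,
    // e.g. as it might be read through the DAO during failover.
    enum SubWorkflowState { RUNNING, PAUSED, SUCCESS, FAILURE, NOT_FOUND }

    enum FailoverAction { TAKE_OVER, CREATE_NEW_INSTANCE }

    // Assumption: an instance in one of these states is "in good status"
    // and should be adopted rather than started again.
    private static final Set<SubWorkflowState> TAKE_OVER_STATES =
            EnumSet.of(SubWorkflowState.RUNNING, SubWorkflowState.PAUSED);

    // Decide what the failover of a sub-workflow task should do, given the
    // state of the sub-workflow instance it originally started.
    static FailoverAction decide(SubWorkflowState state) {
        return TAKE_OVER_STATES.contains(state)
                ? FailoverAction.TAKE_OVER
                : FailoverAction.CREATE_NEW_INSTANCE;
    }
}
```

With a check like this, only a sub-workflow instance that is missing or no longer healthy would lead to a fresh task instance and a new start event.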
What you expected to happen
Take over the sub-workflow task if the sub-workflow instance is in a good state.
How to reproduce
Execute a workflow that contains a sub-workflow, restart the master server, and query the database.
Anything else
No response
Version
dev