apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0
12.73k stars 4.58k forks

[Question] [Master] Complement process instances executed in parallel influence each other #16340

Open starrysxy opened 2 months ago

starrysxy commented 2 months ago

What happened

When I run a workflow in parallel using complement mode, I can't safely stop a process instance. If I do, some other process instances are never scheduled, in addition to the one I stopped.

For a concrete example, see How to reproduce below.

I have checked the code. Process instances executed in parallel are divided into queues according to the degree of parallelism. However, something is wrong in the org.apache.dolphinscheduler.server.master.event.WorkflowStateEventHandler#handleStateEvent method.

When I stop one process instance, I get these logs:

songxingyin@songxinyindeMBP logs % cat dolphinscheduler-standalone.2024-07-17_23.0.log | grep 'Handle workflow instance state event, the current workflow instance state WorkflowExecutionStatus' | grep 'stop'
[INFO] 2024-07-17 23:33:35.736 +0800 o.a.d.s.m.e.WorkflowStateEventHandler:[43] - Handle workflow instance state event, the current workflow instance state WorkflowExecutionStatus{code=4, desc='ready stop'} will be changed to WorkflowExecutionStatus{code=4, desc='ready stop'}
[INFO] 2024-07-17 23:33:38.809 +0800 o.a.d.s.m.e.WorkflowStateEventHandler:[43] - Handle workflow instance state event, the current workflow instance state WorkflowExecutionStatus{code=5, desc='stop'} will be changed to WorkflowExecutionStatus{code=5, desc='stop'}

The initial state and target state are both {code=4, desc='ready stop'} or {code=5, desc='stop'}.

I know the next complement command will be created in org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteRunnable#processComplementData method.

When the state is {code=4, desc='ready stop'}, processComplementData() returns false before creating the next complement command. When the state is {code=5, desc='stop'}, processComplementData() is not called at all, so again no next complement command is created.

@Override
public boolean handleStateEvent(WorkflowExecuteRunnable workflowExecuteRunnable,
                                StateEvent stateEvent) throws StateEventHandleException {
    WorkflowStateEvent workflowStateEvent = (WorkflowStateEvent) stateEvent;
    ProcessInstance processInstance =
            workflowExecuteRunnable.getWorkflowExecuteContext().getWorkflowInstance();
    ProcessDefinition processDefinition = processInstance.getProcessDefinition();
    measureProcessState(workflowStateEvent, processInstance.getProcessDefinitionCode().toString());

    log.info(
            "Handle workflow instance state event, the current workflow instance state {} will be changed to {}",
            processInstance.getState(), workflowStateEvent.getStatus());

    if (workflowStateEvent.getStatus().isStop()) {
        // serial wait execution type needs to wake up the waiting process
        if (processDefinition.getExecutionType().typeIsSerialWait() || processDefinition.getExecutionType()
                .typeIsSerialPriority()) {
            workflowExecuteRunnable.endProcess();
            return true;
        }
        workflowExecuteRunnable.updateProcessInstanceState(workflowStateEvent);
        return true;
    }
    if (workflowExecuteRunnable.processComplementData()) {
        return true;
    }
    if (workflowStateEvent.getStatus().isFinished()) {
        ...
    }

    if (workflowStateEvent.getStatus().isReadyStop()) {
        ...
    }
    return true;
}

public boolean processComplementData() {
    ProcessInstance workflowInstance = workflowExecuteContext.getWorkflowInstance();
    if (!needComplementProcess()) {
        return false;
    }

    // when the serial complement is executed, the next complement instance is created,
    // and this method does not need to be executed when the parallel complement is used.
    if (workflowInstance.getState().isReadyStop() || !workflowInstance.getState().isFinished()) {
        return false;
    }
    ...
    return true;
}
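
To make the two paths easier to follow, here is a minimal, self-contained sketch of the control flow quoted above. It is not DolphinScheduler code: the class StopEventModel, the Status enum, and the step names are hypothetical, needComplementProcess() is assumed to return true, and isFinished() is assumed to hold for both 'stop' and 'success'.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the quoted handler; not the real DolphinScheduler classes. */
public class StopEventModel {

    public enum Status { RUNNING, READY_STOP, STOP, SUCCESS }

    /** Returns the names of the steps taken while handling one workflow state event. */
    public static List<String> handle(Status instanceState, Status eventStatus) {
        List<String> steps = new ArrayList<>();
        if (eventStatus == Status.STOP) {
            // Mirrors the isStop() branch: it returns before processComplementData() is reached.
            steps.add("updateState");
            return steps;
        }
        if (processComplementData(instanceState, steps)) {
            return steps;
        }
        steps.add("otherHandling");
        return steps;
    }

    /** Mirrors the quoted guard; assumes needComplementProcess() is true. */
    static boolean processComplementData(Status instanceState, List<String> steps) {
        steps.add("processComplementData");
        // Assumption: 'stop' and 'success' both count as finished states.
        boolean finished = instanceState == Status.STOP || instanceState == Status.SUCCESS;
        if (instanceState == Status.READY_STOP || !finished) {
            return false; // leaves before the next complement command is created
        }
        steps.add("createNextComplementCommand");
        return true;
    }

    public static void main(String[] args) {
        System.out.println(handle(Status.READY_STOP, Status.READY_STOP));
        System.out.println(handle(Status.STOP, Status.STOP));
        System.out.println(handle(Status.SUCCESS, Status.SUCCESS));
    }
}
```

Under these assumptions, only the SUCCESS case reaches createNextComplementCommand. Neither 'ready stop' nor 'stop' does, which matches the behaviour described above.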

What you expected to happen

When I stop one complement process instance executed in parallel, other instances will not be influenced.

How to reproduce

  1. Click the 'Start' button and set the configuration as shown in the following picture. image
  2. Five process instances start running, and I stop one of them. Within that cycle everything looks fine, but in the next cycle only 4 process instances are running (in the following picture, A is the first cycle and B is the second). If I stop another one in a later cycle, only 3 process instances run in the cycle after that, and so on. image
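
The shrinking parallelism in these steps can be reproduced with a small simulation. This is only a sketch under an assumption, not the scheduler's actual algorithm: each parallel queue is treated as a chain in which a normally finished instance creates the command for the next date, while a stopped instance creates nothing, so the rest of its queue never runs. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of parallel complement queues; not DolphinScheduler code. */
public class ComplementQueueSimulation {

    /**
     * @param queueLengths number of dates assigned to each parallel queue
     * @param stopInCycle  stopInCycle[q] is the 0-based cycle in which the instance of
     *                     queue q is stopped manually, or -1 if it is never stopped
     * @return the number of instances running in each cycle
     */
    public static List<Integer> runningPerCycle(int[] queueLengths, int[] stopInCycle) {
        List<Integer> running = new ArrayList<>();
        boolean[] broken = new boolean[queueLengths.length];
        for (int cycle = 0; ; cycle++) {
            int count = 0;
            for (int q = 0; q < queueLengths.length; q++) {
                if (broken[q] || cycle >= queueLengths[q]) {
                    continue; // queue was cut off by a stop, or has no dates left
                }
                count++;
                if (stopInCycle[q] == cycle) {
                    broken[q] = true; // assumed: a stopped instance creates no next command
                }
            }
            if (count == 0) {
                return running;
            }
            running.add(count);
        }
    }

    public static void main(String[] args) {
        int[] lengths = {4, 4, 4, 4, 4};  // 5 queues with 4 dates each (illustrative values)
        int[] stops = {0, 1, -1, -1, -1}; // stop queue 0 in cycle 0 and queue 1 in cycle 1
        System.out.println(runningPerCycle(lengths, stops)); // prints [5, 4, 3, 3]
    }
}
```

With five queues and one stop in each of the first two cycles, the simulated counts go 5, 4, 3, which is the pattern seen in the reproduction.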

Anything else

In my opinion, this is a bug: complement process instances executed in parallel should not influence each other.

If this is a bug, I am willing to try to fix it.

Version

dev

SbloodyS commented 2 months ago

This is not a bug. If you don't feel you need to make up on a certain day, you can choose to skip that date.

starrysxy commented 2 months ago

> This is not a bug. If you don't feel you need to make up on a certain day, you can choose to skip that date.

I see your point, but the real situation is more complex, because sometimes I can't foresee which day I will need to skip.

Consider this situation: after I submit a task to make up data from June 1st to June 30th in parallel mode, someone tells me that the data for June 10th is wrong, while the data for the other days is fine. This affects only my complement instance for June 10th, so I just want to stop that one instance.

But now, after I stop the instance for June 10th, some other instances are never scheduled. I have to find out which dates were not scheduled and then re-run those instances.

So I think this is a bug, because complement process instances executed in parallel should not influence each other.
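
As an illustration of the manual workaround described in this comment, the unscheduled dates can be found by comparing the requested complement range with the dates that actually have a process instance. This is a minimal sketch: the class MissingComplementDates and its method are hypothetical, the scheduled dates are assumed to have been collected separately (for example by hand from the instance list), and the year 2024 is only an illustrative value.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for spotting complement dates that never got an instance. */
public class MissingComplementDates {

    /** Returns every date in [start, end] that is absent from the scheduled set. */
    public static List<LocalDate> missingDates(LocalDate start, LocalDate end,
                                               Set<LocalDate> scheduled) {
        List<LocalDate> missing = new ArrayList<>();
        for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
            if (!scheduled.contains(d)) {
                missing.add(d);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative data only: dates that were actually scheduled.
        Set<LocalDate> scheduled = new HashSet<>(Arrays.asList(
                LocalDate.of(2024, 6, 1),
                LocalDate.of(2024, 6, 2),
                LocalDate.of(2024, 6, 4)));
        List<LocalDate> missing = missingDates(
                LocalDate.of(2024, 6, 1), LocalDate.of(2024, 6, 5), scheduled);
        System.out.println(missing); // prints [2024-06-03, 2024-06-05]
    }
}
```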

starrysxy commented 2 months ago

@SbloodyS I still think this is a bug. Maybe this scenario is not best practice, but anyone who does this will be confused by the missing process instances.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
