apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0

[Bug] [MASTER] When Serial Wait workflow are frequently scheduled, serially waiting workflow might get stuck #16369

Closed. CloudSen closed this issue 2 weeks ago.

CloudSen commented 3 months ago

Search before asking

What happened

I think the state synchronization mechanism has critical errors. State transition is a process, but currently the notification mechanism between workflows relies on instantaneous states. As an example, suppose the workflows also had timeout configurations: in this scenario, if a workflow times out, the next serially waiting workflow might get stuck.

@startuml
StateWheelExecuteThread --> WorkflowTimeoutStateEventHandler: send process A PROCESS_TIMEOUT event
WorkflowTimeoutStateEventHandler -> WorkflowExecuteRunnable: processTimeout
WorkflowExecuteRunnable --> WorkflowStateEventHandler: send process A STOP event
WorkflowStateEventHandler -> WorkflowExecuteRunnable: endProcess
WorkflowExecuteRunnable -> WorkflowExecuteRunnable:checkSerialProcess
WorkflowExecuteRunnable --> ProcessServiceImpl: send process B RECOVER_SERIAL_WAIT command
ProcessServiceImpl -> ProcessServiceImpl: STEP 1: handleCommand(if state of process A is RUNNING_PROCESS_STATE, state of process B will change back to SERIAL_WAIT)
WorkflowStateEventHandler -> WorkflowStateEventHandler: STEP 2: update process A to STOP
@enduml


Since the order of STEP 1 and STEP 2 cannot be guaranteed, all subsequent instances might get stuck in the SERIAL_WAIT state.
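To make the ordering problem concrete, here is a minimal, self-contained sketch. The class and method names are hypothetical stand-ins, not the real DolphinScheduler code; the point is only that STEP 1 may read process A's state before STEP 2 has committed it:

```java
// Hypothetical, simplified model of the STEP 1 / STEP 2 race (illustrative names only).
// Process A is being stopped; process B is serially waiting behind it.
import java.util.HashMap;
import java.util.Map;

public class SerialWaitRaceSketch {
    enum State { RUNNING, STOP, SERIAL_WAIT, SUBMITTED_SUCCESS }

    // stands in for the committed rows in the process instance table
    static final Map<String, State> db = new HashMap<>();

    public static void main(String[] args) {
        db.put("A", State.RUNNING);     // process A is running and about to time out / stop
        db.put("B", State.SERIAL_WAIT); // process B waits serially behind A

        // Problematic interleaving: checkSerialProcess issues RECOVER_SERIAL_WAIT for B,
        // which some master consumes (STEP 1) before A's STOP state is committed (STEP 2).
        handleRecoverSerialWait("B", "A"); // STEP 1
        db.put("A", State.STOP);           // STEP 2

        System.out.println("B = " + db.get("B")); // prints SERIAL_WAIT: B is stuck, nobody wakes it again
    }

    // Simplified stand-in for handling a RECOVER_SERIAL_WAIT command.
    static void handleRecoverSerialWait(String next, String previous) {
        if (db.get(previous) == State.RUNNING) {
            db.put(next, State.SERIAL_WAIT);       // A still looks RUNNING, so B is pushed back to wait
        } else {
            db.put(next, State.SUBMITTED_SUCCESS); // only happens if STEP 2 had already been committed
        }
    }
}
```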

What you expected to happen

Before resuming subsequent instances, the workflow must make sure that its own state has been fully updated.

How to reproduce

Anything else

No response

Version

3.1.1

Are you willing to submit PR?

Code of Conduct

github-actions[bot] commented 3 months ago

Search before asking

What happened

State transition is a process, but currently the notification mechanism between workflows relies on instantaneous states. For example, suppose the workflows also had timeout configurations: in this scenario, if a workflow times out, the next serially waiting workflow might get stuck.

@startuml
StateWheelExecuteThread --> WorkflowTimeoutStateEventHandler: send process A PROCESS_TIMEOUT event
WorkflowTimeoutStateEventHandler -> WorkflowExecuteRunnable: processTimeout
WorkflowExecuteRunnable --> WorkflowStateEventHandler: send process A STOP event
WorkflowStateEventHandler -> WorkflowExecuteRunnable: endProcess
WorkflowExecuteRunnable -> WorkflowExecuteRunnable:checkSerialProcess
WorkflowExecuteRunnable --> ProcessServiceImpl: send process B RECOVER_SERIAL_WAIT command
ProcessServiceImpl -> ProcessServiceImpl: STEP 1: handleCommand(if state of process A is RUNNING_PROCESS_STATE, state of process B will change back to SERIAL_WAIT)
WorkflowStateEventHandler -> WorkflowStateEventHandler: STEP 2: update process A to STOP
@enduml


Since the order of STEP 1 and STEP 2 cannot be guaranteed, all subsequent instances might get stuck in the SERIAL_WAIT state.


What you expected to happen

Before resuming subsequent instances, the workflow must make sure that its own state has been fully updated.

How to reproduce

Anything else

No response

Version

dev

Are you willing to submit PR?

Code of Conduct

SbloodyS commented 3 months ago

cc @ruanwenjun

CloudSen commented 3 months ago

Maybe we need a cache queue for the RECOVER_SERIAL_WAIT command? If the command handler finds that the previous workflow has not yet been updated to a completed state, the command should be put back into the queue and checked again after a while. Once the previous workflow's state has been fully updated, the command is consumed from the queue and the waiting instance transitions from SERIAL_WAIT to SUBMITTED_SUCCESS.
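A rough sketch of that idea, assuming a simple in-memory delay queue; the class and method names (submit, consumeLoop, isFinished, recover) are hypothetical and not the actual DolphinScheduler API:

```java
// Hypothetical sketch of a delayed re-check queue for RECOVER_SERIAL_WAIT commands.
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class RecoverSerialWaitQueue {

    /** Minimal command wrapper that becomes visible again only after a delay. */
    static class DelayedCommand implements Delayed {
        final long nextProcessId;
        final long previousProcessId;
        final long readyAtMillis;

        DelayedCommand(long nextProcessId, long previousProcessId, long delayMillis) {
            this.nextProcessId = nextProcessId;
            this.previousProcessId = previousProcessId;
            this.readyAtMillis = System.currentTimeMillis() + delayMillis;
        }

        @Override public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
        }
        @Override public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    private final DelayQueue<DelayedCommand> queue = new DelayQueue<>();

    /** Called when a RECOVER_SERIAL_WAIT command is received. */
    public void submit(long nextProcessId, long previousProcessId) {
        queue.put(new DelayedCommand(nextProcessId, previousProcessId, 0));
    }

    /** Consumer loop: only recover B once A is really finished; otherwise re-queue and check later. */
    public void consumeLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            DelayedCommand cmd = queue.take();
            if (isFinished(cmd.previousProcessId)) {
                recover(cmd.nextProcessId);                     // SERIAL_WAIT -> SUBMITTED_SUCCESS
            } else {
                queue.put(new DelayedCommand(cmd.nextProcessId, // previous state not committed yet,
                        cmd.previousProcessId, 5_000));         // check again in 5 seconds
            }
        }
    }

    // Stubs standing in for real lookups/updates against the process instance table.
    boolean isFinished(long processId) { return false; }
    void recover(long processId) { }
}
```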

ZhaoRuidong commented 3 months ago

+1

CloudSen commented 3 months ago

For now, I have resolved this issue by retrying the RECOVER_SERIAL_WAIT command when the previous workflow has not been fully updated, and it works well.

ruanwenjun commented 3 months ago

Right now the serial wait implementation is really unstable; there are a lot of cases that will cause it to misbehave, e.g.

  1. A concurrent trigger will cause multiple workflow instances to run that should be in serial wait.
  2. A failed notification might leave the origin workflow instance unable to finish.
  3. The workflow has to deal with the notification logic itself, which makes the workflow instance state transition more complex.

It would be better to refactor this and use a global SerialWaitCoordinator to notify the serially waiting workflow instances, so that the origin workflow instance does not need to care about whether it has to send a notification.
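One possible shape of such a coordinator, sketched under simple assumptions (hypothetical names, in-memory bookkeeping, single node): finishing instances only report that they are done, and the coordinator alone decides which serially waiting instance to wake.

```java
// Hypothetical sketch of a global SerialWaitCoordinator (illustrative names only).
// Workflow instances report "I finished"; the coordinator owns the wake-up decision.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SerialWaitCoordinator {

    // Serially waiting instance ids per workflow definition code, in arrival order.
    private final Map<Long, Deque<Long>> waitingByDefinition = new ConcurrentHashMap<>();

    /** Called when a new instance of a serial-wait workflow is submitted. */
    public synchronized void onSubmitted(long definitionCode, long instanceId) {
        waitingByDefinition.computeIfAbsent(definitionCode, k -> new ArrayDeque<>()).addLast(instanceId);
        tryWakeNext(definitionCode);
    }

    /** Called only after the finished instance's terminal state has been committed. */
    public synchronized void onFinished(long definitionCode, long finishedInstanceId) {
        tryWakeNext(definitionCode);
    }

    private void tryWakeNext(long definitionCode) {
        if (hasRunningInstance(definitionCode)) {
            return; // previous instance not terminal yet; keep the next one in SERIAL_WAIT
        }
        Deque<Long> waiting = waitingByDefinition.getOrDefault(definitionCode, new ArrayDeque<>());
        Long next = waiting.pollFirst();
        if (next != null) {
            wake(next); // SERIAL_WAIT -> SUBMITTED_SUCCESS, dispatch to a master
        }
    }

    // Stubs standing in for queries/updates against the process instance table.
    boolean hasRunningInstance(long definitionCode) { return false; }
    void wake(long instanceId) { }
}
```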

CloudSen commented 3 months ago

Right now the serial wait implementation is really unstable; there are a lot of cases that will cause it to misbehave, e.g.

  1. A concurrent trigger will cause multiple workflow instances to run that should be in serial wait.
  2. A failed notification might leave the origin workflow instance unable to finish.
  3. The workflow has to deal with the notification logic itself, which makes the workflow instance state transition more complex.

It would be better to refactor this and use a global SerialWaitCoordinator to notify the serially waiting workflow instances, so that the origin workflow instance does not need to care about whether it has to send a notification.

I agree with your opinion; I have also thought about a coordinator like SerialWaitCoordinator instead of having the workflow instances notify each other themselves.

Retrying the RECOVER_SERIAL_WAIT command can only resolve instances where the next workflow state is already SERIAL_WAIT. However, it does not address cases where the next workflow instance is transitioning into a SERIAL_WAIT state.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

Gallardot commented 1 month ago

In version 3.2.1, I fixed a similar issue, which might be helpful for the current issue. You can refer to it. @CloudSen

#15270

CloudSen commented 1 month ago

@Gallardot Yes, I’ve already applied patch #15270. But as I mentioned earlier, the state change is always a process. Opening a new transaction only speeds up the state update, but the same issue still exists in concurrent scenarios. A synchronizer is still needed for serially running workflows.

CloudSen commented 1 month ago

@Gallardot Here's another example, for saveSerialProcess: in concurrent scenarios (multiple parent workflows with the same scheduling cycle reference the same sub-workflow), multiple start_workflow commands are consumed by different masters at the same time. Then, in saveSerialProcess, each master detects that there are no running instances (because the other transactions have not been committed yet), and multiple start events for the workflow are triggered at the same time. This ultimately leads to multiple workflow instances running concurrently, which is incorrect when they are supposed to be processed serially.
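A minimal sketch of that check-then-act race, with hypothetical names: two masters both pass the "no running instance" check before either transaction is committed, so both instances start. Serializing the check and the state change per workflow definition (for example with a row lock on the definition or a distributed lock) would close the window.

```java
// Hypothetical illustration of the saveSerialProcess race (illustrative names, not real code).
import java.util.concurrent.atomic.AtomicInteger;

public class SaveSerialProcessRaceSketch {

    // Stands in for the committed count of running instances of one workflow definition.
    static final AtomicInteger runningInstances = new AtomicInteger(0);

    // Racy version: the check and the "commit" are not atomic, so two masters can both pass the check.
    static void saveSerialProcessRacy() {
        if (runningInstances.get() == 0) {       // both masters may read 0 before either commits
            runningInstances.incrementAndGet();  // then both run: serial execution is violated
        }
    }

    // Guarded version: the check and the state change happen atomically per workflow definition.
    // Across multiple masters this would need a database row lock or a distributed lock instead.
    static final Object definitionLock = new Object();

    static void saveSerialProcessGuarded() {
        synchronized (definitionLock) {
            if (runningInstances.get() == 0) {
                runningInstances.incrementAndGet();
            }
            // else: mark the new instance SERIAL_WAIT
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread m1 = new Thread(SaveSerialProcessRaceSketch::saveSerialProcessRacy);
        Thread m2 = new Thread(SaveSerialProcessRaceSketch::saveSerialProcessRacy);
        m1.start(); m2.start(); m1.join(); m2.join();
        // May print 2 when both checks run before either increment.
        System.out.println("running instances after racy path = " + runningInstances.get());
    }
}
```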

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] commented 2 weeks ago

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.