apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0
12.68k stars 4.58k forks

[Bug] [Process Instance] In version 3.2.2, workflows scheduled with the serial wait execution strategy block and accumulate after the master is overloaded #16474

Open MaskedMenhxy opened 1 month ago

MaskedMenhxy commented 1 month ago

Search before asking

What happened

The symptom is as follows: (image: image1)

Here are the logs: (image: WeCom screenshot 17237765455443)

What you expected to happen

I want to be able to continue executing tasks normally, or to discard old instances that are stuck in serial wait.

How to reproduce

Troubleshooting steps:

  1. The workflow instance uses the serial wait execution strategy.
  2. The current master server becomes overloaded.
  3. The current master drops out of the active master list.

Anything else

My analysis: when the master node is overloaded, it can drop out of the active master list. In that case, the workflow instance it just generated is never updated from "wait by serial_wait strategy" to "submit from serial_wait strategy", so its state in the database stays at "wait by serial_wait strategy". When the next scheduled workflow instance tries to move itself from "wait by serial_wait strategy" to "submit from serial_wait strategy", it finds an instance with a smaller id still in the "wait by serial_wait strategy" state, so its own state is not updated either. As a result, every later workflow instance stays in "wait by serial_wait strategy" and the instances pile up.
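
To make the blocking chain concrete, here is a minimal Python sketch of the behavior described above. It is a simplified model, not the actual `saveSerialProcess` code: `Instance`, `try_promote`, and `simulate` are hypothetical names, and it assumes that each promoted instance finishes before the next one is scheduled.

```python
from dataclasses import dataclass

WAIT = "wait by serial_wait strategy"
SUBMIT = "submit from serial_wait strategy"


@dataclass
class Instance:
    """Hypothetical stand-in for a workflow instance row in the database."""
    id: int
    state: str = WAIT


def try_promote(instance, all_instances):
    """Move the instance from WAIT to SUBMIT unless an instance with a
    smaller id is still in WAIT (the check described in the analysis)."""
    blocked = any(
        other.id < instance.id and other.state == WAIT
        for other in all_instances
    )
    if not blocked:
        instance.state = SUBMIT
    return not blocked


def simulate(master_fails_on_first):
    """Schedule four instances in order and return their final states.

    If master_fails_on_first is True, the master is assumed to leave the
    active list right after creating instance 1, so that instance is never
    promoted and stays in WAIT.
    """
    instances = []
    for new_id in (1, 2, 3, 4):
        inst = Instance(new_id)
        instances.append(inst)
        if new_id == 1 and master_fails_on_first:
            continue  # promotion step is skipped; the row is orphaned in WAIT
        try_promote(inst, instances)
    return [inst.state for inst in instances]


healthy = simulate(master_fails_on_first=False)
stuck = simulate(master_fails_on_first=True)
print(healthy)
print(stuck)
```

In this model a single orphaned instance is enough to keep every later instance in the wait state, which matches the accumulation seen in the screenshots.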

The relevant code is in org.apache.dolphinscheduler.service.process.ProcessServiceImpl#saveSerialProcess (shown as screenshots in the original issue).

Version

3.2.x

Are you willing to submit PR?

Code of Conduct

SbloodyS commented 1 month ago

This will be fixed in #16327.

MaskedMenhxy commented 1 month ago

@SbloodyS Thanks for your help!