apache / dolphinscheduler

Apache DolphinScheduler is a modern data orchestration platform for creating high-performance workflows with low code.
https://dolphinscheduler.apache.org/
Apache License 2.0

[DSIP-69] Fix master dispatch task timeout might cause task duplicate running in worker #16481

Open. Opened by ruanwenjun 1 month ago.



Motivation

Currently, there are cases in which a task can be dispatched more than once. For example:


The master first dispatches task a to worker A but receives a timeout response (this can happen when worker A's RPC is busy). The master then selects a new worker, B, and retries the dispatch.

Two situations are then possible:

  1. Worker A did receive the task. The task now exists on both worker A and worker B, and both workers execute it. In a worse case, if later retries also time out, the task may be duplicated on even more workers.
  2. Worker A did not receive the task, so the task is not executed twice.
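To make the two situations concrete, here is a minimal simulation of the naive "retry on another worker after a timeout" loop. It is a hypothetical model, not DolphinScheduler code: `Worker`, `Behavior`, `dispatchWithRetry` and `copiesRunning` are illustrative names, and an RPC timeout is modeled simply as a `false` reply.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the naive retry behavior described above.
// None of these names are DolphinScheduler APIs.
public class NaiveDispatchDemo {

    enum Behavior { ACK, RECEIVED_THEN_TIMEOUT, TIMEOUT_NOT_RECEIVED }

    static final class Worker {
        final String name;
        final Behavior behavior;
        final List<String> runningTasks = new ArrayList<>();

        Worker(String name, Behavior behavior) {
            this.name = name;
            this.behavior = behavior;
        }

        /** Returns true only if the master sees an ack before its timeout. */
        boolean dispatch(String taskId) {
            if (behavior != Behavior.TIMEOUT_NOT_RECEIVED) {
                runningTasks.add(taskId); // the worker accepted the task and starts it
            }
            return behavior == Behavior.ACK;
        }
    }

    /** Naive master: on timeout, pick the next worker and retry. */
    static void dispatchWithRetry(String taskId, List<Worker> workers) {
        for (Worker w : workers) {
            if (w.dispatch(taskId)) {
                return; // acked, stop retrying
            }
            // Timeout: the master cannot tell whether w received the task.
        }
    }

    /** Counts how many workers ended up executing the task. */
    static int copiesRunning(String taskId, List<Worker> workers) {
        int n = 0;
        for (Worker w : workers) {
            if (w.runningTasks.contains(taskId)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Situation 1: worker A received the task, but its ack timed out.
        List<Worker> s1 = List.of(new Worker("A", Behavior.RECEIVED_THEN_TIMEOUT),
                                  new Worker("B", Behavior.ACK));
        dispatchWithRetry("task-a", s1);
        System.out.println("situation 1 copies: " + copiesRunning("task-a", s1)); // prints "situation 1 copies: 2"

        // Situation 2: worker A never received the task.
        List<Worker> s2 = List.of(new Worker("A", Behavior.TIMEOUT_NOT_RECEIVED),
                                  new Worker("B", Behavior.ACK));
        dispatchWithRetry("task-a", s2);
        System.out.println("situation 2 copies: " + copiesRunning("task-a", s2)); // prints "situation 2 copies: 1"
    }
}
```

With two workers that both receive the task and time out before a third worker acks, the same loop leaves three copies running, which is the worse case mentioned in situation 1.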

The first situation is unacceptable.

Design Detail

To solve this, the dispatch logic needs to change.

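The issue does not describe the new dispatch flow in text. Purely as an illustration, here is one possible way to remove the ambiguity, which is not necessarily what DSIP-69 adopts: make dispatch idempotent on the worker (keyed by task id) and, after a timeout, retry the *same* worker instead of a new one, moving on only when a worker explicitly rejects the task. All names (`Reply`, `Worker`, `dispatch`, `maxAttemptsPerWorker`) are hypothetical, and the "state unknown" outcome is assumed to be handled by some later reconciliation step that is not shown.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of one mitigation; not DolphinScheduler's actual design.
public class IdempotentDispatchDemo {

    enum Reply { ACK, REJECT, TIMEOUT }

    static final class Worker {
        final String name;
        final boolean rejects;   // e.g. overloaded, refuses new tasks outright
        int slowRepliesLeft;     // dispatches that time out before an ack gets through
        final Set<String> runningTasks = new HashSet<>();

        Worker(String name, boolean rejects, int slowRepliesLeft) {
            this.name = name;
            this.rejects = rejects;
            this.slowRepliesLeft = slowRepliesLeft;
        }

        /** Idempotent: re-dispatching a task id that is already running starts nothing new. */
        Reply dispatch(String taskId) {
            if (rejects) {
                return Reply.REJECT;
            }
            runningTasks.add(taskId); // Set.add ignores a duplicate id
            if (slowRepliesLeft > 0) {
                slowRepliesLeft--;
                return Reply.TIMEOUT;
            }
            return Reply.ACK;
        }
    }

    /**
     * Retries a timed-out dispatch on the same worker. Only an explicit REJECT,
     * which means that worker does not hold the task, allows trying the next worker.
     * Returns the worker name on ACK, or null if no worker confirmed the task.
     */
    static String dispatch(String taskId, List<Worker> workers, int maxAttemptsPerWorker) {
        for (Worker w : workers) {
            boolean rejected = false;
            for (int attempt = 0; attempt < maxAttemptsPerWorker; attempt++) {
                Reply r = w.dispatch(taskId);
                if (r == Reply.ACK) {
                    return w.name;
                }
                if (r == Reply.REJECT) {
                    rejected = true;
                    break;
                }
                // TIMEOUT: w may or may not hold the task, so retry w, not another worker.
            }
            if (!rejected) {
                return null; // still unknown after all retries: do not risk a second copy
            }
        }
        return null; // every worker rejected the task
    }

    public static void main(String[] args) {
        // Worker A accepts the task, but its first two acks time out.
        Worker a = new Worker("A", false, 2);
        Worker b = new Worker("B", false, 0);
        String chosen = dispatch("task-a", List.of(a, b), 3);
        System.out.println(chosen + " " + a.runningTasks.size() + " " + b.runningTasks.size()); // prints "A 1 0"
    }
}
```

The trade-off in this sketch is that a persistently slow worker leaves the task in an unknown state rather than risking a duplicate, so some separate mechanism (for example, asking the worker later whether it holds the task) would be needed to resolve that case.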

Compatibility, Deprecation, and Migration Plan

No response

Test Plan

No response
