apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0
12.89k stars 4.63k forks

[DSIP-82][Master/Worker] Use FAILOVER_FINISH_NODES to avoid duplicate workflow/task when failover #16825

Closed ruanwenjun closed 2 days ago

ruanwenjun commented 2 days ago

Motivation

When a master or worker disconnects from the registry, it might reconnect later. For example, we use Curator to connect to ZooKeeper: if the session timeout is 120s, the server goes into the suspended state once the heartbeat has been failing for 80s, and then it tries to reconnect to another ZooKeeper node. If the reconnect succeeds, the server continues to work. But in this case, other servers might sometimes receive a disconnect event for the reconnecting server and fail it over, which leads to duplicate workflows/tasks.

We need to make sure that once any server has failed over a node, that node must shut down.

Design Detail

We introduce a FAILOVER_FINISH_NODES node in the registry. Each server uses its address plus its startup time as its identity. Once a server has been failed over, its identity is put under FAILOVER_FINISH_NODES, so if a server finds its own identity under FAILOVER_FINISH_NODES, it should shut down.
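
The idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual implementation: `InMemoryRegistry`, `ServerIdentity`, `FailoverGuard`, the `/failover-finish-nodes` path and the `address-startupTime` key format are all hypothetical names, and the in-memory set stands in for the real registry (e.g. ZooKeeper).

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the registry; not the DolphinScheduler Registry API.
class InMemoryRegistry {
    private final Set<String> keys = ConcurrentHashMap.newKeySet();

    void put(String key) {
        keys.add(key);
    }

    boolean exists(String key) {
        return keys.contains(key);
    }
}

// A server is identified by address + startup time, as described in the design.
final class ServerIdentity {
    // Assumed path; the issue only names the node FAILOVER_FINISH_NODES.
    static final String FAILOVER_FINISH_NODES = "/failover-finish-nodes";

    final String address;
    final long startupTimeMillis;

    ServerIdentity(String address, long startupTimeMillis) {
        this.address = address;
        this.startupTimeMillis = startupTimeMillis;
    }

    // Assumed key format: <address>-<startupTime> under FAILOVER_FINISH_NODES.
    String failoverFinishPath() {
        return FAILOVER_FINISH_NODES + "/" + address + "-" + startupTimeMillis;
    }
}

class FailoverGuard {
    private final InMemoryRegistry registry;

    FailoverGuard(InMemoryRegistry registry) {
        this.registry = registry;
    }

    // Called by the server that performed the failover, after it has taken over the work.
    void markFailoverFinished(ServerIdentity failedServer) {
        registry.put(failedServer.failoverFinishPath());
    }

    // Called by a server (e.g. after it reconnects) to check whether it has been failed over.
    boolean shouldShutDown(ServerIdentity self) {
        return registry.exists(self.failoverFinishPath());
    }
}
```

Because the startup time is part of the identity, a server that is restarted on the same address gets a new identity in this sketch, so an old entry under FAILOVER_FINISH_NODES would not stop the new process.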

Compatibility, Deprecation, and Migration Plan

No response

Test Plan

No response
