apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0
12.73k stars 4.58k forks

[Bug] [worker-server] server down, but task not killed; restarting worker-server will start new task #16454

Closed 13813586515 closed 1 month ago

13813586515 commented 1 month ago

Search before asking

What happened

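The current deployment is 1 master and 3 workers.

  1. Configured an st task through ds
  2. Stopped all three worker-servers
  3. Started the three worker-servers one by one

The following problem occurs:

  1. When the three worker-servers went down, the st task was not killed. The main reason is that ds only submits the task to st; the task itself is run by the st server. When a worker-server starts again, however, it finds that the previous task stopped unexpectedly and starts a new one. The original st task is now running twice, which exhausts CPU and memory, and ds shows two entries for the same task: one running and one in the "needs fault tolerance" state.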

What you expected to happen

ds's monitoring of st tasks is incomplete: when it launches a new st task, it does not check the st server for the task's actual running state.

How to reproduce

  1. Configure an st task through ds
  2. Stop all three worker-servers
  3. Start the three worker-servers one by one

Anything else

No response

Version

3.2.x

Are you willing to submit PR?

Code of Conduct

github-actions[bot] commented 1 month ago

Search before asking

What happened

The current deployment is 1 master and 3 workers.

  1. Configure the st task through ds
  2. Stop all three worker-servers
  3. Start the 3 worker-servers in sequence

The following problem occurs:

  1. When the three worker-servers went down, the st task was not killed. The main reason is that ds is only responsible for submitting tasks to st; the actual task execution is done by the st server. However, when a worker-server is started again, it finds that the previous task stopped unexpectedly and starts a new task. At this point the original st task is duplicated, CPU and memory fill up, and ds shows two identical tasks: one running and one in the "needs fault tolerance" state.

What you expected to happen

ds's task monitoring for st is not complete yet. When launching a new st task, it does not query the st server for the actual running status of the task.
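
The missing check could look roughly like the sketch below. It is only an illustration of the idea, not DolphinScheduler's actual failover code: `TaskRecord`, `decide_failover`, `query_state`, and the state and action names are all hypothetical, and it assumes ds stores the job id returned by the st server at submit time and that the st server can be asked for a job's state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class ExternalState(Enum):
    """Hypothetical job states reported by the external (st) server."""
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"
    UNKNOWN = "unknown"


class FailoverAction(Enum):
    """Hypothetical decisions a restarted worker could take."""
    REATTACH = "reattach"    # keep tracking the job that is still running
    MARK_DONE = "mark_done"  # record the result, do not resubmit
    RESUBMIT = "resubmit"    # start a new external job
    HOLD = "hold"            # state unclear: query again later, do not resubmit


@dataclass
class TaskRecord:
    task_id: int
    # Assumption: the id returned by the external server when the job was submitted.
    external_job_id: Optional[str]


def decide_failover(
    task: TaskRecord,
    query_state: Callable[[str], ExternalState],
) -> FailoverAction:
    """Decide what to do with a task found in the fault-tolerance state."""
    if task.external_job_id is None:
        # The job never reached the external server, so resubmitting is safe.
        return FailoverAction.RESUBMIT

    state = query_state(task.external_job_id)
    if state is ExternalState.RUNNING:
        return FailoverAction.REATTACH
    if state is ExternalState.FINISHED:
        return FailoverAction.MARK_DONE
    if state is ExternalState.FAILED:
        return FailoverAction.RESUBMIT
    # Resubmitting on an unknown state is what risks a duplicate job.
    return FailoverAction.HOLD
```

With a check of this kind, the restarted worker-server in the scenario above would reattach to the st job that is still running instead of submitting a second copy.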

How to reproduce

  1. Configure the st task through ds
  2. Stop all three worker-servers
  3. Start 3 worker-servers in sequence

Anything else

No response

Version

3.2.x

Are you willing to submit PR?

Code of Conduct

SbloodyS commented 1 month ago

Duplicate of #16442