apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0

[DSIP-9][Feature][Server] Add new registry plugin based on raft #10874

Open zhuxt2015 opened 2 years ago

zhuxt2015 commented 2 years ago

Description

The role of zookeeper

  1. Stores service-related information for masters and workers: host IP, port, CPU load, etc.
  2. Master and worker health checks
  3. Master failover
  4. Distributed locks

Problems caused by zookeeper

  1. It increases the complexity of deployment, operation, and maintenance: alongside the DS cluster, a zookeeper cluster must also be deployed, and both clusters must be monitored and maintained at the same time.
  2. It increases the chance of errors: network jitter between the DS cluster and the zookeeper cluster can cause failures.

Advantages of removing zookeeper

  1. A simpler deployment architecture
  2. Lower maintenance cost
  3. No more DS errors caused by zookeeper cluster errors

Blueprint for removing zookeeper

[image]

  1. Follow Kafka's KRaft approach to implement leader election and data synchronization between the masters
  2. Worker servers register with the leader master and maintain a heartbeat with it
  3. The API server obtains cluster information from the leader master
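
Since the proposal frames raft as a new registry plugin, it would presumably implement the same registry abstraction the zookeeper plugin implements today, covering the four zookeeper roles listed above. A minimal sketch of what such an interface could look like (the method names are illustrative assumptions, not the actual DolphinScheduler SPI):

```java
// Illustrative registry SPI a raft-based plugin could implement alongside
// the existing zookeeper plugin; method names are assumptions.
public interface Registry extends AutoCloseable {

    // Store service info (host IP, port, CPU load); ephemeral entries
    // disappear when the owning node stops heartbeating.
    void put(String key, String value, boolean ephemeral);

    String get(String key);

    // Watch master/worker nodes for health checks and failover.
    void subscribe(String path, SubscribeListener listener);

    // Distributed lock, e.g. to ensure a single master runs failover.
    boolean acquireLock(String key);

    boolean releaseLock(String key);
}

interface SubscribeListener {
    void notify(String path, String data);
}
```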

Use case

No response

Related issues

#6680



github-actions[bot] commented 2 years ago

Thank you for your feedback. We have received your issue; please wait patiently for a reply.

ruanwenjun commented 2 years ago

Great feature. This has been discussed before; it needs a detailed design, such as how we store the log and how we handle split-brain. This is a good start.

zhuxt2015 commented 2 years ago

@ruanwenjun OK, I'll provide a detailed design later.

lgcareer commented 2 years ago

@zhuxt2015 Great feature. Could you please give more detail about the leader master and the other masters?

zhuxt2015 commented 2 years ago

Leader election

  1. The masters use the raft algorithm to elect a leader; raft guarantees that split-brain cannot occur (see this raft election demo).
  2. Raft requires more than half of the nodes to be alive for the cluster to keep running: 3 masters tolerate at most 1 failure, 5 masters tolerate at most 2, and so on (a short arithmetic sketch follows this comment).
  3. If fewer than half of the masters survive, all remaining masters shut themselves down.
  4. If the leader master dies, a new leader is elected from the remaining followers, and re-election completes within a few hundred milliseconds.

Monitoring and information storage

  1. Workers and followers send the heartbeat information previously written to zookeeper (HeartBeat.class) to the leader node via Netty, and the leader keeps it in memory.
  2. Followers periodically fetch master and worker information from the leader, keep it in memory, and serve as a hot standby for the leader.
  3. The API server can obtain monitoring information for masters and workers from any master node.
  4. If the leader does not receive heartbeat information from a worker or follower for more than a certain period, it moves that node to a dead list so fault tolerance can be handled.
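
To make the majority rule in item 2 concrete, here is a tiny, self-contained illustration (plain Java; no assumptions about the DS codebase):

```java
public class RaftQuorumMath {

    // A raft group of n voters needs a majority of (n / 2) + 1 alive,
    // so it tolerates floor((n - 1) / 2) failures.
    static int maxTolerableFailures(int voters) {
        return (voters - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            System.out.printf("%d masters tolerate %d failure(s)%n", n, maxTolerableFailures(n));
        }
    }
}
```
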
leo-lao commented 2 years ago

Hi @zhuxt2015, I am also interested in implementing this issue. Maybe I can join the discussion of the design and implementation.

zhuxt2015 commented 2 years ago

> Hi @zhuxt2015, I am also interested in implementing this issue. Maybe I can join the discussion of the design and implementation.

@leo-lao Great! Thank you for joining. I've completed most of the functionality and will submit a PR this week; then we can discuss how to divide up the remaining development work.

ruanwenjun commented 2 years ago

> Hi @zhuxt2015, I am also interested in implementing this issue. Maybe I can join the discussion of the design and implementation.
>
> @leo-lao Great! Thank you for joining. I've completed most of the functionality and will submit a PR this week; then we can discuss how to divide up the remaining development work.

Before submitting a PR, it's better to provide a detailed design. The current design is not enough: we need to consider how to persist data to disk, how to implement the lock, and how to maintain the data. Do we need to use a library, or will we implement raft ourselves?

zhuxt2015 commented 2 years ago

I will use the sofa-jraft library; here are the Github Repository and User Guide.

sofa-jraft introduction

SOFAJRaft is a production-grade Java implementation of the Raft consensus algorithm. SOFAJRaft is licensed under the Apache License 2.0; the third-party components it relies on are also licensed under the Apache License 2.0.

The core components are StateMachine and RheaKV.

StateMachine is the implementation of the user's core logic. It calls the onApply(Iterator) method to apply log entries that were submitted with Node#apply(task) to the business state machine.

RheaKV is a lightweight, distributed, embedded KV storage library, included in the JRaft project as a submodule.
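
For illustration, a minimal registry state machine on top of sofa-jraft might look roughly like this. This is a sketch only: RegistryStateMachine and its in-memory map are assumptions, not code from the proposal; only the StateMachineAdapter/Iterator API comes from sofa-jraft.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.core.StateMachineAdapter;

// Hypothetical state machine: every committed log entry (e.g. a serialized
// heartbeat) is applied to an in-memory view of the cluster.
public class RegistryStateMachine extends StateMachineAdapter {

    private final Map<String, byte[]> nodeInfo = new ConcurrentHashMap<>();

    @Override
    public void onApply(final Iterator iter) {
        while (iter.hasNext()) {
            final ByteBuffer data = iter.getData(); // payload from Node#apply(task)
            // Decoding is application-specific; here we just store raw bytes
            // under a placeholder key derived from the log index.
            final byte[] bytes = new byte[data.remaining()];
            data.get(bytes);
            nodeInfo.put("entry-" + iter.getIndex(), bytes);
            if (iter.done() != null) {
                iter.done().run(Status.OK()); // leader-side callback, if any
            }
            iter.next();
        }
    }
}
```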

Ephemeral Node

All node information is stored in the StateMachine's memory, and the StateMachine manages node registration and removal. When a new node joins the cluster, it sends heartbeat packets to the leader master; the node's last update time is recorded and synchronized to all masters. When a node goes down, the Ephemeral Node Refresh Thread scans the records in the StateMachine; if a record's last update time differs from the current time by more than a threshold, the node is removed and the removal is synchronized to the other masters.

[image]
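
A minimal sketch of such a refresh thread, assuming an in-memory map of last-heartbeat timestamps (the class name, scan interval, and 40-second timeout are all illustrative assumptions):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical "Ephemeral Node Refresh Thread": evicts nodes whose last
// heartbeat is older than a timeout.
public class EphemeralNodeRefresher {

    private static final long HEARTBEAT_TIMEOUT_MS = 40_000L;

    private final Map<String, Long> lastUpdateTime = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void onHeartbeat(String nodeAddress) {
        lastUpdateTime.put(nodeAddress, System.currentTimeMillis());
    }

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            // In the proposed design, each removal would also be replicated
            // to the follower masters through the raft log.
            lastUpdateTime.entrySet()
                    .removeIf(e -> now - e.getValue() > HEARTBEAT_TIMEOUT_MS);
        }, 10, 10, TimeUnit.SECONDS);
    }
}
```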

Subscribe/Notify

The Subscribe/Notify design follows the same pattern as the ephemeral node design: when the leader master's StateMachine detects a data change on a server, it triggers the subscribed listeners.

[image]
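
In code, the notify path could be as simple as the following sketch; the listener interface and class names are assumptions modeled on typical registry APIs, not the actual DolphinScheduler SPI:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical subscribe/notify registry: listeners register per path and
// are invoked when the state machine applies a change to that path.
public class SubscriptionManager {

    public interface DataListener {
        void onChange(String path, String newValue);
    }

    private final Map<String, List<DataListener>> listeners = new ConcurrentHashMap<>();

    public void subscribe(String path, DataListener listener) {
        listeners.computeIfAbsent(path, p -> new CopyOnWriteArrayList<>()).add(listener);
    }

    // Called by the leader's StateMachine after a change is committed.
    public void notifyChange(String path, String newValue) {
        listeners.getOrDefault(path, List.of())
                 .forEach(l -> l.onChange(path, newValue));
    }
}
```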

Global Lock

The global lock design follows the same pattern as the ephemeral node design: a KVStore inside the StateMachine stores the lock information. RheaKVStore stores the master servers' locks and clears expired locks.

[image]
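
RheaKV ships a lease-based distributed lock abstraction, so the plugin could lean on it directly. A rough sketch, assuming RheaKV's getDistributedLock API (the lock key and the 30-second lease are illustrative assumptions):

```java
import java.util.concurrent.TimeUnit;

import com.alipay.sofa.jraft.rhea.client.RheaKVStore;
import com.alipay.sofa.jraft.rhea.util.concurrent.DistributedLock;

// Sketch of guarding failover with RheaKV's distributed lock.
public class FailoverLockExample {

    public static void runWithLock(RheaKVStore kvStore, Runnable failoverTask) {
        DistributedLock<byte[]> lock =
                kvStore.getDistributedLock("/dolphinscheduler/lock/failover", 30, TimeUnit.SECONDS);
        if (lock.tryLock()) { // lease-based: an expired lock is reclaimed automatically
            try {
                failoverTask.run();
            } finally {
                lock.unlock();
            }
        }
    }
}
```
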
zhongjiajie commented 2 years ago

@zhuxt2015 Please follow the DSIP process (https://dolphinscheduler.apache.org/en-us/community/DSIP.html) to create a DSIP, thanks.

leo-lao commented 2 years ago

@zhuxt2015 I searched the available raft implementations and found that Apache Ratis may be a better choice. SOFAJRaft requires more dependencies, like

            <dependency>
                <groupId>com.alipay.sofa</groupId>
                <artifactId>bolt</artifactId>
                <version>${bolt.version}</version>
            </dependency>
            <dependency>
                <groupId>com.alipay.sofa</groupId>
                <artifactId>hessian</artifactId>
                <version>${hessian.version}</version>
            </dependency>

which is not so acceptable for an Apache project.

On the other hand, I found Apache Ratis used by Apache Ozone and Alluxio, which adds more credibility.

zhuxt2015 commented 2 years ago

@leo-lao com.alipay.sofa:bolt and com.alipay.sofa:hessian use the Apache License 2.0, so I think they can be used. I also noticed Ratis, but it does not implement a distributed lock, so I did not use it.

leo-lao commented 2 years ago

Is it a must to implement a distributed lock in dolphinscheduler? As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events. Is it OK if we just put that logic in the leader master?

ruanwenjun commented 2 years ago

> Is it a must to implement a distributed lock in dolphinscheduler? As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events. Is it OK if we just put that logic in the leader master?

This is not a good idea; you would need to introduce the leader role into the other registry plugins as well.

leo-lao commented 2 years ago

> Is it a must to implement a distributed lock in dolphinscheduler? As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events. Is it OK if we just put that logic in the leader master?
>
> This is not a good idea; you would need to introduce the leader role into the other registry plugins as well.

In fact no: with raft introduced, we will have a leader and followers in this system, so there is no need to rely on other registry plugins.

| method | details | pros | cons |
| --- | --- | --- | --- |
| raft with distributed lock | multiple masters handle tasks | no pressure on a single master | extra dependency/implementation of a distributed lock |
| raft without distributed lock | only the raft leader handles tasks | simple, popular approach (like Apache Ozone, Alibaba RemoteShuffleService) | pressure on a single master |

I feel both are OK.

ruanwenjun commented 2 years ago

> Is it a must to implement a distributed lock in dolphinscheduler? As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events. Is it OK if we just put that logic in the leader master?
>
> This is not a good idea; you would need to introduce the leader role into the other registry plugins as well.
>
> In fact no: with raft introduced, we will have a leader and followers in this system, so there is no need to rely on other registry plugins. (See the comparison table above.) I feel both are OK.

We need to reach a consensus that we are introducing raft just as a new registry plugin; this will not affect the existing plugins.

zhuxt2015 commented 2 years ago

@leo-lao For the time being, backward compatibility should be guaranteed: users can choose either the raft registry or the zookeeper registry.

leo-lao commented 2 years ago

Gotcha, then your idea works.