apache / dolphinscheduler

Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code
https://dolphinscheduler.apache.org/
Apache License 2.0

[DSIP-9][Feature][Server] Add new registry plugin based on raft #10874

Open zhuxt2015 opened 2 years ago

zhuxt2015 commented 2 years ago

Description

The role of zookeeper

  1. Stores service-related information for masters and workers: host IP, port, CPU load, etc.
  2. Master and worker health checks
  3. Master failover
  4. Distributed locks

Problems caused by zookeeper

  1. It increases the complexity of deployment, operation, and maintenance: alongside the DS cluster, a zookeeper cluster must also be deployed, and both clusters must be monitored and maintained at the same time.
  2. It increases the chance of errors: network jitter between the DS cluster and the zookeeper cluster can cause failures.

Advantages of removing zookeeper

  1. A simpler deployment architecture
  2. Lower maintenance cost
  3. No more DS errors caused by zookeeper cluster errors

Blueprint for removing zookeeper

[image]

  1. Follow Kafka's KRaft approach to implement leader election and data synchronization between the masters
  2. Worker servers register with the leader master and maintain a heartbeat with it
  3. The API server obtains cluster information from the leader master
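
Since the proposal frames raft as a new registry plugin, it would presumably implement the same registry abstraction the zookeeper plugin implements today, covering the four zookeeper roles listed above. A minimal sketch of what such an interface could look like (the method names are illustrative assumptions, not the actual DolphinScheduler SPI):

```java
// Illustrative registry SPI a raft-based plugin could implement alongside
// the existing zookeeper plugin; method names are assumptions.
public interface Registry extends AutoCloseable {

    // Store service info (host IP, port, CPU load); ephemeral entries
    // disappear when the owning node stops heartbeating.
    void put(String key, String value, boolean ephemeral);

    String get(String key);

    // Watch master/worker nodes for health checks and failover.
    void subscribe(String path, SubscribeListener listener);

    // Distributed lock, e.g. to ensure a single master runs failover.
    boolean acquireLock(String key);

    boolean releaseLock(String key);
}

interface SubscribeListener {
    void notify(String path, String data);
}
```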

Use case

No response

Related issues

#6680



github-actions[bot] commented 2 years ago

Thank you for your feedback. We have received your issue; please wait patiently for a reply.

ruanwenjun commented 2 years ago

Great feature. This has been discussed before; it needs a detailed design, such as how we store the log and how we handle split-brain. This is a good start.

zhuxt2015 commented 2 years ago

@ruanwenjun OK, I'll provide a detailed design later.

lgcareer commented 2 years ago

@zhuxt2015 Great feature. Could you please give more detail about the leader master and the other masters?

zhuxt2015 commented 2 years ago

Leader election

  1. The masters use the raft algorithm to elect a leader; raft guarantees that split-brain cannot occur (see this raft election demo).
  2. Raft requires more than half of the nodes to be alive for the cluster to keep running: 3 masters tolerate at most 1 failure, 5 masters tolerate at most 2, and so on (a short arithmetic sketch follows this comment).
  3. If fewer than half of the masters survive, all remaining masters shut themselves down.
  4. If the leader master dies, a new leader is elected from the remaining followers, and re-election completes within a few hundred milliseconds.

Monitoring and information storage

  1. Workers and followers send the heartbeat information previously written to zookeeper (HeartBeat.class) to the leader node via Netty, and the leader keeps it in memory.
  2. Followers periodically fetch master and worker information from the leader, keep it in memory, and serve as a hot standby for the leader.
  3. The API server can obtain monitoring information for masters and workers from any master node.
  4. If the leader does not receive heartbeat information from a worker or follower for more than a certain period, it moves that node to a dead list so fault tolerance can be handled.
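
To make the majority rule in item 2 concrete, here is a tiny, self-contained illustration (plain Java; no assumptions about the DS codebase):

```java
public class RaftQuorumMath {

    // A raft group of n voters needs a majority of (n / 2) + 1 alive,
    // so it tolerates floor((n - 1) / 2) failures.
    static int maxTolerableFailures(int voters) {
        return (voters - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            System.out.printf("%d masters tolerate %d failure(s)%n", n, maxTolerableFailures(n));
        }
    }
}
```
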
leo-lao commented 2 years ago

Hi @zhuxt2015, I am also interested in implementing this issue. Maybe I can join the discussion of the design and implementation.

zhuxt2015 commented 2 years ago

> Hi @zhuxt2015, I am also interested in implementing this issue. Maybe I can join the discussion of the design and implementation.

@leo-lao Great! Thank you for joining. I've completed most of the functionality and will submit a PR this week; then we can discuss how to divide up the remaining development work.

ruanwenjun commented 2 years ago

> Hi @zhuxt2015, I am also interested in implementing this issue. Maybe I can join the discussion of the design and implementation.
>
> @leo-lao Great! Thank you for joining. I've completed most of the functionality and will submit a PR this week; then we can discuss how to divide up the remaining development work.

Before submitting a PR, it's better to provide a detailed design. The current design is not enough: we need to consider how to persist data to disk, how to implement the lock, and how to maintain the data. Do we need to use a library, or will we implement raft ourselves?

zhuxt2015 commented 2 years ago

I will use the sofa-jraft library; here are the Github Repository and User Guide.

sofa-jraft introduction

SOFAJRaft is a production-grade Java implementation of the Raft consensus algorithm. SOFAJRaft is licensed under the Apache License 2.0; the third-party components it relies on are also licensed under the Apache License 2.0.

The core components are StateMachine and RheaKV.

StateMachine is the implementation of the user's core logic. It calls the onApply(Iterator) method to apply log entries that were submitted with Node#apply(task) to the business state machine.

RheaKV is a lightweight, distributed, embedded KV storage library, included in the JRaft project as a submodule.
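
For illustration, a minimal registry state machine on top of sofa-jraft might look roughly like this. This is a sketch only: RegistryStateMachine and its in-memory map are assumptions, not code from the proposal; only the StateMachineAdapter/Iterator API comes from sofa-jraft.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.core.StateMachineAdapter;

// Hypothetical state machine: every committed log entry (e.g. a serialized
// heartbeat) is applied to an in-memory view of the cluster.
public class RegistryStateMachine extends StateMachineAdapter {

    private final Map<String, byte[]> nodeInfo = new ConcurrentHashMap<>();

    @Override
    public void onApply(final Iterator iter) {
        while (iter.hasNext()) {
            final ByteBuffer data = iter.getData(); // payload from Node#apply(task)
            // Decoding is application-specific; here we just store raw bytes
            // under a placeholder key derived from the log index.
            final byte[] bytes = new byte[data.remaining()];
            data.get(bytes);
            nodeInfo.put("entry-" + iter.getIndex(), bytes);
            if (iter.done() != null) {
                iter.done().run(Status.OK()); // leader-side callback, if any
            }
            iter.next();
        }
    }
}
```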

Ephemeral Node

All node information is stored in the StateMachine's memory, and the StateMachine manages node registration and removal. When a new node joins the cluster, it sends heartbeat packets to the leader master; the node's last update time is recorded and synchronized to all masters. When a node goes down, the Ephemeral Node Refresh Thread scans the records in the StateMachine; if a record's last update time differs from the current time by more than a threshold, the node is removed and the removal is synchronized to the other masters.

[image]
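
A minimal sketch of such a refresh thread, assuming an in-memory map of last-heartbeat timestamps (the class name, scan interval, and 40-second timeout are all illustrative assumptions):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical "Ephemeral Node Refresh Thread": evicts nodes whose last
// heartbeat is older than a timeout.
public class EphemeralNodeRefresher {

    private static final long HEARTBEAT_TIMEOUT_MS = 40_000L;

    private final Map<String, Long> lastUpdateTime = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void onHeartbeat(String nodeAddress) {
        lastUpdateTime.put(nodeAddress, System.currentTimeMillis());
    }

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            // In the proposed design, each removal would also be replicated
            // to the follower masters through the raft log.
            lastUpdateTime.entrySet()
                    .removeIf(e -> now - e.getValue() > HEARTBEAT_TIMEOUT_MS);
        }, 10, 10, TimeUnit.SECONDS);
    }
}
```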

Subscribe/Notify

The Subscribe/Notify design follows the same pattern as the ephemeral node design: when the leader master's StateMachine detects a data change on a server, it triggers the subscribed listeners.

[image]
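
In code, the notify path could be as simple as the following sketch; the listener interface and class names are assumptions modeled on typical registry APIs, not the actual DolphinScheduler SPI:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical subscribe/notify registry: listeners register per path and
// are invoked when the state machine applies a change to that path.
public class SubscriptionManager {

    public interface DataListener {
        void onChange(String path, String newValue);
    }

    private final Map<String, List<DataListener>> listeners = new ConcurrentHashMap<>();

    public void subscribe(String path, DataListener listener) {
        listeners.computeIfAbsent(path, p -> new CopyOnWriteArrayList<>()).add(listener);
    }

    // Called by the leader's StateMachine after a change is committed.
    public void notifyChange(String path, String newValue) {
        listeners.getOrDefault(path, List.of())
                 .forEach(l -> l.onChange(path, newValue));
    }
}
```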

Global Lock

The global lock design follows the same pattern as the ephemeral node design: a KVStore inside the StateMachine stores the lock information. RheaKVStore stores the master servers' locks and clears expired locks.

[image]
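
RheaKV ships a lease-based distributed lock abstraction, so the plugin could lean on it directly. A rough sketch, assuming RheaKV's getDistributedLock API (the lock key and the 30-second lease are illustrative assumptions):

```java
import java.util.concurrent.TimeUnit;

import com.alipay.sofa.jraft.rhea.client.RheaKVStore;
import com.alipay.sofa.jraft.rhea.util.concurrent.DistributedLock;

// Sketch of guarding failover with RheaKV's distributed lock.
public class FailoverLockExample {

    public static void runWithLock(RheaKVStore kvStore, Runnable failoverTask) {
        DistributedLock<byte[]> lock =
                kvStore.getDistributedLock("/dolphinscheduler/lock/failover", 30, TimeUnit.SECONDS);
        if (lock.tryLock()) { // lease-based: an expired lock is reclaimed automatically
            try {
                failoverTask.run();
            } finally {
                lock.unlock();
            }
        }
    }
}
```
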
zhongjiajie commented 2 years ago

@zhuxt2015 Please follow the DSIP process (https://dolphinscheduler.apache.org/en-us/community/DSIP.html) to create a DSIP, thanks.

leo-lao commented 2 years ago

@zhuxt2015 I searched the available raft implementations and found that Apache Ratis may be a better choice. SOFAJRaft requires more dependencies, like

            <dependency>
                <groupId>com.alipay.sofa</groupId>
                <artifactId>bolt</artifactId>
                <version>${bolt.version}</version>
            </dependency>
            <dependency>
                <groupId>com.alipay.sofa</groupId>
                <artifactId>hessian</artifactId>
                <version>${hessian.version}</version>
            </dependency>

which is not so acceptable for an Apache project.

On the other hand, I found Apache Ratis used by Apache Ozone and Alluxio, which adds more credibility.

zhuxt2015 commented 2 years ago

@leo-lao com.alipay.sofa:bolt and com.alipay.sofa:hessian use the Apache License 2.0, so I think they can be used. I also noticed Ratis, but it does not implement a distributed lock, so I did not use it.

leo-lao commented 2 years ago

Is it a must to implement a distributed lock in dolphinscheduler? As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events. Is it OK if we just put that logic in the leader master?

ruanwenjun commented 2 years ago

> Is it a must to implement a distributed lock in dolphinscheduler? As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events. Is it OK if we just put that logic in the leader master?

This is not a good idea; you would need to introduce the leader role into the other registry plugins as well.

leo-lao commented 2 years ago

> Is it a must to implement a distributed lock in dolphinscheduler? As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events. Is it OK if we just put that logic in the leader master?
>
> This is not a good idea; you would need to introduce the leader role into the other registry plugins as well.

In fact no: with raft introduced, we will have a leader and followers in this system, so there is no need to rely on other registry plugins.

| method | details | pros | cons |
| --- | --- | --- | --- |
| raft with distributed lock | multiple masters handle tasks | no pressure on a single master | extra dependency/implementation of a distributed lock |
| raft without distributed lock | only the raft leader handles tasks | simple, popular approach (like Apache Ozone, Alibaba RemoteShuffleService) | pressure on a single master |

I feel both are OK.

ruanwenjun commented 2 years ago

> Is it a must to implement a distributed lock in dolphinscheduler? As far as I know, the distributed lock in the current version is used to make sure only one master handles requests or failover events. Is it OK if we just put that logic in the leader master?
>
> This is not a good idea; you would need to introduce the leader role into the other registry plugins as well.
>
> In fact no: with raft introduced, we will have a leader and followers in this system, so there is no need to rely on other registry plugins. (See the comparison table above.) I feel both are OK.

We need to reach a consensus that we are introducing raft just as a new registry plugin; this will not affect the existing plugins.

zhuxt2015 commented 2 years ago

@leo-lao For the time being, backward compatibility should be guaranteed: users can choose either the raft registry or the zookeeper registry.

leo-lao commented 2 years ago

Gotcha, then your idea works.