baidu / braft

An industrial-grade C++ implementation of RAFT consensus algorithm based on brpc, widely used inside Baidu to build highly-available distributed systems.
Apache License 2.0
3.99k stars 886 forks

Leader occasionally grants vote to a rebooted node. #365

Open zhangh43 opened 2 years ago

zhangh43 commented 2 years ago

I have a raft group ng0 consisting of three nodes: A (localhost:8001), B (localhost:8011), and C (localhost:8021). The timeline is as follows:

T1: A is elected leader of group ng0 at term 2.
T2: B is restarted.
T3: B starts a pre-vote for ng0, with log message: node ng0:127.0.0.1:8011 term 2 start pre_vote
T4: B receives a pre-vote grant from A, with log message: node ng0:127.0.0.1:8011:1 received PreVoteResponse from 127.0.0.1:8001:0 term 2 granted 1 rejected_by_lease 0 disrupted 1
T5: B starts a vote.
T6: B receives a vote grant from A: node ng0:127.0.0.1:8011:1 received RequestVoteResponse from 127.0.0.1:8001:0 term 3 granted 1 rejected_by_lease 0 disrupted 1
T7: B receives a vote rejection from C: node ng0:127.0.0.1:8011:1 received RequestVoteResponse from 127.0.0.1:8021:0 term 2 granted 0 rejected_by_lease 1 disrupted 0
T8: B becomes the leader of ng0 at term 3.

It is strange that the old leader A accepts the pre-vote and the vote from the rebooted node B. This happens only occasionally, but I want to know whether it is expected behavior or an intermittent bug in braft.

zhangh43 commented 2 years ago

Also note that C still treats A as the leader of ng0 and rejects B as the new leader. How can A give up its leadership on its own?

zhangh43 commented 2 years ago

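The braft documentation excerpt below shows that braft deliberately handles the case where a node coming back online disrupts the replication group. In my case above, however, the leader A still accepted the pre-vote and the vote from the restarted node B and stepped itself down. Is this problem related to the lease? The election timeout of my node A is 1000 ms.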

Symmetric network partitioning
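The original RAFT paper handles symmetric network partitioning as follows: after a node comes back online, the leader steps down when it receives a RequestVote request with a term higher than currentTerm. As a result, even if that node has already been removed via RemovePeer, it still interrupts the current lease and makes the replication group unavailable. This case can be given special handling: the leader does not accept RequestVote requests. Specifically:

For a node that belongs to the PeerSet, the leader steps down when it encounters the higher term in a retried AppendEntries.
For a node that does not belong to the PeerSet, the leader ignores it forever.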
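
The two rules in the documentation excerpt can be sketched roughly as follows. This is only an illustration of the described behavior, not braft's code: `LeaderSketch`, `Role`, and both handler names are hypothetical, and peers are plain strings.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>

// Hypothetical types for illustration; these are not braft's classes.
enum class Role { Leader, Follower };

struct LeaderSketch {
    Role role = Role::Leader;
    std::int64_t current_term = 0;
    std::set<std::string> peer_set;  // current members of the group

    // The leader does not accept RequestVote, whatever term it carries.
    // Returns whether the vote is granted. Only the leader role is modeled.
    bool on_request_vote(const std::string& candidate, std::int64_t term) const {
        assert(role == Role::Leader);
        (void)candidate;
        (void)term;
        return false;
    }

    // A higher term seen in an AppendEntries response makes the leader step
    // down, but only when the responder belongs to the PeerSet.
    void on_append_entries_response(const std::string& peer,
                                    std::int64_t peer_term) {
        if (role != Role::Leader) return;
        if (peer_set.count(peer) == 0) return;  // not in PeerSet: ignore
        if (peer_term > current_term) {
            current_term = peer_term;
            role = Role::Follower;
        }
    }
};
```

Under these assumptions a removed node can never disrupt the leader, while a member that has moved to a higher term still forces a step-down through the replication path.
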
PFZheng commented 2 years ago

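This is a bug. In the current code, a pre-vote uses the local term + 1 as its term, which leads to the case where node A's term == the pre-vote term. In braft this case passes the pre-vote, but it should actually be rejected.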
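
The term comparison described in this comment can be sketched in isolation. All names here are hypothetical, log and lease checks are left out, and the strict variant is only one possible way to reject the equal-term case, not necessarily the fix that was adopted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical request type; only the term is modeled.
struct PreVoteRequestSketch {
    std::int64_t term;
};

// The pre-vote carries the candidate's local term + 1.
inline PreVoteRequestSketch make_pre_vote(std::int64_t local_term) {
    assert(local_term >= 0);
    return PreVoteRequestSketch{local_term + 1};
}

// Lenient check: a request whose term equals the receiver's term passes.
// This is the case the comment describes as getting through.
inline bool term_check_lenient(std::int64_t receiver_term,
                               const PreVoteRequestSketch& req) {
    return req.term >= receiver_term;
}

// Stricter check (an assumption): the equal-term case is rejected.
inline bool term_check_strict(std::int64_t receiver_term,
                              const PreVoteRequestSketch& req) {
    return req.term > receiver_term;
}
```

For example, a candidate whose local term is 1 sends a pre-vote with term 2. A receiver already at term 2 passes it under the lenient check and rejects it under the strict one.
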

zhangh43 commented 2 years ago

Is there a plan to fix this in the next release? For example, add a condition when the leader checks a pre-vote: reject it if the leader lease is still valid?
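
The condition proposed here might look like the sketch below. It assumes the leader tracks the time of its last quorum acknowledgement and a lease duration; `LeaderLeaseSketch`, `PreVoteReplySketch`, and `leader_handle_pre_vote` are hypothetical names, and the time is passed in explicitly so the logic can be tested without a real clock.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical leader lease: valid for lease_duration after the last time a
// quorum acknowledged the leader.
struct LeaderLeaseSketch {
    Clock::time_point last_quorum_ack;
    std::chrono::milliseconds lease_duration;

    bool valid(Clock::time_point now) const {
        assert(lease_duration.count() > 0);
        return now - last_quorum_ack < lease_duration;
    }
};

// Mirrors the two flags visible in the log lines of this issue.
struct PreVoteReplySketch {
    bool granted;
    bool rejected_by_lease;
};

// Proposed condition: while the leader lease is valid, reject the pre-vote
// regardless of the other checks (summarized here as term_and_log_ok).
inline PreVoteReplySketch leader_handle_pre_vote(const LeaderLeaseSketch& lease,
                                                 Clock::time_point now,
                                                 bool term_and_log_ok) {
    if (lease.valid(now)) {
        return PreVoteReplySketch{false, true};
    }
    return PreVoteReplySketch{term_and_log_ok, false};
}
```

With a 1000 ms lease, a pre-vote arriving 200 ms after the last quorum acknowledgement would be rejected by lease, while one arriving after 1500 ms would fall through to the usual checks.
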

MrGuin commented 2 years ago

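> This is a bug. In the current code, a pre-vote uses the local term + 1 as its term, which leads to the case where node A's term == the pre-vote term. In braft this case passes the pre-vote, but it should actually be rejected.

Node A, as the leader, holds a lease. I think the key question is why node A did not reply with rejected_by_lease. From the code, each node rejects a pre-vote based on follower_lease, and follower_lease is renewed in handle_append_entries_request. So how is the leader's follower_lease updated? Does the leader also send append_entries to itself?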

zhangh43 commented 2 years ago

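Another question: under what conditions does node B start a pre-vote after restarting? The problem in this issue is random; in most cases B rejoins the raft group directly after a restart without starting a pre-vote.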

PFZheng commented 2 years ago

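> This is a bug. In the current code, a pre-vote uses the local term + 1 as its term, which leads to the case where node A's term == the pre-vote term. In braft this case passes the pre-vote, but it should actually be rejected.

> Node A, as the leader, holds a lease. I think the key question is why node A did not reply with rejected_by_lease. From the code, each node rejects a pre-vote based on follower_lease, and follower_lease is renewed in handle_append_entries_request. So how is the leader's follower_lease updated? Does the leader also send append_entries to itself?

The leader has no follower lease. braft allows a follower to contend for leadership; once it has won the leader's vote, the other followers can break their existing follower lease.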
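
A rough model of the follower lease discussed here, to show why a node that never receives AppendEntries (the leader) has nothing to reject with. `FollowerLeaseSketch` and its methods are hypothetical; in particular, calling `expire()` when a candidate is known to hold the leader's vote is an assumption about how the interruption works, not a description of braft's code.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical follower lease, renewed whenever AppendEntries arrives.
class FollowerLeaseSketch {
public:
    explicit FollowerLeaseSketch(std::chrono::milliseconds duration)
        : duration_(duration) {
        assert(duration.count() > 0);
    }

    // Called when an AppendEntries request from the leader is handled.
    void renew(Clock::time_point now) {
        renewed_ = true;
        last_renew_ = now;
    }

    // Assumed hook: drop the lease once the leader is known to be disrupted.
    void expire() { renewed_ = false; }

    // A vote is rejected by lease only while a renewed lease is still fresh.
    bool rejects_vote(Clock::time_point now) const {
        return renewed_ && now - last_renew_ < duration_;
    }

private:
    std::chrono::milliseconds duration_;
    bool renewed_ = false;
    Clock::time_point last_renew_{};
};
```

In this model C, which keeps receiving AppendEntries from A, rejects B's vote by lease, whereas A never renews a follower lease and so cannot answer with rejected_by_lease.
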

PFZheng commented 2 years ago

> Is there a plan to fix this in the next release? For example, add a condition when the leader checks a pre-vote: reject it if the leader lease is still valid?

I'll fix it.

PFZheng commented 2 years ago

This is probably a matter of timing: if the heartbeat the leader sends to B arrives first, B does not start a pre-vote.
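
The timing dependence can be illustrated with a simple election timer. This is a sketch under the assumption that a restarted node starts a pre-vote only when its election timer expires and that every heartbeat resets the timer; `ElectionTimerSketch` is a hypothetical name, and the fixed timeout is a simplification.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical election timer of a freshly restarted node.
class ElectionTimerSketch {
public:
    ElectionTimerSketch(Clock::time_point start,
                        std::chrono::milliseconds timeout)
        : timeout_(timeout), deadline_(start + timeout) {
        assert(timeout.count() > 0);
    }

    // A heartbeat from the leader pushes the deadline out again.
    void on_heartbeat(Clock::time_point now) { deadline_ = now + timeout_; }

    // The node starts a pre-vote only once the deadline has passed.
    bool should_start_pre_vote(Clock::time_point now) const {
        return now >= deadline_;
    }

private:
    std::chrono::milliseconds timeout_;
    Clock::time_point deadline_;
};
```

With a 1000 ms timeout, a heartbeat that arrives 300 ms after restart keeps the node from starting a pre-vote at the 1000 ms mark; if no heartbeat arrives in time, the timer expires and the pre-vote begins, which would explain why the issue shows up only occasionally.
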