Closed pav-kv closed 5 months ago
The
n.Status().SoftState.Lead
that it collects from all the nodes only represents votes
This statement needs checking. After a quick scan of the code, I see that it might be false. The lead
field is updated only when getting messages that only a leader can send.
But the problem is in these lines:
raft2024/03/08 13:35:12 INFO: 3 became leader at term 2
...
2024/03/08 13:35:12 raft.2: stop
...
2024/03/08 13:35:12 raft.3: stop
Node 3 is the leader, but it was stopped. However, the test code does not intend to stop the leader, which means we might have detected the leader incorrectly.
Hi @pav-kv , based on the log you provided, I think I may have figured out why it will stop a leader.
waitLeader
function returns is 0. (It's not expected, but possible, I'll explain why)waitLeader
,
for i, n := range ns {
lead := n.Status().SoftState.Lead
if lead != 0 {
l[lead] = struct{}{}
if n.id == lead {
lindex = i
}
}
}
if len(l) == 1 {
return lindex
}
- Let's say, i is 4 now, and the leader of node[4] has changed from 0 to 3 in step2
- So `l[lead] = struct{}{}` will be executed.
- But `n.id == lead` is false, lindex still equals 0.
- The iteration ends, and `len(l)` now equals 1.
- This function will return lindex, which is 0.
Another thing is, it is also possible that `n.Status()` could be blocked if the event loop in the `run` function does not handle the status channel in time, potentially allowing for the 4th iteration takes more time to execute, and give time for the Leader update. Meaning 3th iteration still don't have a Leader, while 4th iteration elected one.(Not sure about this)
Hi @pav-kv , I would like to submit a pull request to address this issue. My proposed solution involves using -1 as the initial value of 'lindex' to prevent confusion between 'not found' and 'first element'. The code would look something like this. Would this solution be acceptable to you?
func waitLeader(ns []*node) int {
var l map[uint64]struct{}
// Use -1 instead of 0 to avoid confusion between 'not found' and 'first element'.
var lindex = -1
for {
l = make(map[uint64]struct{})
for i, n := range ns {
lead := n.Status().SoftState.Lead
if lead != 0 {
// Set l[lead] if lead != 0 to detect a two-leader case. Since a follower can also have an old leader.
l[lead] = struct{}{}
// Update lindex when a node itself becomes a leader to avoid setting a follower's old leader as lindex.
if n.id == lead {
lindex = i
}
}
}
if len(l) == 1 && lindex != -1 {
return lindex
}
}
}
Hi @pav-kv , Whenever you get a chance, would you mind taking a look at this proposal? No pressure at all, I'd just really appreciate your input and ideas. Thanks!
TestRestart
failed (when run many times) at commit ed26e90 with the following log:Looking at the test implementation, I suspect the problem is in the waitLeader function. It waits for a wrong signal, and erroneously reports some non-leader node to be the leader. The
n.Status().SoftState.Lead
that it collects from all the nodes only represents votes, but the actual elected leader can end up different.A fix would be to wait for a node that is in
StateLeader
.