Open ahrtr opened 6 months ago
cc @pav-kv @erikgrinaker PTAL
I assume the motivation of this change comes from https://github.com/etcd-io/etcd/issues/17455. However, there might be a potential side effect if we don't reset election timeout.
When such a node does not reset its election timeout, it may quickly start a new campaign and makes the winning candidate step down to follower again before it becomes leader. It can be even worse if the node cannot win election (for example, it does not have sufficient connections to other nodes due to network issue). Then the node may keep stopping other candidates from becoming leader.
I assume the motivation of this change comes from etcd-io/etcd#17455.
Yes.
When such a node does not reset its election timeout, it may quickly start a new campaign and makes the winning candidate step down to follower again before it becomes leader.
This PR doesn't change the local node's frequency to campaign. The PR just prevents any node being affected by other unqualified candidate.
Prefer to the alternative PR https://github.com/etcd-io/raft/pull/169,
electionElapsed
.@ahrtr
In this PR, the local node's term has already increased by 1 even it rejects the Vote. So Conceptually it should reset the electionElapsed.
Why does it conceptually need to reset the timeout? The timeout behaviours are heuristics, so there is no right or wrong ways. We should consider behaviours and do some calculations to compare heuristics.
Do we need to reset electionTimeout
at all in situations when a non-leader state remains non-leader? E.g. when we BecomeFollower
, and we were already in StateFollower
, just with a lower term. Upd: I guess we do need to sometimes reset it, because of https://github.com/etcd-io/raft/pull/167#issuecomment-1955813036.
Upd: I guess we do need to sometimes reset it, because of #167 (comment).
YES, it's one of the reasons why I prefer to https://github.com/etcd-io/raft/pull/169. The other reason is #169 has better readability.
Why does it conceptually need to reset the timeout? The timeout behaviours are heuristics, so there is no right or wrong ways. We should consider behaviours and do some calculations to compare heuristics.
Do we need to reset
electionTimeout
at all in situations when a non-leader state remains non-leader?
All election time parameters (e.g. randomizedElectionTimeout
) are supposed to be bind with each term. If the term changes (increases), then we should reset all such time parameters. At least, this is current implementation. If we don't reset the value, and carry over to next term, we don't know what other side effect it may cause apart from https://github.com/etcd-io/raft/pull/167#issuecomment-1955813036. I agree with "there is no right or wrong ways
", but I tend not to take risk for now.