etcd-io / raft

Raft library for maintaining a replicated state machine
Apache License 2.0

optimization: Leader log sampled handshake #150

Open pav-kv opened 7 months ago

pav-kv commented 7 months ago

Background: #144


At the moment, a raft node only accepts MsgApp log appends from the latest leader it knows about, i.e. when MsgApp.Term == raft.Term. This restriction could be relaxed, which would reduce message turnaround while the leader is changing.

The safety requirement is that a node never accepts entries that are not in the log of the raft.Term leader. If we can deduce that an entry is in that leader's log (before receiving a MsgApp directly from this leader, or by some other means), we can always safely accept it.

One way to achieve this:


A more general way to achieve this is:

The practical K would be 2 or 3, because leader changes are typically infrequent. The last 2 or 3 term changes cover a significant section of the log.
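To make the idea of sampling the last K term changes concrete, here is a minimal Go sketch. It is only one possible reading of the technique, not necessarily the exact scheme proposed in this issue. The `termSpan` type and the functions `sampleLastTerms` and `inLeaderLog` are hypothetical names that do not exist in etcd-io/raft, and the log is modelled as a plain slice of entry terms. The sketch relies on the Raft property that an entry is uniquely identified by its (index, term) pair.

```go
package main

import "fmt"

// termSpan records that the leader's log holds entries of term Term at
// indices First..Last inclusive. Hypothetical type, not part of etcd-io/raft.
type termSpan struct {
	Term        uint64
	First, Last uint64
}

// sampleLastTerms scans a log given as per-index entry terms (terms[0] is
// the term of the entry at index 1) and returns the spans of the last k
// distinct terms, oldest first.
func sampleLastTerms(terms []uint64, k int) []termSpan {
	if k < 0 {
		k = 0
	}
	var spans []termSpan
	for i, t := range terms {
		idx := uint64(i + 1)
		if n := len(spans); n > 0 && spans[n-1].Term == t {
			spans[n-1].Last = idx // same term continues, extend the span
			continue
		}
		spans = append(spans, termSpan{Term: t, First: idx, Last: idx})
	}
	if len(spans) > k {
		spans = spans[len(spans)-k:]
	}
	return spans
}

// inLeaderLog reports whether the entry (index, term) is provably in the
// sampled leader log. Entries of terms outside the sample are conservatively
// treated as unknown and yield false.
func inLeaderLog(sample []termSpan, index, term uint64) bool {
	for _, s := range sample {
		if s.Term == term {
			return s.First <= index && index <= s.Last
		}
	}
	return false
}

func main() {
	// Leader log: indices 1..3 at term 1, 4..5 at term 2, 6..8 at term 4.
	leaderLog := []uint64{1, 1, 1, 2, 2, 4, 4, 4}
	sample := sampleLastTerms(leaderLog, 2)

	fmt.Println(inLeaderLog(sample, 5, 2)) // true: (5, 2) lies in the term-2 span
	fmt.Println(inLeaderLog(sample, 6, 2)) // false: the leader has term 4 at index 6
	fmt.Println(inLeaderLog(sample, 2, 1)) // false: term 1 is outside the K=2 sample
}
```

With such a sample in hand, a follower could accept an entry carried by an older leader's MsgApp whenever `inLeaderLog` returns true, and fall back to today's behaviour otherwise.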

This sampling technique is equivalent to the fork point search that the leader does in the StateProbe state to establish the longest common prefix with the follower's log before transitioning it to the optimistic StateReplicate state.
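For reference, the result of a fork point search can be sketched locally as follows. This is a simplification under stated assumptions: in the library the search runs over message round trips between leader and follower, whereas here both logs are visible at once and are modelled as slices of entry terms. `commonPrefix` is a hypothetical helper, not an etcd-io/raft function. It relies on the Raft Log Matching property: if two logs hold an entry with the same index and term, they are identical up to that index.

```go
package main

import "fmt"

// commonPrefix returns the largest index at which both logs hold an entry
// with the same term, or 0 if there is none. Logs are per-index entry terms,
// with element 0 holding the term of the entry at index 1.
func commonPrefix(leader, follower []uint64) uint64 {
	n := len(leader)
	if len(follower) < n {
		n = len(follower)
	}
	// Walk back from the shorter log's end; the first match is the fork point.
	for i := n; i > 0; i-- {
		if leader[i-1] == follower[i-1] {
			return uint64(i)
		}
	}
	return 0
}

func main() {
	leader := []uint64{1, 1, 1, 2, 2, 4, 4, 4}
	follower := []uint64{1, 1, 1, 2, 2, 2, 3}
	fmt.Println(commonPrefix(leader, follower)) // 5
}
```

In this example the follower's entries at indices 6 and 7 diverge from the leader's log, so replication would resume from index 6.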

This gives significant benefits:

This technique will minimize cluster disruption / slowdown during election, and reduce tail replication/commit latency in some cases.

joshuazh-x commented 7 months ago

Some concerns we may need to consider:

pav-kv commented 7 months ago

@joshuazh-x There is no need for the old leader to continue replication once it learns there is a new leader. Any message duplication in this proposal is already possible today (in cases where a leadership change races with replication, and/or there are connection issues).

  • Both new and old leaders may replicate entries to a follower until its log catches up with the longest common prefix of the two leader logs. Half of the entry payloads would be wasted.

I do not expect this to happen in normal operation, because the old leader will be notified about the existence of the new leader, and step down. The only difference is that, with this proposal, the last few append messages that the old leader has sent may have been [partially] accepted into the follower's log rather than outright rejected.

  • When would the old leader stop replicating entries to followers? A follower can respond with a specific flag when its log goes beyond the last fork point.

When it learns the new term. For example, this will happen when a MsgAppResp with the new leader's Term arrives. We don't want the old leader to continue replicating in parallel with the new leader, so we don't need to send it any hints/forks.
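The step-down rule described here is standard Raft behaviour and can be sketched in a few lines of Go. The `node` type and its `observe` method are hypothetical and heavily simplified; they are not the etcd-io/raft API and only illustrate that any message carrying a higher Term makes the old leader stop acting as leader.

```go
package main

import "fmt"

type role int

const (
	follower role = iota
	leader
)

// node is a hypothetical, stripped-down raft node for illustration only.
type node struct {
	term uint64
	role role
}

// observe processes the Term carried by any incoming message (for example a
// MsgAppResp). It returns true if the node was a leader and stepped down.
func (n *node) observe(msgTerm uint64) bool {
	if msgTerm <= n.term {
		return false // nothing new learned, keep the current role
	}
	wasLeader := n.role == leader
	n.term = msgTerm
	n.role = follower // a higher term exists, so stop replicating
	return wasLeader
}

func main() {
	old := &node{term: 5, role: leader}
	fmt.Println(old.observe(5)) // false: same term, still the leader
	fmt.Println(old.observe(6)) // true: learned term 6, stepped down
}
```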

  • But what if there are unreachable followers?

Same thing will happen as today. Any leader will try to probe unreachable followers. The old leader will stop doing so and step down when it learns about the new leader, or things like CheckQuorum kick in.

  • When would the old leader step down to follower and stop receiving client requests?

The moment it learns about the new leader's Term (plus some other conditions), same as today.