pav-kv opened 3 weeks ago
Some analysis of the message types used today.
More than half the messages are "local", and can be replaced with type-safe methods:
```
// Control. Never pop up in Ready (except MsgProp).
MsgHup = 0;
MsgBeat = 1;
// MsgProp wrinkle: can be forwarded to the leader, needs API support.
// Can be deprecated eventually when leader forwarding is not needed.
MsgProp = 2;
MsgUnreachable = 10;
MsgSnapStatus = 11;
MsgCheckQuorum = 12;
MsgTransferLeader = 13;
MsgForgetLeader = 23;

// Local storage. Special API with sequential semantics.
MsgStorageAppend = 19;
MsgStorageAppendResp = 20;
MsgStorageApply = 21;
MsgStorageApplyResp = 22;
```
There are ~3.5 groups of messages that need to be sent remotely (and only they need to be serialized):
```
// Replication flow. Needs fine-grained control for Replication AC.
MsgApp = 3;
MsgAppResp = 4;
MsgSnap = 7; // can be better integrated with MsgSnapStatus

// Election flow.
MsgVote = 5;
MsgVoteResp = 6;
MsgPreVote = 17;
MsgPreVoteResp = 18;
MsgFortify & co

// Heartbeats. Can be deprecated in favour of the "liveness" API,
// which can be integrated with MsgUnreachable.
MsgHeartbeat = 8;
MsgHeartbeatResp = 9;

// MsgTimeoutNow has limited use. It triggers a campaign, and is stepped on a few occasions:
// - local call to force a campaign, bypassing a proper transfer
// - on MsgTransferLeader (which is a local call) when the transferee is caught up
// - on MsgAppResp from the transferee when it's caught up
MsgTimeoutNow = 14;
```
cc @cockroachdb/replication
Message types by role {Proposer, Acceptor, Learner}:
```
Proposer:
// local
MsgHup = 0;
MsgBeat = 1;
MsgProp = 2; // can also be received
MsgUnreachable = 10;
MsgSnapStatus = 11;
MsgCheckQuorum = 12;
MsgTransferLeader = 13;
MsgTimeoutNow = 14;
// sends
MsgVote = 5;
MsgPreVote = 17;
MsgApp = 3;
MsgSnap = 7; // should this be in (distinguished) Learner?
MsgHeartbeat = 8;
// receives
MsgVoteResp = 6;
MsgPreVoteResp = 18;
MsgAppResp = 4;
MsgHeartbeat = 8; // prevents Proposer from campaigning
MsgHeartbeatResp = 9; // responses from other Proposers/Acceptors

Acceptor:
// local
MsgForgetLeader = 23;
// receives
MsgVote = 5;
MsgPreVote = 17;
MsgApp = 3;
MsgHeartbeat = 8;
// sends
MsgVoteResp = 6;
MsgAppResp = 4;
MsgPreVoteResp = 18;
MsgHeartbeatResp = 9;
// local storage
MsgStorageAppend = 19;
MsgStorageAppendResp = 20;

Learner:
// receives
MsgApp = 3;
MsgSnap = 7;
// sends
MsgAppResp = 4;
// local storage
MsgStorageApply = 21;
MsgStorageApplyResp = 22;
```
The `raft` API combines all flows into one interface. (NB: this uniformity is already not used in a few places, e.g. we have methods like `ProposeConfChange`, `ApplyConfChange`, `ReportUnreachable`, `ReportSnapshot` that fall out of this pattern; we'll have more of these with Replication Admission Control.)

The downsides of this API spread a long way into the CRDB code. We often use reasoning like:

- raft responds to a durable local append with a self-addressed `MsgAppResp`; from the `MsgAppResp`, we infer the durable log index.
- when we handle a `MsgStorageAppend`, it will contain `Responses` containing `MsgStorageAppendResp` and `MsgAppResp` to other nodes - we infer some information from that.
- when we step a `MsgProp`, raft will populate index/term in the `Entries`, so we can use this side effect to understand where this append landed in the log.

There is a lot of unsafe switch-casing based on message types, and breaking of raft's abstractions.
I think it would be cleaner if all exchanges with `raft.RawNode` were strongly typed methods, not messages, and the responsibility of packing/unpacking these into messages would be on the upstream code.

Raft is a collection of multiple semi-independent state machines. There is one for interaction with the local storage, one for interaction with every follower/peer (the `MsgApp` flows), and more. Each of these state machines can be "ready" at different times, and each effectively has its own step/ready API and its own message types. For example:

- Instead of the single `Step/Ready/Message` interface, we can have a `Ready`-style method per state machine that returns a concrete type encapsulating this state machine's updates, and a `Step`-style method with concrete types that puts the results back. For the local storage flow, `GetLogWrite` and `StepLogWriteResponse` methods would do.
- `HasReady` becomes a bitmask / descriptor of all the pieces that are ready. It is input into the scheduler.

This can translate to the raft scheduler having one "ready" handler per sub-state-machine, which allows scheduling individual pieces of `Ready` processing independently. In many cases, this is cheaper than what we do today: `Ready` is a disjunction of all the individual ready-s, so we have to compute this disjunction at all times, and `handleRaftReady` also needs to check every part of the `Ready` struct for non-emptiness, so it is also a disjunction of all the ready-s.

Benefits of breaking the API into flows:
- Individual flows (e.g. `MsgApp`) can be throttled by the upper layer and be subject to AC / memory budgeting etc.
- We can stop eagerly allocating `raftpb.Message`, and "compact" all the messages pertaining to one sub-state-machine, which often cancel each other out, into a single struct per ready iteration. For example, instead of generating a `MsgApp` eagerly on every append, we can compute one up-to-date append request at ready handling time; similarly, stale vote requests are not interesting - we only need to send the latest ones for the current `Term`.
- Only the transport layer would deal with `raftpb.Message`, and eventually the library can have no dependency on `protobuf` (which allows swapping to other wire protocols if we want to).
- Packing/unpacking messages would be confined to `RaftTransport`. All other code would be strongly typed with cleaner semantics.

Jira issue: CRDB-39559