pav-kv opened 3 weeks ago
Some analysis of the message types used today.
More than half the messages are "local", and can be replaced with type-safe methods:
```
// Control. Never pop up in Ready (except MsgProp).
MsgHup = 0;
MsgBeat = 1;
// MsgProp wrinkle: can be forwarded to the leader, needs API support.
// Can be deprecated eventually when leader forwarding is not needed.
MsgProp = 2;
MsgUnreachable = 10;
MsgSnapStatus = 11;
MsgCheckQuorum = 12;
MsgTransferLeader = 13;
MsgForgetLeader = 23;

// Local storage. Special API with sequential semantics.
MsgStorageAppend = 19;
MsgStorageAppendResp = 20;
MsgStorageApply = 21;
MsgStorageApplyResp = 22;
```
There are ~3.5 groups of messages that need to be sent remotely (and only they need to be serialized):
```
// Replication flow. Needs fine-grained control for Replication AC.
MsgApp = 3;
MsgAppResp = 4;
MsgSnap = 7; // can be better integrated with MsgSnapStatus

// Election flow.
MsgVote = 5;
MsgVoteResp = 6;
MsgPreVote = 17;
MsgPreVoteResp = 18;
MsgFortify & co

// Heartbeats. Can be deprecated in favour of the "liveness" API,
// which can be integrated with MsgUnreachable.
MsgHeartbeat = 8;
MsgHeartbeatResp = 9;

// MsgTimeoutNow has limited use. It triggers a campaign, and is stepped on a few occasions:
// - local call to force a campaign, bypassing a proper transfer
// - on MsgTransferLeader (which is a local call) when the transferee is caught up
// - on MsgAppResp from the transferee when it's caught up
MsgTimeoutNow = 14;
```
cc @cockroachdb/replication
Message types by role {Proposer, Acceptor, Learner}:
```
Proposer:
// local
MsgHup = 0;
MsgBeat = 1;
MsgProp = 2; // can also be received
MsgUnreachable = 10;
MsgSnapStatus = 11;
MsgCheckQuorum = 12;
MsgTransferLeader = 13;
MsgTimeoutNow = 14;
// sends
MsgVote = 5;
MsgPreVote = 17;
MsgApp = 3;
MsgSnap = 7; // should this be in (distinguished) Learner?
MsgHeartbeat = 8;
// receives
MsgVoteResp = 6;
MsgPreVoteResp = 18;
MsgAppResp = 4;
MsgHeartbeat = 8; // prevents Proposer from campaigning
MsgHeartbeatResp = 9; // responses from other Proposers/Acceptors

Acceptor:
// local
MsgForgetLeader = 23;
// receives
MsgVote = 5;
MsgPreVote = 17;
MsgApp = 3;
MsgHeartbeat = 8;
// sends
MsgVoteResp = 6;
MsgAppResp = 4;
MsgPreVoteResp = 18;
MsgHeartbeatResp = 9;
// local storage
MsgStorageAppend = 19;
MsgStorageAppendResp = 20;

Learner:
// receives
MsgApp = 3;
MsgSnap = 7;
// sends
MsgAppResp = 4;
// local storage
MsgStorageApply = 21;
MsgStorageApplyResp = 22;
```
The `raft` API combines all flows into one interface. (NB: this uniformity is already not used in a few places, e.g. we have methods like `ProposeConfChange`, `ApplyConfChange`, `ReportUnreachable`, `ReportSnapshot` that fall out of this pattern; we'll have more of these with Replication Admission Control.)

The downsides of this API spread a long way into the CRDB code. We often use reasoning like:

- raft responds to a durable local append with a self-addressed `MsgAppResp`; from the `MsgAppResp`, we infer the durable log index.
- when we handle a `MsgStorageAppend`, it will contain `Responses` containing `MsgStorageAppendResp` and `MsgAppResp` to other nodes - we infer some information from that.
- when we step a `MsgProp`, raft will populate index/term in the `Entries`, so we can use this side effect to understand where this append landed in the log.

There is a lot of unsafe switch-casing based on message types, and breaking of raft's abstractions.
I think it would be cleaner if all exchanges with `raft.RawNode` were strongly typed methods, not messages, and the responsibility of packing/unpacking these into messages would be on the upstream code.

Raft is a collection of multiple semi-independent state machines. There is one for interaction with the local storage, one for interaction with every follower/peer (the `MsgApp` flows), and more. Each of these state machines can be "ready" at different times, and each effectively has its own step/ready API and its own message types. For example:

- Instead of the single `Step/Ready/Message` interface, we can have a `Ready`-style method per state machine that returns a concrete type encapsulating this state machine's updates, and a `Step`-style method with concrete types that puts the results back. For the local storage flow, `GetLogWrite` and `StepLogWriteResponse` methods would do.
- `HasReady` becomes a bitmask / descriptor of all the pieces that are ready. It is input into the scheduler.

This can translate to the raft scheduler having one "ready" handler per sub-state-machine, which allows scheduling individual pieces of `Ready` processing independently. In many cases, this is cheaper than what we do today: `Ready` is a disjunction of all the individual ready-s, so we have to compute this disjunction at all times, and `handleRaftReady` also needs to check every part of the `Ready` struct for non-emptiness, so it is also a disjunction of all the ready-s.

Benefits of breaking the API into flows:
- Individual flows (e.g. `MsgApp`) can be throttled by the upper layer and be subject to AC / memory budgeting etc.
- We can stop eagerly allocating `raftpb.Message`, and "compact" all the messages pertaining to one sub-state-machine, which often cancel each other out, into a single struct per ready iteration. For example, instead of generating a `MsgApp` eagerly on every append, we can compute one up-to-date append request at ready handling time; similarly, stale vote requests are not interesting - we only need to send the latest ones for the current `Term`.
- Only the transport layer would deal with `raftpb.Message`, and eventually the library can have no dependency on `protobuf` (which allows swapping to other wire protocols if we want to).
- Packing/unpacking messages would be confined to `RaftTransport`. All other code would be strongly typed with cleaner semantics.

Jira issue: CRDB-39559