cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com
Other
29.51k stars 3.7k forks source link

raft: make the API type-safe #125693

Open pav-kv opened 3 weeks ago

pav-kv commented 3 weeks ago

The raft API combines all flows into one interface:

(NB: this uniformity is already not used in a few places, e.g. we have methods like ProposeConfChange, ApplyConfChange, ReportUnreachable, ReportSnapshot that fall out of this pattern; we’ll have more of these with Replication Admission Control)

The downsides of this API spread a long way into the CRDB code. We often use reasoning like:

There is a lot of unsafe switch-casing based on message types, and breaking the raft’s abstractions.

I think it would be cleaner if all exchanges with raft.RawNode were strongly typed methods, not messages, and the responsibility of packing/unpacking these to messages would be on the upstream code.

Raft is a collection of multiple semi-independent state machines. There is one for interaction with the local storage, one for interaction with every follower/peer (the MsgApp flows), and more. Each of these state machines can be “ready” at different times. Also, each of these state machines effectively has its own step/ready API and its own message types. For example:

This can translate to raft scheduler having one “ready” handler per sub-state-machine. This allows scheduling individual pieces of Ready processing independently. In many cases, it is cheaper than currently: today Ready is a disjunction of all individual ready-s; we have to compute this disjunction at all times, and handleRaftReady also needs to check every part of the Ready struct to be non-empty, so is also a disjunction of all Ready-s.

Benefits of breaking the API into flows:

Jira issue: CRDB-39559

pav-kv commented 3 weeks ago

Some analysis of the message types used today.

More than half the messages are "local", and can be replaced with type-safe methods:

// Control. Never pop up in Ready (except MsgProp).
MsgHup               = 0;
MsgBeat              = 1;
// MsgProp wrinkle: can be forwarded to leader, needs API support.
// Can be deprecated eventually when leader forwarding is not needed.
MsgProp              = 2;
MsgUnreachable       = 10;
MsgSnapStatus        = 11;
MsgCheckQuorum       = 12;
MsgTransferLeader    = 13;
MsgForgetLeader      = 23;

// Local storage. Special API with sequential semantics.
MsgStorageAppend     = 19;
MsgStorageAppendResp = 20;
MsgStorageApply      = 21;
MsgStorageApplyResp  = 22;

There are ~3.5 groups of messages that need to be sent remotely (and only they need to be serialized):

// Replication flow. Needs fine-grained control for Replication AC.
MsgApp               = 3;
MsgAppResp           = 4;
MsgSnap              = 7; // can be better integrated with MsgSnapStatus

// Election flow.
MsgVote              = 5;
MsgVoteResp          = 6;
MsgPreVote           = 17;
MsgPreVoteResp       = 18;
MsgFortify & co

// Heartbeats. Can be deprecated in favour of the "liveness" API,
// which can be integrated with MsgUnreachable.
MsgHeartbeat         = 8;
MsgHeartbeatResp     = 9;

// MsgTimeoutNow has limited use. Triggers campaign, stepped on a few occasions:
//  - local call to force campaign bypassing proper transfer
//  - on MsgTransferLeader (which is a local call) when the transferee is caught up
//  - on MsgAppResp from the transferee when it's caught up
MsgTimeoutNow        = 14;
blathers-crl[bot] commented 3 weeks ago

cc @cockroachdb/replication

pav-kv commented 2 weeks ago

Message types by role {Proposer, Acceptor, Learner}:

Proposer:

// local
MsgHup               = 0;
MsgBeat              = 1;
MsgProp              = 2; // can also be received
MsgUnreachable       = 10;
MsgSnapStatus        = 11;
MsgCheckQuorum       = 12;
MsgTransferLeader    = 13;
MsgTimeoutNow        = 14;

// sends
MsgVote              = 5;
MsgPreVote           = 17;
MsgApp               = 3;
MsgSnap              = 7; // should this be in (distinguished) Learner?
MsgHeartbeat         = 8;

// receives
MsgVoteResp          = 6;
MsgPreVoteResp       = 18;
MsgAppResp           = 4;
MsgHeartbeat         = 8; // prevents Proposer from campaigning
MsgHeartbeatResp     = 9; // responses from other Proposers/Acceptors

Acceptor:

// local
MsgForgetLeader      = 23;

// receives
MsgVote              = 5;
MsgPreVote           = 17;
MsgApp               = 3;
MsgHeartbeat         = 8;

// sends
MsgVoteResp          = 6;
MsgAppResp           = 4;
MsgPreVoteResp       = 18;
MsgHeartbeatResp     = 9;

// local storage
MsgStorageAppend     = 19;
MsgStorageAppendResp = 20;

Learner:

// receives
MsgApp               = 3;
MsgSnap              = 7;
// sends
MsgAppResp           = 4;

// local storage
MsgStorageApply      = 21;
MsgStorageApplyResp  = 22;