- ABA problem prevention can be achieved with a more straightforward algorithm once we have "last accepted term" tracking [#122690].
- The need to prevent the ABA problem can be eliminated in the first place if we delegate the `unstable` log to the storage engine synchronously, and only assume that flush/fsync is asynchronous [#122438].
- We can remove various fields of `raftpb.Message` that are used only by this local protocol, to unclutter `Message` and make it smaller.
To support the new log write protocol, we are missing the notion of "leader term" - the term of the leader with whom our log is consistent. By raft invariants, all writes to, and acknowledgements from, log storage are ordered by `(leader term, index)`. Today, we approximate the "leader term" by using the last entry ID, but this complicates the protocol. The approximation is also hard to reuse for Replication Admission Control, because it requires remembering the unacknowledged entry IDs, and these are cleared from the `unstable` data structure as soon as the entries are written.
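To illustrate the `(leader term, index)` ordering, here is a minimal Go sketch. The `logMark` and `tracker` types and their methods are hypothetical names made up for this example; they are not the actual `raftLog`/`unstable` code. The sketch assumes the conservative rule that an acknowledgement issued under an older leader term is ignored, which is the case the optional fork tracking in the plan would relax.

```go
package main

import "fmt"

// logMark is a hypothetical type for this sketch: a position in the log
// as written under a particular leader term.
type logMark struct {
	term  uint64 // term of the leader that issued the write
	index uint64 // index of the last entry covered by the write
}

// less orders marks lexicographically by (leader term, index).
func (m logMark) less(o logMark) bool {
	return m.term < o.term || (m.term == o.term && m.index < o.index)
}

// tracker is a hypothetical stand-in for the state kept next to the
// unstable log: the latest write handed to storage and the latest mark
// known to be durable.
type tracker struct {
	inFlight logMark
	stable   logMark
}

// send records a write of entries up to index, issued under leader term.
func (tr *tracker) send(term, index uint64) {
	tr.inFlight = logMark{term: term, index: index}
}

// ack handles an acknowledgement from log storage. It reports whether
// the acknowledgement was accepted. Acks from an older leader term are
// ignored, because the entries they cover may have been overwritten.
func (tr *tracker) ack(m logMark) bool {
	if m.term != tr.inFlight.term {
		return false
	}
	if tr.stable.less(m) {
		tr.stable = m
	}
	return true
}

func main() {
	tr := &tracker{}
	tr.send(2, 5) // leader at term 2 appends up to index 5
	tr.send(3, 5) // leader at term 3 overwrites index 5
	tr.send(4, 5) // leader at term 4 rewrites index 5 again

	// The ack for the first write arrives late. Its index matches the
	// current log, but its leader term does not, so it is rejected.
	fmt.Println(tr.ack(logMark{term: 2, index: 5}))
	// The ack for the latest write is accepted.
	fmt.Println(tr.ack(logMark{term: 4, index: 5}))
	fmt.Println(tr.stable)
}
```

In this scenario the entry rewritten at index 5 under leader term 4 could carry the same entry term as the one first written under leader term 2, so comparing by last entry ID alone would not tell the late acknowledgement apart from the current one. Comparing by leader term does.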
The plan is to support "leader term" tracking:
- [x] track the "leader term" / "last accepted term": #122690
- [x] move the tracking to `raftLog`/`unstable` structures
  - [x] #126474
  - [x] #126475
  - [x] #126783
- [x] convert `MsgStorageAppend[Resp]` messages to use the leader term
  - [x] #127424
- [ ] optional: implement the fork tracking data structure to allow earlier acks when the leader term changes
- [ ] convert the `MsgStorageAppend[Resp]` message protocol to a type-safe/ergonomic `Ready`/`Response` API.
The improvements listed above target the async log storage protocol (https://github.com/etcd-io/raft/pull/8).
Epic: CRDB-37515

Jira issue: CRDB-38897