Open nvanbenschoten opened 3 months ago
This is unfortunate, as it mutates the RawNode without holding the RaftMu, which is a pattern that we'd like to avoid.
As of today, mutating RawNode without raftMu seems like a desirable pattern - we want to be holding only mu when updating the RawNode in memory. The problem, though, is that stepping proposals into RawNode causes eager MsgApp construction for leader->follower replication flows, which potentially reads from raft.Storage (for the followers whose replication flow is lagging) - but we don't want IO under mu. See this doc. With #130955 and #128779, this IO should be eliminated (and postponed until the raftMu-holding Ready handler, or a similar RACv2 handler).
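To make the locking split concrete, here is a minimal sketch using the etcd-style RawNode API (CockroachDB uses its own raft fork; the replicaSketch type and its fields are hypothetical, not the actual Replica code). It buffers proposals under mu and only steps them into the RawNode under raftMu, where raft.Storage reads are acceptable:

```go
import (
	"sync"

	"go.etcd.io/raft/v3"
)

type replicaSketch struct {
	raftMu sync.Mutex // held for raft Ready handling; IO (raft.Storage reads) is acceptable here
	mu     sync.Mutex // protects in-memory state; no IO should happen underneath
	rn     *raft.RawNode

	pending [][]byte // buffered encoded proposals, a stand-in for the propBuf
}

// insertProposal buffers a proposal under mu only; it never touches the
// RawNode, so no MsgApp construction (and no raft.Storage read) can happen
// on this path.
func (r *replicaSketch) insertProposal(enc []byte) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pending = append(r.pending, enc)
}

// handleRaftReady flushes the buffered proposals into the RawNode under
// raftMu. On the leader, proposing may eagerly construct MsgApp messages and
// read entries from raft.Storage for lagging followers, which is fine here
// but would not be fine under mu.
func (r *replicaSketch) handleRaftReady() error {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()

	r.mu.Lock()
	pending := r.pending
	r.pending = nil
	r.mu.Unlock()

	for _, enc := range pending {
		if err := r.rn.Propose(enc); err != nil {
			return err
		}
	}
	// ...then process rn.Ready(): persist entries, send messages, advance.
	return nil
}
```

Once MsgApp construction becomes lazy (#130955, #128779), the stepping itself could safely move under mu, since the raft.Storage reads would be deferred to the Ready/RACv2 handlers.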
An alternative implementation would be to resize the buffer when it becomes full, instead of flushing.
Growing the proposal buffer is an option (and indeed would make the code cleaner), though it would be less needed with lazy MsgApp construction / "raft pull mode". There would be no material difference between buffering the writes vs. stepping each one into RawNode immediately - except maybe the benefits of an MPSC queue vs. exclusive mu locking.

The flip side of an ever-growing buffer is sensitivity to the risk that Ready flushing runs with a big delay. So we also need to make sure that there is higher-level protection against OOM here (partially achieved by the already existing quota pool, and by RACv2 in the future), and that the proposal batches flushed to raft (which might become bigger with this change) don't get rejected due to size policies (which we might want to disable/remove eventually).
If we do this, we'll probably want to remove propBufArrayMaxSize.
So having some upper bound on the buffer size might still be desirable even if we learn to grow the buffer instead of flushing - in the short term at least, while we make sure all the mechanisms/policies don't fight each other.
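For illustration, a minimal sketch of growing the array while retaining an upper bound, along the lines discussed above (the propBufArray/growLocked/maxSize names are hypothetical, not the actual propBuf code):

```go
// proposal is a stand-in for the already-allocated ProposalData.
type proposal struct{ encoded []byte }

// propBufArray is a hypothetical growable variant of the buffer, keeping an
// explicit upper bound so it cannot grow without limit while Ready flushing
// is delayed.
type propBufArray struct {
	entries []*proposal
	maxSize int
}

// growLocked doubles the backing array up to maxSize and reports whether it
// grew. It must run with exclusive access to the array (no concurrent writer
// still inserting into the old backing slice), which is where the locking
// rework mentioned in the issue comes in.
func (a *propBufArray) growLocked() bool {
	if len(a.entries) >= a.maxSize {
		// At the cap: fall back to flushing, or rely on higher-level
		// backpressure (quota pool, RACv2) rather than growing further.
		return false
	}
	newLen := 2 * len(a.entries)
	if newLen == 0 {
		newLen = 1
	}
	if newLen > a.maxSize {
		newLen = a.maxSize
	}
	grown := make([]*proposal, newLen)
	copy(grown, a.entries)
	a.entries = grown
	return true
}
```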
An alternative to the slice-based propbuf implementation would be an MPSC linked-list based queue, intrusive into the ProposalData struct. To append to the queue, we could atomically swap the new node (the ProposalData, which is already allocated) into the linked list. I have a prototype for this (after a clean-up that factors out a generic MPSC queue component from the propbuf, so that we can iterate on just this bit), and may get to measure the performance diff at some point.
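For reference, one possible shape of such an intrusive MPSC queue (a simplified sketch, not the prototype itself; the push here uses a CAS loop rather than an unconditional swap, and the drain detaches the whole list with a single Swap):

```go
import "sync/atomic"

// proposalData is a stand-in for the real ProposalData; the intrusive next
// pointer lives directly in the already-allocated proposal, so pushing does
// not allocate.
type proposalData struct {
	next atomic.Pointer[proposalData]
	// ... encoded command, context, etc.
}

// mpscQueue is multi-producer single-consumer: any goroutine may push, but
// only one goroutine (e.g. the raftMu-holding flush) drains.
type mpscQueue struct {
	head atomic.Pointer[proposalData] // most recently pushed node
}

// push links p in front of the current head; producers never block each
// other on a mutex.
func (q *mpscQueue) push(p *proposalData) {
	for {
		old := q.head.Load()
		p.next.Store(old)
		if q.head.CompareAndSwap(old, p) {
			return
		}
	}
}

// drain detaches the whole pending list with a single atomic Swap and
// reverses it so that proposals come out in insertion order. Only the single
// consumer may call this.
func (q *mpscQueue) drain() []*proposalData {
	n := q.head.Swap(nil)
	var out []*proposalData
	for ; n != nil; n = n.next.Load() {
		out = append(out, n)
	}
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}
```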
The Raft propBuf currently has logic to flush into the RawNode when it reaches its buffer capacity. This is unfortunate, as it mutates the RawNode without holding the RaftMu, which is a pattern that we'd like to avoid.

An alternative implementation would be to resize the buffer when it becomes full, instead of flushing. This would require a rework of some of the locking and blocking in propBuf.allocateIndex, which is complicated by the atomic access to allocatedIdx. Still, it seems possible and would likely leave the code in a cleaner state.

If we do this, we'll probably want to remove propBufArrayMaxSize.

Jira issue: CRDB-40943
Epic CRDB-39898
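For context on why the atomic allocatedIdx complicates resizing, here is a rough sketch of the fast-path shape (hypothetical code, not the actual propBuf): a writer reserves a slot with a single atomic increment and then fills it without holding the full mutex, so the backing array can only be replaced at a point where no writer still holds an index into the old array.

```go
import (
	"sync"
	"sync/atomic"
)

// pendingProposal is a stand-in for a buffered proposal entry.
type pendingProposal struct{}

type propBufSketch struct {
	allocatedIdx atomic.Int64       // next slot to hand out, advanced without a lock
	arr          []*pendingProposal // backing array, fixed-size in the current design
	fullMu       sync.Mutex         // slow-path coordination: flush today, grow with this change
}

// allocateIndex reserves the next slot. The fast path is a single atomic Add
// with no lock held, which is what makes resizing arr delicate: a concurrent
// writer may already hold an index into the old backing array.
func (b *propBufSketch) allocateIndex() (int, bool) {
	idx := int(b.allocatedIdx.Add(1)) - 1
	if idx < len(b.arr) {
		return idx, true
	}
	// Slow path: the buffer is full. Today this would flush into the RawNode;
	// with the proposed change it would grow arr under fullMu instead, after
	// making sure no writer still points at the old backing array.
	return 0, false
}
```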