cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv: grow propBufArray on full, instead of flushing #128313

Open nvanbenschoten opened 3 months ago

nvanbenschoten commented 3 months ago

The Raft propBuf currently has logic to flush into the RawNode when it reaches its buffer capacity. This is unfortunate, as it mutates the RawNode without holding the RaftMu, which is a pattern that we'd like to avoid.

An alternative implementation would be to resize the buffer when it becomes full, instead of flushing. This would require a rework of some of the locking and blocking in propBuf.allocateIndex, which is complicated by the atomic access to allocatedIdx. Still, it seems possible and would likely leave the code in a cleaner state.
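To make the idea concrete, here is a minimal sketch of what a grow-on-full buffer could look like. This is not the actual CockroachDB propBuf: the real buffer reserves slots via an atomic index (allocatedIdx), which is exactly what makes growing tricky; the sketch below sidesteps that by serializing inserts with a plain mutex, purely to show the shape of the change. All names (propBuf, insert, flush) are illustrative.

```go
// Minimal sketch of "grow on full" for a proposal buffer. NOT the real
// propBuf: the real one hands out slots via an atomic index, which is what
// makes growing hard; this sketch serializes with a mutex just to show the
// shape of the change.
package propbufsketch

import "sync"

type proposal struct {
	encoded []byte // stand-in for the real ProposalData payload
}

type propBuf struct {
	mu  sync.Mutex
	cnt int         // number of buffered proposals
	arr []*proposal // grows geometrically; never flushed from insert
}

// insert buffers a proposal. On a full array it grows in place rather than
// stepping buffered proposals into the RawNode, so the RawNode is only
// touched later, under raftMu, by the flush path.
func (b *propBuf) insert(p *proposal) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.cnt == len(b.arr) {
		newLen := 2 * len(b.arr)
		if newLen == 0 {
			newLen = 4
		}
		bigger := make([]*proposal, newLen)
		copy(bigger, b.arr)
		b.arr = bigger
	}
	b.arr[b.cnt] = p
	b.cnt++
}

// flush hands all buffered proposals to the caller (which would step them
// into the RawNode while holding raftMu) and resets the buffer.
func (b *propBuf) flush() []*proposal {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.arr[:b.cnt:b.cnt]
	b.arr = make([]*proposal, len(b.arr))
	b.cnt = 0
	return out
}
```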

If we do this, we'll probably want to remove propBufArrayMaxSize.

Jira issue: CRDB-40943

Epic CRDB-39898

pav-kv commented 1 month ago

> This is unfortunate, as it mutates the RawNode without holding the RaftMu, which is a pattern that we'd like to avoid.

As of today, mutating RawNode without raftMu seems like a desirable pattern - we want to be holding only mu when updating the RawNode in memory. The problem, though, is that stepping proposals into RawNode causes eager MsgApp construction for leader->follower replication flows, which can read from raft.Storage (for followers whose replication flow is lagging) - but we don't want IO under mu. See this doc.

With #130955 and #128779, this IO should be eliminated (and postponed until the raftMu-holding Ready handler, or a similar RACv2 handler).
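For illustration, here is a hedged sketch of the locking split being described: cheap in-memory RawNode mutations happen under mu, while the IO-bearing Ready handling happens later under raftMu. This is not CockroachDB code; the replicaSketch type and its methods are made up, and it uses the upstream go.etcd.io/raft/v3 API.

```go
// Hedged sketch of the mu/raftMu split, using the upstream etcd raft API.
// Not CockroachDB code; replicaSketch and its methods are illustrative.
package raftmusketch

import (
	"sync"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

type replicaSketch struct {
	raftMu sync.Mutex // held across IO: Ready handling, log/storage access
	mu     sync.Mutex // held for cheap in-memory RawNode mutations
	rn     *raft.RawNode
}

// stepProposal mutates the RawNode in memory only. With lazy MsgApp
// construction ("raft pull mode"), Propose does not need to read
// raft.Storage, so holding just mu is acceptable.
func (r *replicaSketch) stepProposal(data []byte) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.rn.Propose(data)
}

// handleReady does the IO-bearing part under raftMu: persisting entries and
// sending messages (where MsgApps may be built from storage), then Advance.
func (r *replicaSketch) handleReady(
	persist func(raftpb.HardState, []raftpb.Entry) error,
	send func([]raftpb.Message),
) error {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()

	r.mu.Lock()
	if !r.rn.HasReady() {
		r.mu.Unlock()
		return nil
	}
	rd := r.rn.Ready()
	r.mu.Unlock()

	// IO happens outside mu but under raftMu.
	if err := persist(rd.HardState, rd.Entries); err != nil {
		return err
	}
	send(rd.Messages)

	r.mu.Lock()
	r.rn.Advance(rd)
	r.mu.Unlock()
	return nil
}
```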

> An alternative implementation would be to resize the buffer when it becomes full, instead of flushing.

Growing the proposal buffer is an option (and indeed would make the code cleaner), though it becomes less necessary with lazy MsgApp construction / "raft pull mode": there would be no material difference between buffering the writes vs stepping each one into RawNode immediately, except maybe the benefits of an MPSC queue vs exclusive mu locking.

The flip side of an ever-growing buffer is that it is sensitive to Ready flushing running with a big delay. So we also need to make sure that there is higher-level protection against OOM here (partially achieved by the already existing quota pool, and by RACv2 in the future), and that the proposal batches flushed to raft (which might become bigger with this change) don't get rejected due to size policies (which we might want to disable/remove eventually).

> If we do this, we'll probably want to remove propBufArrayMaxSize.

So some upper bound on the buffer size might still be desirable even if we learn to grow it instead of flushing, at least in the short term, while we make sure all of these mechanisms / policies don't fight each other.


An alternative to the slice-based propBuf implementation would be an MPSC linked-list based queue, intrusive into the ProposalData struct. To append to the queue, we could atomically swap the new node (the already-allocated ProposalData) into the linked list. I have a prototype for this (after a clean-up that factors a generic MPSC queue component out of propBuf, so that we can iterate on just this bit), and may get to measure the performance diff at some point.
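For illustration, here is a hedged sketch of a Vyukov-style intrusive MPSC queue along those lines. This is not the prototype mentioned above; all names are made up. The link pointer lives inside the proposal struct ("intrusive"), so enqueueing allocates nothing: producers publish with a single atomic swap of the tail pointer, and the lone consumer (e.g. the flush under raftMu) walks the links.

```go
// Hedged sketch of a Vyukov-style intrusive MPSC queue. Not the prototype
// mentioned above; names are made up.
package mpscsketch

import "sync/atomic"

// proposalData stands in for kvserver.ProposalData, with the queue link
// embedded directly in the struct.
type proposalData struct {
	next atomic.Pointer[proposalData]
	// ... command payload would live here ...
}

type mpscQueue struct {
	head atomic.Pointer[proposalData] // consumer end
	tail atomic.Pointer[proposalData] // producer end
	stub proposalData                 // sentinel node
}

func newMPSCQueue() *mpscQueue {
	q := &mpscQueue{}
	q.head.Store(&q.stub)
	q.tail.Store(&q.stub)
	return q
}

// push is safe for many concurrent producers: the only contended operation is
// the atomic Swap of tail, after which the previous tail is linked forward.
func (q *mpscQueue) push(p *proposalData) {
	p.next.Store(nil)
	prev := q.tail.Swap(p)
	prev.next.Store(p)
}

// pop must only be called by a single consumer. It returns nil when the queue
// is empty, or when a producer has swapped tail but not yet linked its node;
// the consumer just retries on the next flush.
func (q *mpscQueue) pop() *proposalData {
	head := q.head.Load()
	next := head.next.Load()
	if head == &q.stub {
		if next == nil {
			return nil
		}
		q.head.Store(next)
		head = next
		next = head.next.Load()
	}
	if next != nil {
		q.head.Store(next)
		return head
	}
	if q.tail.Load() != head {
		return nil // a push is in flight; its link will appear shortly
	}
	// head is the last real node: park the stub behind it, then pop head.
	q.push(&q.stub)
	next = head.next.Load()
	if next != nil {
		q.head.Store(next)
		return head
	}
	return nil
}
```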