cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com
Other
29.51k stars 3.7k forks source link

kv: apply log entries outside of raft state machine loop #94854

Open nvanbenschoten opened 1 year ago

nvanbenschoten commented 1 year ago

Extracted from #17500.

After #94165, raft log entry disk writes are asynchronous with respect to the raft state machine loop. However, the (non-durable) engine access for state machine application is still performed inline. The async storage writes interface permits us to extract all of this work onto a separate goroutine.

This would provide three benefits:

  1. faster state machine loop iteration => less interference between entries => lower latency
  2. larger apply batches => more efficient state machine application => higher throughput
  3. flexible scheduling permits deferred application on followers => bigger batches, see benefit 2

Jira issue: CRDB-23189

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

sumeerbhola commented 1 year ago

Given state machine application very rarely syncs, we could have the state machine application happen in the same goroutine that is appending to the range's raft log. The benefit of unifying this scheduling is that if quorum is achieved with a lag of L log entries, state machine application will also only lag by L log entries relative the raft log append. @irfansharif and I have been discussing that this would simplify admission control tracking (since we want to minimize forecasting of the consequences of state machine application).

nvanbenschoten commented 1 year ago

@irfansharif and I have been discussing that this would simplify admission control tracking (since we want to minimize forecasting of the consequences of state machine application).

How does this relate to the pacing of state machine application during recovery, when entries have already been committed by a quorum of replicas but a straggler replica is catching up using the log? Is the plan to pace log replication and state machine application together, or independently?

irfansharif commented 1 year ago

We’d want to start off with just pacing log replication. To replicate already committed entries to the straggling replica, the leader+leaseholder replica would wait for flow tokens. Flow tokens would only be returned by the node being caught up once replicated entries have been admitted on the remote node. Admission of work on that node would take into account the apply time effect (using the same linear modelling we use today to go from the size of the log entries AC is informed of, to get to the L0 growth it actually observes).

BTW, the flow control discussions were mostly motivated by the recovery case where it’s likelier to build up a large amount of unapplied state.