
kv: consistent follower reads with leaseholder coordination #72593

Open nvanbenschoten opened 2 years ago

nvanbenschoten commented 2 years ago

To date, follower reads (rfc/follower_reads.md, rfc/follower_reads_implementation.md, rfc/non_blocking_txns.md) have always been viewed first and foremost as a tool to minimize latency for read-only operations. By avoiding all communication with a range's leaseholder, follower reads can help a transaction avoid cross-region communication, dramatically reducing latency. However, in order to avoid any coordination with the leaseholder, follower reads trade off some utility — they either require reads to be stale or writes to be pushed into the future. This limits the places where they can be used.

This issue explores an extended form of "consistent" follower read that can be used in more situations than "stale" follower reads, but that still requires synchronous communication with the range's leaseholder (fixed-size with respect to the amount of data accessed), negating what we have traditionally viewed as the primary benefit of follower reads. It also explores the secondary benefits of follower reads that remain even when the leaseholder helps coordinate a read served off of a follower.

Motivations

1) Network costs in public clouds are expensive. They are also asymmetric, with pricing dependent on the source and destination of data transfer. For example, we see from EC2's data transfer pricing page that cross-region transfer costs between $0.01 and $0.02 per GB, cross-zone transfer costs $0.01 per GB, and intra-zone transfer is free. This asymmetric pricing provides a strong incentive to minimize the amount of data shipped across regions/zones, even if some communication across regions/zones is unavoidable. Many clients are closer (in data transfer cost terms) to a follower of a given range than to its leaseholder, which presents an opportunity for cost savings.

2) Load-based splitting and rebalancing can help spread out well-distributed load across a cluster of nodes. However, they cannot spread out hotspots that cannot be split into different ranges. For read-heavy hotspots, serving reads from follower replicas can provide a form of load balancing. This is true even if the leaseholder is contacted at some point to facilitate the read from the follower, as long as the follower is the one performing the expensive portion of the read (e.g. reading from disk, sending the result set to the client over the network, etc.).

3) (Stretch motivation) In the future, followers may store data in a different layout than leaseholders (e.g. column-oriented instead of row-oriented), which may be better suited for large analytical-style reads. Such a data organization would trade write performance for read performance, so it would be more appropriate for a follower, since followers can apply log entries at a slower cadence than leaseholders (e.g. batching hundreds of entries to apply at a time). Serving reads from follower replicas would allow these read-optimized followers to be used even for consistent reads.

High-Level Overview

The key idea here is that even if the leaseholder is contacted during a read, it doesn't need to be the one to serve the read's results. Instead, it can be contacted to do some light bookkeeping and then offload the heavy lifting to a follower replica that may be a better candidate to serve the data back to the client.

For the sake of this issue, let's pretend we introduced a new request type called EstablishResolvedTimestamp (a sibling to the QueryResolvedTimestamp request).

In response to an EstablishResolvedTimestamp request, the leaseholder would concern itself with concurrency control and with determining how far the follower needs to catch up on its Raft log before its state machine contains a fully resolved view of the specified span x timestamp segment of "keyspacetime". Morally, the leaseholder would be in charge of creating a resolved timestamp over the given key span at the given timestamp. So the API would look something like this:

type EstablishResolvedTimestampRequest struct{ Transaction, Timestamp, Span }
type EstablishResolvedTimestampResponse struct{ LeaseAppliedIndex }
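
For illustration, here is what that pair of messages might look like if fleshed out with concrete fields. Everything below (field names, types, package placement) is an assumption made for the sketch, not a concrete proto/API proposal:

```go
// Hypothetical sketch only; field names, types, and package placement are
// assumptions for illustration.
package sketch

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// EstablishResolvedTimestampRequest asks the leaseholder to create a resolved
// timestamp over Span at Timestamp without evaluating the read itself.
type EstablishResolvedTimestampRequest struct {
	Span      roachpb.Span         // key span the follower read will cover
	Timestamp hlc.Timestamp        // timestamp at which the read will be served
	Txn       *roachpb.Transaction // reading transaction, needed for deadlock detection
}

// EstablishResolvedTimestampResponse tells the caller how far a follower must
// catch up in its Raft log before it can serve the read consistently.
type EstablishResolvedTimestampResponse struct {
	// LeaseAppliedIndex is the index the follower must have applied before its
	// state machine contains a fully resolved view of Span x Timestamp.
	LeaseAppliedIndex uint64
}
```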

With this new API, followers can now be used to serve consistent follower reads. Any of the following approaches would work here, and each has its own benefits (a rough sketch of each flow follows its list of steps):

Follower-coordinated

  1. client issues scan/get to nearest follower replica with ts
  2. follower checks closed timestamp against ts, determines its closed timestamp is not high enough
  3. follower sends EstablishResolvedTimestampRequest to leaseholder
  4. leaseholder grabs latches, checks lock table, bumps timestamp cache over span, and notes current lease_applied_index
  5. leaseholder returns EstablishResolvedTimestampResponse with lease_applied_index
  6. follower waits to apply a log entry with lease_applied_index >= the one from the response
  7. follower serves the read and returns the result to the client
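
A minimal sketch of the follower-side logic for steps 2-7, assuming hypothetical `followerReplica`, `readRequest`, `readResult` types and `closedTimestamp`, `sendToLeaseholder`, `waitForAppliedIndex`, `evalRead` helpers (none of these are real CockroachDB APIs; imports and type scaffolding are omitted):

```go
// Follower-coordinated flow, follower side (steps 2-7 above).
func (r *followerReplica) serveConsistentRead(
	ctx context.Context, req readRequest,
) (readResult, error) {
	// Step 2: if the local closed timestamp already covers the read timestamp,
	// this is an ordinary follower read and no coordination is needed.
	if req.Timestamp.LessEq(r.closedTimestamp()) {
		return r.evalRead(ctx, req)
	}

	// Steps 3-5: ask the leaseholder to establish a resolved timestamp over the
	// span; it performs concurrency control and reports its lease applied index.
	resp, err := r.sendToLeaseholder(ctx, EstablishResolvedTimestampRequest{
		Span:      req.Span,
		Timestamp: req.Timestamp,
		Txn:       req.Txn,
	})
	if err != nil {
		return readResult{}, err
	}

	// Step 6: wait until the local state machine has applied an entry with
	// lease_applied_index >= the one returned by the leaseholder.
	if err := r.waitForAppliedIndex(ctx, resp.LeaseAppliedIndex); err != nil {
		return readResult{}, err
	}

	// Step 7: evaluate the read locally and return the result to the client.
	return r.evalRead(ctx, req)
}
```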

Benefits:

Client-coordinated

  1. client issues EstablishResolvedTimestampRequest to leaseholder
  2. leaseholder grabs latches, checks lock table, bumps timestamp cache over span, and notes current lease_applied_index
  3. leaseholder returns EstablishResolvedTimestampResponse with lease_applied_index
  4. client redirects to follower with lease_applied_index
  5. follower waits to apply a log entry with lease_applied_index >= the one from the response
  6. follower serves the read and returns the result to the client
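
From the client's perspective (e.g. the DistSender), the same idea might look roughly like the sketch below; `replicaClient`, `readRequest`, and the `MinLeaseAppliedIndex` field are all hypothetical placeholders, not existing APIs:

```go
// Client-coordinated flow (steps 1-6 above). All types and methods here are
// hypothetical placeholders for illustration.
func clientCoordinatedRead(
	ctx context.Context, leaseholder, follower replicaClient, req readRequest,
) (readResult, error) {
	// Steps 1-3: the leaseholder establishes the resolved timestamp (latches,
	// lock table, timestamp cache) and reports its current lease applied index.
	resp, err := leaseholder.EstablishResolvedTimestamp(ctx, EstablishResolvedTimestampRequest{
		Span:      req.Span,
		Timestamp: req.Timestamp,
		Txn:       req.Txn,
	})
	if err != nil {
		return readResult{}, err
	}

	// Steps 4-6: redirect to the nearest follower, attaching the index it must
	// reach before evaluating; the follower waits, serves the read, and returns.
	req.MinLeaseAppliedIndex = resp.LeaseAppliedIndex
	return follower.Read(ctx, req)
}
```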

Extended client-coordinated

  1. client issues scan/get to leaseholder replica with some establish_resolved_timestamp_on_large_result flag
  2. leaseholder grabs latches, checks lock table, bumps timestamp cache over span, evaluates
  3. leaseholder determines if result is small or large. If small, return. If large, return lease_applied_index
  4. client redirects to follower with lease_applied_index
  5. follower waits to apply a log entry with lease_applied_index >= the one from the response
  6. follower serves the read and returns the result to the client
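
The new piece in the extended flow is the leaseholder's size-based decision in step 3. A hedged sketch of that branch follows; the flag, the threshold, and every helper here are assumptions for illustration only:

```go
// Extended client-coordinated flow, leaseholder side (steps 2-3 above).
const largeResultThreshold = 1 << 20 // illustrative 1 MiB cutoff for "large"

func (r *leaseholderReplica) serveOrRedirect(
	ctx context.Context, req readRequest,
) (readResult, error) {
	// Step 2: concurrency control always happens on the leaseholder: grab
	// latches, check the lock table, bump the timestamp cache, then evaluate.
	res, err := r.evalRead(ctx, req)
	if err != nil {
		return readResult{}, err
	}

	// Step 3: small results are cheapest to return directly. Large results are
	// redirected so that a closer/cheaper follower ships the bytes to the client.
	if !req.EstablishResolvedTimestampOnLargeResult || res.Size() < largeResultThreshold {
		return res, nil
	}
	return readResult{
		RedirectToFollower: true,
		LeaseAppliedIndex:  r.currentLeaseAppliedIndex(),
	}, nil
}
```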

Benefits:

Additional unstructured notes:

- the Transaction is needed in EstablishResolvedTimestampRequest for deadlock detection
- if an EstablishResolvedTimestampRequest is scanning the entire range, it can also bump the closed timestamp
- the EstablishResolvedTimestampResponse could carry an observed timestamp to help avoid some uncertainty restarts
- uncertainty works as expected on the follower
  - it *does not* need to resolve up to the uncertainty interval
  - it knows that any causal predecessor will have been included in a log entry with <= lease_applied_index
- read-your-writes in a read-write txn works as expected on the follower
  - any intent writes will be flushed during the pipeline stall on the leaseholder and will have been included in a log entry with <= lease_applied_index
- how do limited scans play into this?
- how does an actor tail the log and wait for a lease_applied_index? (see the sketch after this list)
  - what if it sees a split? or a replica removal?
- when a follower is waiting to apply, is liveness guaranteed?
  - does it ever need to wake up the range from quiescence or ensure an active leaseholder?
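
On the last two bullets, one way to picture the wait (the `waitForAppliedIndex` helper assumed in the follower-coordinated sketch above) is a loop over the locally applied state with an explicit unquiesce step. Everything here is hypothetical; it only illustrates the liveness concern:

```go
// Hypothetical shape of the follower-side wait. If the range is quiesced or
// has no active leaseholder, the target index may never be applied locally,
// so the wait must also ensure new log entries can actually arrive.
func (r *followerReplica) waitForAppliedIndex(ctx context.Context, target uint64) error {
	// Wake the range from quiescence (and, implicitly, require an active
	// leaseholder) so the entry carrying `target` can be appended and applied.
	r.maybeUnquiesce()

	for {
		if r.appliedLeaseAppliedIndex() >= target {
			return nil
		}
		select {
		case <-r.appliedStateChanged():
			// An entry applied; re-check. A split or replica removal observed
			// here would need special handling, since the span may have moved.
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```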

Jira issue: CRDB-11223

Epic CRDB-14991

ajwerner commented 2 years ago

Do sql pods know their AZ and the AZ of sql nodes? I assume they know the latter, or at least have node descriptors that give them some info. Do we need to plumb more info into the sql pods to help facilitate this?

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

ajd12342 commented 11 months ago

Hi @nvanbenschoten ! I am Anuj Diwan, a Computer Science PhD student at UT Austin. I am part of a team along with @arjunrs1 (Arjun Somayazulu) and we're taking a graduate Distributed Systems course. For our course project, we are interested in contributing to CockroachDB. This issue is related to our course material. Could we work on this issue? Any pointers for us to get started would be appreciated as well.

Thanks and regards, Anuj.