storage: separated lock-table keyspace

cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.

https://www.cockroachlabs.com

Other

30k stars 3.79k forks source link

storage: separated lock-table keyspace #41720

Closed nvanbenschoten closed 2 years ago

nvanbenschoten commented 4 years ago

Summary

Move lock portion of write intents to a separate lock-table keyspace instead of storing it inline with MVCC data.

Motivation

Write intents are currently made up of an MVCCMetadata key and a provisional value key. Both are stored inline in the global MVCC keyspace, with the former stored at timestamp 0 of the write intent's key and the latter stored at the transaction's provisional commit timestamp at the time of writing the record.

Here's what this looks like visually for an intent at key "b":

/Table/"a"/123: {val1}
/Table/"b"/0:   {txn: txn1, ts: 999}
/Table/"b"/999: {val2}
/Table/"b"/500: {val3}
/Table/"c"/700: {val4}

We can think of the information stored in the MVCCMetadata key as a "write lock" and the information stored in the provisional value key as the "value under lock".

Storing the provisional value where it likely will remain after the transaction has committed is desirable because we only need to write it once. However, storing the write lock inline with the MVCC data is less of a clear win. Here are the only benefits I can think of for doing so:

can discover locks and read data without needing to read from two different iterators.
... that's it?

I propose we move it to a separate "lock-table" keyspace for all of the following reasons:

simplifies MVCC logic significantly. A scan can check for conflicts first before looking at data. After this, scanning the data keyspace doesn't need to be aware of transaction isolation at all. We shouldn't even need to push transaction information into MVCC methods, just the read/write timestamp. This complexity has bitten us in the past and continue to (see time-bound iterator complications)
avoids cluttering the MVCC keyspace with RocksDB deletion tombstones at timestamp zero of each key. This slows down scans
opens up the possibility to "blind-write" into the MVCC keyspace (with cooperation from the write timestamp cache if we need to maintain single-key linearizability). This would be a significant win for write-heavy workloads into large LSMs
opens up the possibility to store multiple write intents on a single key (see #5861)
makes it significantly cleaner to store point and ranged read locks
makes it easier to ignore closed timestamp/timestamp cache when a transaction is updating its own intent (see https://github.com/cockroachdb/cockroach/issues/36478#issuecomment-521903404)
makes it easier to augment the durable lock-table with a cache of known committed transactions, which might allow us to resolve intents (release locks) without holding latches. This would reduce the contention footprint of transactions (see #22349)
makes optimizations like using SingleDelete when resolving intents easier to make in isolation (see #8979)
makes scanning the intents on a Range cheap. Multiple other proposals have wanted this access pattern to be improved
@bdarnell believes this would be good for filling up SSTs in the local keyspace

In addition to all of these benefits, I don't think the one downside (multiple storage engine scans) is as bad as it sounds. The lock table's RocksDB block will likely always remain in cache, so it's unlikely that this will have a noticeable impact on performance. In fact, the speedup due to the simplification of MVCC code (not to mention the other optimizations this allows) will likely outweigh the overhead of separated lock checking.

Strawman Proposal

Remove Txn and IntentHistory fields from MVCCMetadata. Remove isolation concerns from MVCC logic.
Create new range-local lock-table keyspace. Store locks at keys with the structure: /Local/<key>/lock/<txnID>. This ends up being in-line with transaction record keys, which isn't really a problem. It hints at the equivalence between these two concepts.
Adjust read and write path to check lock-table after acquiring latches. The spans declared for latches would probably be the right input to check for conflicting locks.
Address one or more of the improvements listed above that can be made with this new structure.

This will require a fairly serious migration, but it seems like an important prereq for a number of desirable projects.

gz#9005

petermattis commented 4 years ago

This complexity has bitten us in the past and continue to (see time-bound iterator complications)

This sentence appears to be incomplete. Many of the time-bound iterator complications stem from deletion of the metadata key not carrying any timestamp information. How would this proposal address that?

In addition to all of these benefits, I don't think the one downside (multiple storage engine scans) is as bad as it sounds. The lock table's RocksDB block will likely always remain in cache, so it's unlikely that this will have a noticeable impact on performance.

We could test this hypothesis by changing the existing MVCC code to always do a separate seek for the metadata and version key (I'm thinking of MVCCPut here). I suspect the performance hit will be higher than you expect. This would be the worst case, though. I'd be curious to know what this performance hit is so we could judge how many of the improvements need to be done to offset it.

Strawman Proposal

The interleaving of the metadata records with the versioned records is embedded in a lot of MVCC code. It would be worthwhile to enumerate the places that need to get adjusted, such as MVCCComputeStats, MVCCScan, MVCCPut, etc. Many of those routines take a single iterator. With a segregated lock table we'd need them to take two iterators, but presumably at the same snapshot (which is a bit of overhead with RocksDB, though something we could solve with Pebble).

nvanbenschoten commented 4 years ago

How would this proposal address that?

This proposal would get rid of (non-synthetic) metadata keys outside of their use for inline kvs (which have no timestamp). Transaction isolation would be completely separate from the MVCC keyspace. A user of time-bound iterators (along with all other requests) would first check for locks at their desired timestamp. This would all be done based on the declared spans of the request. If none exists, they could scan with a time-bound iterator without needing to worry about whether they should observe intents or not.

With a segregated lock table we'd need them to take two iterators, but presumably at the same snapshot (which is a bit of overhead with RocksDB, though something we could solve with Pebble).

My strawman should have made it more clear that we wouldn't be trying to merge iterators or keep the current MVCC logic. We would check for locks in the lock table before doing the scan. We wouldn't need to scan at the same RocksDB snapshot because we already have in-mem latches to coordinate conflicting requests.

The only case where this would result in different semantics is with limited scans. Right now, a limited scan will ignore an intent if it never reaches it. This proposal as is wouldn't know that it can avoid an intent that it never reaches. We already have this issue with latches though, so I don't think this is really a new issue (see https://github.com/cockroachdb/cockroach/issues/9521 + https://github.com/cockroachdb/cockroach/pull/33373).

petermattis commented 4 years ago

My strawman should have made it more clear that we wouldn't be trying to merge iterators or keep the current MVCC logic. We would check for locks in the lock table before doing the scan. We wouldn't need to scan at the same RocksDB snapshot because we already have in-mem latches to coordinate conflicting requests.

That works for MVCC scan, but what about MVCC put?

nvanbenschoten commented 4 years ago

It would be roughly the same thing. We'd check for conflicting locks in the lock table before performing the put. If there were no conflicts then the put would write the provisional version kv and add a lock to the lock table. This is how we could get blind writes into the MVCC keyspace. Does that answer your question or am I missing something?

One complication here is when a transaction updates its own write. We'd need to store the sequence history on the lock itself.

petermattis commented 4 years ago

Does that answer your question or am I missing something?

Yes, though I was thinking about the complication you mention below.

One complication here is when a transaction updates its own write. We'd need to store the sequence history on the lock itself.

I was imagining this complication as well. Essentially you're proposing to decompose MVCC operations into lock checking phases and then execution phases. With MVCC put we may need to be passing info from the lock check phase to the execution phase. I only vaguely see how this would be done, though it doesn't seem infeasible.

sumeerbhola commented 4 years ago

@nvanbenschoten Regarding "This proposal as is wouldn't know that it can avoid an intent that it never reaches. We already have this issue with latches though ..." IIUC, latch lifetime is always short, while write intents will live for the duration of the transaction which could be long. Will this change imply one of (a) evaluate the limited scan and then see if it picked up a "value under lock", or (b) pass all the locks down to the limited scan so it can return early if it is seeing a "value under lock"?

nvanbenschoten commented 4 years ago

latch lifetime is always short, while write intents will live for the duration of the transaction which could be long.

Latch lifetime is often on the order of the latency of replication, so it's not always short. In fact, in a lot of cases, it's roughly the same as the lifetime of a transaction.

Will this change imply one of (a) evaluate the limited scan and then see if it picked up a "value under lock", or (b) pass all the locks down to the limited scan so it can return early if it is seeing a "value under lock"?

I wasn't imagining the change trying to do either of those proposals immediately, but they seem like good follow-on optimizations.

nvanbenschoten commented 4 years ago

Here's are a few other thoughts/clarifications from last week that are now floating somewhere around the bay area.

(1) One of the largest remaining abstraction violations of this proposal (which @petermattis picked up on above) is that a transaction will need a delicate interaction with its locks when rewriting intents and when reading below the latest sequence number. This is because we'll likely need to continue storing the intent history on the lock key, which will now move to the lock table keyspace. This could all be made simpler if we included a sequence number suffix in the MVCC keyspace directly. If we did this then keys would look like key/timestamp/sequence.

With this structure, we would then be able to let a transaction write multiple versions of a key directly into the keyspace and not require any coupling with an intent history when reading or writing. If a transaction was reading at an earlier sequence then it would naturally observe only sequence numbers up to the sequence it desires. If we did this then we wouldn't even need to remove old sequence numbers during intent resolution, although I think we'd want to in practice. I think @andy-kimball's AcidLib did something like this, and I can see why.

The big thing I haven't figured out here is how to deal with timestamp collisions between txns. To get the desired effect, I think we would need the portion of the keyspace before the sequence to be unique to a given transaction so that a transaction's own sequence number would never influence the values it can observe from other transactions. Something like key/timestamp/txnid/sequence would do the trick. Of course, this would be too expensive to do if the txnid was a UUID.

This would also be another migration, although it's slightly less daunting than the main proposal here because we could just default missing seqnums to 0.

(2) This proposal would naturally lead to us releasing read latches before performing the read itself. A read could simply grab a read latch, check locks (triggering contention handling if necessary), bump the timestamp cache, and then grab a RocksDB iterator before dropping its latches. The actual MVCC iteration could be done completely outside of latches. This would address a question that @andreimatei has had for a long time.

(3) Allowing intents to be resolved without latching would be beneficial for a few reasons that are listed above. The biggest of these is that conflicting transactions wouldn't need to wait for the intent resolution to go through Raft before proceeding.

In addition to this, I think the change would have another benefit which could simplify conflict resolution. Right now, requests that hit WriteIntentErrors release their latches before handling the conflict. They then push the conflicting transaction or enter the contentionQueue.

If we could efficiently check for intents/locks before evaluation then I think we could have tighter interaction between latching and locking and simplify the transaction pushing interaction. After a transaction has acquired latches it would check for conflicting locks. If it finds any, it could enter a contentionQueue-like queue before dropping its latches and pushing. This would end up working like a condition variable, where the condition is !conflicting_locks && !conflicting_contention_queues and the condition variable's lock is the latch.Manager.

This came up recently in a SELECT FOR UPDATE prototype. Maintaining an in-memory lock structure for "upgrade locks" had the desired effect of reducing transaction retries. However, I was also seeing it push out tail latencies. This appeared to be due to fairness issues around the contentionQueue. Because requests don't maintain latches while in the queue, other requests can slip in ahead of the head of the queue after the transaction holding the lock finishes. This led to a tremendous amount of barging. Some transactions would queue up in the contentionQueue while others would slip by and grab the lock, leading to a fast class of txns and a slow class of txns. Tying together latching and conflict resolution would avoid these issues and improve fairness while also making the interactions easier to think about.

petermattis commented 4 years ago

This would also be another migration, although it's slightly less daunting than the main proposal here because we could just default missing seqnums to 0.

How would we encode the seqnum? MVCC keys are currently encoded as:

<key>\x00[<wall_time>[<logical>]]<#timestamp-bytes>

The last byte in the key is either always 0, 8, or 12, indicating no timestamp, walltime-only, or walltime+logical. We can't easily append anything else to this format without rewriting all of the keys. I think we can add an optional sequence number. This would give the following meanings to the timestamp-bytes value:

0 - no timestamp or txn-seqnnum
8 - walltime, no txn-seqnum
12 - walltime+logical, no txn-seqnum
16 - walltime+txn-seqnum
20 - walltime+logical+txn-seqnum

I think that works. Not exactly what I'd design if starting from scratch, but it does mean we wouldn't have to rewrite all of the data on a store for a migration.

nvanbenschoten commented 4 years ago

@petermattis That backward-compatible encoding does seem workable if we go this route. At the moment though, I don't think the rest of the idea is fully workable. The primary blockers to it are:

The "timestamp collisions between txns" issue mentioned above
How read-your-writes would interact with transactions that read and write at different timestamps.

The way I have this working in my prototype is that we have a lockAwareIterator. This iterator can be thought of as a merging iterator pointing into MVCC keys and into the lock table, but which has a few fast-paths for transactions with 0 or a small number of locks (discovered during lock checking). The iterator is configured with a roachpb.Transaction (for the id, epoch, seq_num, and read_timestamp). When MVCC code iterates over a provisional value owned by the reading transaction and associated with a lock in the lock table, the iterator performs a translation step which:

logically moves the provisional value to the read timestamp. This ensures that transactions always read-their-writes in MVCC code, even if they don't realize that they are their own
replaces the value with a value from the sequence history if necessary, based on the seq nums of the lock and the reading transaction

makes it easier to augment the durable lock-table with a cache of known committed transactions, which might allow us to resolve intents (release locks) without holding latches. This would reduce the contention footprint of transactions (see #22349)

The idea here would have reduced the contention footprint of transactions from 3 rounds of distributed consensus down to 2. We could mark locks as COMMITTED before going through the process of actually removing them. Other transactions would then ignore these locks during their lock checking phase. The round of Raft to remove the locks could then be done without holding latches.

There would need to be a translation step for provisional values that are being moved to new timestamps during the intent resolution process. The lockAwareIterator could help out here as well. It could translate the provisional values for COMMITTED-but-not-removed locks to higher timestamps, if necessary.

Reducing the contention footprint from 3 writes to 2 is great, but @tbg just had a fantastic idea that would allow us to reduce it even further, down to just 1 write. The idea would address the longstanding desire to resolving intents while a transaction is implicitly committed (STAGING) instead of explicitly committed (COMMITTED). While in this phase, we could mark the lock as COMMITTED_BUT_NOT_REMOVABLE as soon as we're aware that the transaction's parallel commit has finished. Other transactions would ignore the lock just as before, but no-one would be allowed to remove the lock until the lock's owner became explicitly committed. In this way, we could get the lock out of the way of other transactions without removing evidence of an implicitly committed transaction.

sumeerbhola commented 4 years ago

@nvanbenschoten I don't fully understand the 3 rounds of distributed consensus and the optimizations to get it down to 2 rounds and then to 1 round. My understanding of parallel commits is that the 3 rounds are

transaction becomes STAGED (this is written with the last batch)
transaction becomes COMMITTED
intent resolution/lock removal

After the STAGED round and all the writes being successful, the client can be informed of the commit but there are two more rounds before intents are removed. Is that correct? Is the leaseholder for the range with the transaction record supposed to drive the intent resolution? And the second and third rounds cannot be concurrent since they may be at different ranges and if the third races ahead of the second and the second needs to be retried we would have lost the intents that confirmed that the STAGED transaction could be COMMITTED, and would incorrectly think it had failed and remove all its writes? IIUC, getting it down to 2 rounds relies on having multiple intents/locks, of which at most one has a transaction that is not COMMITTED. Marking these locks COMMITTED would happen in the second round after which new locks can be allowed. And the third round would remove these COMMITTED locks. And the final optimization marks these locks as COMMITTED(_BUT_NOT_REMOVABLE) without a round of consensus (this transition would be best-effort since it is not using raft), the second round of consensus would guarantee that these would be marked COMMITTED and the third round would remove them. Is my understanding correct?

nvanbenschoten commented 4 years ago

After the STAGED...

Yes, your understanding is correct for all three questions in this paragraph.

Marking these locks COMMITTED would happen in the second round after which new locks can be allowed. And the third round would remove these COMMITTED locks. And the final optimization marks these locks as COMMITTED(_BUT_NOT_REMOVABLE) without a round of consensus (this transition would be best-effort since it is not using raft)

Neither transition needs to go through Raft. This information could all be held in memory while we wait to be able to remove the locks and during the process of removing the locks. I don't think the locks would ever be marked as COMMITTED durably. The important part is that the leaseholder can begin letting new requests observe the result of the know-committed transaction immediately after it hears that the transaction is committed, even if it doesn't remove the lock immediately.

So getting the contention footprint down from 3 rounds to 2 is a clear win. Getting it down from 2 to 1 would require more communication ("a txn is committed so release its locks, but don't remove them"), so it's not obvious that it's a win in all cases, but it certainly would be in contended cases.

tbg commented 4 years ago

I don't know if this idea already came up somewhere (it looks like we've discussed it only for non-versioned keys), but if we had a key encoding scheme which contains the seqno and txnid, we could ensure that the versions are never overwritten and thus we'd be able to use SingleDelete when we eventually delete them. I don't know if this would really give us a lot of benefit, though, since versions are only deleted when GC runs, at which point the original write has probably sunk down in the LSM to some degree (definitely with today's 25h TTL).

petermattis commented 4 years ago

I don't know if this would really give us a lot of benefit, though, since versions are only deleted when GC runs, at which point the original write has probably sunk down in the LSM to some degree (definitely with today's 25h TTL).

There has been a small bit of water-cooler discussion about performing GC during compactions which would completely eliminate any overhead for GC'ing old data.

tbg commented 4 years ago

There has been a small bit of water-cooler discussion

It's here: https://github.com/cockroachdb/cockroach/issues/16132

sumeerbhola commented 4 years ago

@tbg Adding to what Peter said, GC during compactions is one of the ways we can exploit knowledge of the timestamp in the key at the Pebble level. It would need to know which time intervals are needed for backups etc. (this would be similar to how snapshot aware gc of sequence numbers works in RocksDB and Pebble), to know which "dead" versions can be thrown away. When there are dead and live versions in the same sstable one could track the amount of dead data in each sstable with a coarse age histogram. This could be used to trigger sstable rewrites (independent of normal compactions) based on a simple cost-benefit function. This won't clean all dead versions -- those that can only be known to be dead when looking at a different sstable that does not participate in compactions with this sstable will need to be deleted by a periodic scan of the key space. Another optimization could be to make these key scans cheap by segregating keys and values inside a file.

nvanbenschoten commented 4 years ago

I wonder if the ability to mark locks as non-active without removing them could allow us to get rid of the AbortSpan. A poisoning PushTxnRequest could mark the lock as ABORTED instead of removing it. It could then defer to the transaction itself to actually remove the lock. When scanning the lock table, transactions would take notice of whether their own locks have been ABORTED, which would provide the same protection against zombie transactions that the AbortSpan currently does.

If we got rid of the AbortSpan lookups then this change wouldn't even be introducing any new RocksDB iterations. cc. @andreimatei.

sumeerbhola commented 4 years ago

The segregated lock table also helps bulk operations, based on a discussion with @dt (that I've tried to capture below):

The timebounds property collector is not being used to constrain the sstables being read for an incremental backup because of the known bug that the deletion of the MVCCMetadata at timestamp 0 is not seen. This makes incremental backups more expensive. While there is a workaround implemented (but not enabled because of worries about bugs), the segregated lock table will make this simple: only constrain the data scan based on the sstable time bounds, and scan the whole segregated lock table (since it will be small).
When a backup encounters an intent it resolves it and repeats the work. So if it encounters n intents the cost is O(n^2) of the single data scan. One could reduce this to 2 data scans with one scan to find all the intents, resolve them and then do a second scan for the backup. This is simpler with the segregated lock table since one would first scan the lock table (this scan is cheap because the locks are segregated) and after resolving intents scan the data once.
For imports and new indices, that are going to add an sstable at a particular timestamp t:
- it needs to validate that there are no provisional values < t. This requires an expensive scan of the data. With the segregated lock table one only needs a cheap scan of the lock table.
- the add cannot be transactional since any concurrent transactions that have timestamp > t may miss the data being added. For imports this requires taking the table offline. With the lock table the import could acquire an exclusive lock on the span of the sstable being imported. Importing into a part of the table key space without taking the whole table offline is a feature that has been requested.

sumeerbhola commented 4 years ago

I was trying to think through two aspects of the transition for 20.2:

How will the format change work:
- Do it a range at a time: We may not be able to merge ranges where one range has fully transitioned and another has not so we could start from one end of the key space and transition ranges so that most range boundaries are homogenous.
- Cannot afford to reduce range availability so do the transition a replica at a time.
- A replica should not have to run with some intents in-band and some segregated. So do the transition by adding a learner with segregated lock table.
- Snapshot sstables sent to a segregated replica need to be transformed before applying to the state machine. Also need the inverse for snapshots sent from segregated replicas to in-band replicas. We could do these conversions in a 20.2 node so that the 20.1 nodes can be oblivious to this change.
- Raft behavior: the WriteBatch in a RaftCommand needs to be different in segregated and in-band replicas. This suggests the transformation needs to happen below Raft. We could make this transition start only when all replicas are running on 20.2 so at least the in-band replicas know about the compatibility issue. Should we split the WriteBatch into two parts for efficiency of the transformation -- to do the transformation, the code only has to look at the batch containing the locks and then merges the two batches before applying.
- This seems to also affect MVCCStats since the byte size of the locks will probably change.
API changes:
- The MVCC API no longer needs to know about transactions or intents. If we want (or need, based on what the format change requires) 20.2 to support both segregated and in-band locks, can we afford to change the API, since it will mean all callsites will need to work with both APIs?
- Alternatively, do we keep the same API with a different implementation that uses the segregated lock table and delay the cleanup? In this setup the sequencing in the concurrency package would use the segregated lock table but adding locks would happen via the MVCC interface.

Am I overcomplicating and is there a simpler way?

tbg commented 4 years ago

I agree that the migration is tricky.

What you're describing sounds like a big fork-lift, and thus fairly complicated. I think we can break it down more. All in all, it looks manageable to me.

First, MVCC will have to know about intents simply because the 20.2 binary will initially run in a cluster in which the replicated lock table is not a thing (remember, it has to have compat with 20.1). But we will want to make sure we can rip out the inline replicated intents in 21.1, which means that by the time the cluster has fully acked the 20.2 version upgrade, there should not be any inline intents left in the cluster.

We should be able to achieve this via a long-running migration (not available yet, but in my book a must-have for 20.2): after the version bump, all ranges start using the replicated lock table instead of putting intents inline, and long-running migrations gives us a hook to finalize the upgrade that sweeps through the inline keyspace and pushes (with a small grace period) any intent it finds. Note that this means that mixing between lock table replicated and inline replicated intents is possible. Based on my understanding of the lock table APIs today, that is fine (?). Each individual range is upgraded through Raft (remember, the intents are part of the replicated keyspace, so they must only be changed through consensus) via #42882 (this further informs that it could be dangerous to try to forklift the move from inline to lock table - it would have to happen in one raft command, whose size would be unbounded under adverse circumstances).

Snapshots always get in the way of a "get rid of legacy data" migration. I think long-running migrations (more precisely a version of #42882 that can run for longer, too) can make this manageable too, though. Snapshots never roll back raft applied state, so all we need to do is make sure orphaned replicas are eagerly GC'ed at some point in the cycle, which is something I'm aware of there.

Assuming we get all of this stuff right, MVCC can forget about inline intents in 21.1 (i.e. on master in late 2020), completing the migration.

sumeerbhola commented 4 years ago

Note that this means that mixing between lock table replicated and inline replicated intents is possible. Based on my understanding of the lock table APIs today, that is fine (?).

I think that is the key insight, and yes I think it should be doable. A new intent won't be added until after both the sequencing through the lock table (so a segregated intent would have been found), and request evaluation using the MVCC api that would find inline intents, so it is safe. And that new intent could be either segregated or inline. Resolving an intent either (a) needs to look in both places, or we would (b) plumb a bit about where it was found (for a push) or placed (when committing). (b) may have issues if there is a race with a push and a segregated intent was replaced with an inline intent or vice versa -- the push may "succeed" even though the intent is still there. But I think that can happen now too and does not affect correctness since the transaction will push again.

sumeerbhola commented 3 years ago

@irfansharif For migrating all state to separated intents, here is a rough sketch that needs fleshing out in the context of what has been built for long running migrations https://github.com/cockroachdb/cockroach/pull/48843

Starting state is (DisallowSeparatedIntents=true, enabledSeparatedIntents=false). These are described here https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/intent_reader_writer.go#L22-L48 (ignore that these are currently constants in the code).
Wait for all nodes in the cluster to be running 21.1, and no rollback to old versions.
Set DisallowSeparatedIntents=false.
Wait for all nodes in cluster see this change, so they know to read separated intents.
Set enabledSeparatedIntents=true. When nodes see this, they start writing separated intents. Some nodes will write separated and some interleaved. A separated intent can be rewritten as interleaved (when intent is updated, say due to a txn push) if the leaseholder moves from a node that was writing separated, to a node that is writing interleaved. This mixed behavior is ok.
Next state transitions for cluster:
- Partial Rollback: Set enabledSeparatedIntents=false. New and updated intents will be written as interleaved. There can still be some separated intents because of long-running txns or intents that were not resolved.
- Full Rollback:
- Set enabledSeparatedIntents=false.
- Wait for all nodes to see this so that they are writing interleaved intents.
- Wait for duration T, to minimize the number of separated intents in the system (since txns will complete and these intents will be resolved).
- Run migration to rewrite separated intents as interleaved.
- Wait until migration is finished for all ranges.
- Set DisallowSeparatedIntents=true.
- Finalize separated:
- Wait for duration T, to minimize the number of interleaved intents in the system (since txns will complete and these intents will be resolved).
- Run migration to rewrite interleaved intents as separated.
- Wait until migration is finished for all ranges.
- The cluster is now ready for 21.2, which will not have code to read or write interleaved intents, and will include optimizations that assume no interleaved intents.
What does the migration that rewrites interleaved to separated or vice versa do:
- Runs per range
- Scans for intents in the range without holding latches, creates a batch of transactions that need to be pushed, and pushes them.
- Should this be a PUSH_ABORT since we've already waited for T? Do we need to use a priority higher than MaxUserPriority to that it succeeds.
- The push can still fail so need to retry.
- Need throttling so that don't do this scanning for all ranges simultaneously.
Other considerations:
- Range snapshot from pre-migration shouldn't be used after the migration is complete.
- Orphaned replicas with pre-migration state?

sumeerbhola commented 3 years ago

@irfansharif and I chatted about it (Irfan, please add/correct if I missed something):

Long duration: This migration requires scanning all the data on a range, so we will be very careful with throttling. It is possible that the migration takes hours or days. The current scheme for pre-migration snapshots not showing up after migration is complete, is to bump the low-watermark version on each store before the migration executes. This is likely to be insufficient for such a long running migration since we want want to bar old snapshots only after we've made one pass and migrated all ranges. The second pass could bar old snapshots and pick up any stragglers. One option we discussed was to structure it as 2 migrations.
Below or above raft: The sketch above does this above raft, to avoid holding wide latches -- latches for intent resolution will be narrow. The migration will explicitly need to invoke the behavior where the gc queue is used to purge orphaned replicas since the default behavior is to only do this for below-raft migrations.
Rollback, opt-out etc.: we do want to remove interleaved intents code for 21.2, so opting all users in for separated intents is desired. But I think we need the ability to opt-out, and rollback after opting-in. This may need some metadata flexibility wrt migrations -- something like (dynamically assigned) migration number N corresponds to migration M, such that the history of migrations could look like: (10, migrate to separated), ..., (13, migrate to interleaved), ...(20, migrate to separated), ... where the first element in each tuple is a migration number.

@nvanbenschoten @tbg Do you have any thoughts on this and the rough migration plan outlined in the preceding comment https://github.com/cockroachdb/cockroach/issues/41720#issuecomment-743473275?

tbg commented 3 years ago

I don't think I was able to piece all of the details together from your last two comments, so I'm guessing a bit, apologies if something's off. My first question is, why do you need to "partially rollback" the separate lock table? Have we decided that we want that to be a requirement? I'll ignore that for now, but if we indeed want this (i.e. CRDB at full 21.1 and we still want to use inline intents) let me know, I think that's straightforward but means we can't set up the migration to enforce absence of inline intents, i.e. we don't get to delete the code in the next cycle.

I thought about the migration a bit. It is tricky!

I would think that this migration is split among two versions, EnableSeparateLockTable and NoMoreInlineIntents. When the first cluster version is active, nodes will be writing to the separate lock table. When it's active, nodes will write new intents via the separate lock table, but they will still need to be able to handle inline intents. That's the standard stuff we have had in CRDB for a long time.

The second version is what phases out inline intents and does the heavy lifting. In practice, we expect a lot of ranges to fall in one of the two buckets:

it's a timeseries range (which can't contain intents)
it has !ms.ContainsEstimates && ms.IntentCount==0

The goal of the migration is to set things up so that

a) no more inline intents can appear in the keyspace b) pass through the keyspace once and remove all inline intents

For a), since EnableSeparateLockTable being active implies that all new Raft proposals will use the lock table, all we need to do is flush the raft pipeline on each range once, which is most easily done via an almost-noop below-raft migration. This includes the round of replicaGC for any stray replicas, and protection against stale snapshots; I assume that both of this will be available as fallout of the TruncatedState below-raft migration and that we use it here, before we even start trying to phase out inline intents.

Ok, so we know that if we consider any span of the keyspace and look at the inline intents within it, that set will not see any future additions.

For the second part, it's important to note that nothing here is "per-range". We could simply do a full scan of the keyspace and resolve any inline intents we find, and that would be valid. But it would be slow, and by looking at things on a range-by-range basis, we can take shortcuts:

If it's wholly a timeseries range, it's free. If it's not and contains no estimates, and the intent count is zero, it's also free. Otherwise, i.e. it's likely a range seeing a steady drip of write traffic, we have to slowly scan it and push any inline intents we find out of the way. When that is done, we know that the keyspace we scanned is free of inline intents.

To run the actual scan, I think we need to introduce a new read-only (working-name) MigrateLockTable command that leaks an engine snapshot to the (local) caller. The reason to introduce a command is to get the proper serialization, i.e. a proof that the engine snapshot really contains the data for our span of interest; if we didn't do that the replica might've moved elsewhere and the span emptied. I don't know if MigrateLockTable even needs to acquire any latches; the previous migration should've already made all inline intents visible. It will probably want to grab a read latch on the range descriptor for sanity, but again I'm not sure it needs to.

The caller then slowly iterates through and resolves any inline intents.

By the time this has finished, the range may have split or merged, but that is fine - recall that we are proving something about the keyspace, and not individual ranges. The span we looked at is still inline-intent free, as per a).

Implementing the scan side of is mildly awkward. I think it's actually reasonable to peek into the responses here:

https://github.com/cockroachdb/cockroach/blob/32307340f401a2449d2d98f6f89a4efff6f20172/pkg/kv/kvserver/replica_send.go#L118-L117

and to trigger the scan code when we see a MigrateLockTableResponse. (Some bending over backwards will be required to pass the engine snapshot there; I think we'll just put it in a map with a unique key, put the unique key in the response, and the scan code can fish+remove it).

This will allow us to make MigrateLockTable a ranged read request (though one with weird latching) and to take advantage of all of the existing machinery to apply a range-spanning command while range boundaries may change. If we don't do this, we will basically have to rewrite a small DistSender that retries appropriately and keeps track of which parts of the keyspace haven't seen the command yet, which I think we should avoid.

After all of this, I think the nice thing will be that the migration could basically just run MigrateLockTable{Key: KeyMin, EndKey: KeyMax}, though of course in practice it wants to use the per-range shortcuts, so it's more like:

for {
  descs := getNext200RangeDescriptorsWithStats()
  for _, desc := range descs {
    if canUseShortcut(desc) {
      continue
    }
    migrateWorkerPool <- desc
  }
}

We should also finish things up with another round of MigrateRequest, to flush the Raft pipelines, which will allow us to remove MigrateLockTable itself in the next cycle already.

I know this is a lot to take in, let me know which parts need more elaboration or which alternatives we should consider. (I thought about quite a few alternatives, but it became too noisy to write them down; they were mostly about how to set up the MigrateLockTable request).

andreimatei commented 3 years ago

This proposal would naturally lead to us releasing read latches before performing the read itself. A read could simply grab a read latch, check locks (triggering contention handling if necessary), bump the timestamp cache, and then grab a RocksDB iterator before dropping its latches. The actual MVCC iteration could be done completely outside of latches. This would address a question that @andreimatei has had for a long time.

The big problem here is, I think, scans with limits for which we wouldn't know exactly what to put in the timestamp cache before evaluating them fully (right)? In other words, we'd go from "too wide latches" (#9521) to "too wide ts cache entries". No?

sumeerbhola commented 3 years ago

The big problem here is, I think, scans with limits for which we wouldn't know exactly what to put in the timestamp cache before evaluating them fully (right)? In other words, we'd go from "too wide latches" (#9521) to "too wide ts cache entries". No?

I believe you are correct. For limited scans we are going to stop checking latches and locks before the first evaluation attempt e.g. https://github.com/cockroachdb/cockroach/pull/58670

ajwerner commented 3 years ago

Now that the migration is checked in, is it time to close this issue?

lunevalex commented 2 years ago

Closing this, as it seems like we did all we are going to do in 21.2