cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.1k stars 3.81k forks source link

kv,storage: run migration to strip synthetic bit from MVCC keys #129620

Open nvanbenschoten opened 2 months ago

nvanbenschoten commented 2 months ago

In https://github.com/cockroachdb/cockroach/pull/116830, we removed synthetic MVCC timestamps. To make this possible, we first landed https://github.com/cockroachdb/cockroach/pull/105523, which stopped encoding and decoding the synthetic timestamp bit in/from MVCC keys. This limited the reach of previously stored synthetic timestamps in long-lived clusters to just the MVCC decoding layer. We paired this with https://github.com/cockroachdb/cockroach/pull/117304, which avoided mixed-version cluster concerns during replica consistency checks (along with a number of other PRs listed in https://github.com/cockroachdb/cockroach/issues/101938).

However, we never ran a migration to strip the synthetic bit from persistent MVCC keys that have the bit set. We should, as keeping around these keys presents some risk. This will allow us to get rid of the key decoding logic which deals with the synthetic MVCC key suffix encoding. It will also allow us to get rid of the MVCC key suffix normalization logic in the key comparator and equality functions, which are needed to ensure that read+decode+encode+delete logic works correctly with synthetic timestamps.

Because of the previous work in this area, such a migration can be local to a single replica — it doesn't need to be replicated across all replicas in a range. This simplifies things meaningfully.

Jira issue: CRDB-41621

RaduBerinde commented 2 months ago

@jbowens mentioned that the new sstable format he is working on will not store the synthetic bit. This hasn't been decided but it is possible that we will turn it on by default in 24.3 and perhaps require all sstables to be in that format before upgrading to ~24.1~ 25.1.

On top of that, we can add a NormalizeSuffix function in the comparer that we apply on suffixes in all range keys that we write out during flushes and compactions (and perhaps on all keys in the WAL?). This would guarantee that when all sstables are in the 24.3 format, all suffixes would be normalized.

nicktrav commented 2 months ago

Linking this to #96819 as we'll likely want to think through the implications of these migrations on version upgrades.

jbowens commented 1 month ago

Because of the previous work in this area, such a migration can be local to a single replica — it doesn't need to be replicated across all replicas in a range. This simplifies things meaningfully.

Recording for posterity, that this is no longer true for range keys (MVCC range tombstones). Due to a bug in the Comparer, MVCC range tombstones with and without the synthetic bit are not logically equivalent. We attempted to correct the Comparer, but that predictably caused a replica divergence: https://github.com/cockroachdb/cockroach/issues/130533 A migration to remove the synthetic bit from MVCC range tombstones will need to be performed above Raft. We can continue with the plan to migrate point keys with the synthetic bit below Raft.

Related internal doc: https://docs.google.com/document/d/1nKgIG-3_7xv7YsgbSmtAfI6-1riokBn05hY1J1x_ikw/edit?usp=sharing

Schtick commented 1 month ago

Related incident (replica divergence): https://cockroachlabs.atlassian.net/wiki/spaces/IM/pages/3876585524/2024-09-10+crl-prod-j6q+replica+inconsistency