Open abhimadan opened 5 years ago
Since DeleteRange and WritePrepared Transactions are developed in parallel we need to make sure that they are consistent with each other. One difference to watch out for is that in WritePrepared the prepare and commit sequence numbers are separate: the prepare sequence number is stored in DB with the key/value and commit is stored outside. I went through the code to figure the potential places that needs careful consideration:
@ajkr What do you think?
I read the code a bit but could not figure what is the significance of this sequence number and whether it is consistent with separation of prepare and commit sequence number introduced in WritePrepared transactions.
Unfortunately I don't know much about write-prepared to answer your question, but here's some more detail about that seqnum from the DeleteRange
side.
max_covering_tombstone_seq
is the seqnum of the newest range tombstone we've found so far that covers the user key Get()
is looking up. (Side note: I suppose it could've also been called first_covering_tombstone_seq
because Get()
looks at memtables and SSTs from newest-to-oldest, so after it finds one covering tombstone it will not find another one with a higher seqnum.)
It is used in the places you mentioned to check whether an internal key found in a memtable or SST is deleted by a range tombstone. It is also used to short-circuit the lookup process, e.g., https://github.com/facebook/rocksdb/blob/3c5d1b16b129de040e1adaca89c6957c0fb3ace8/db/version_set.cc#L1224-L1228
The above logic exploits how we examine memtables/SSTs in newest-to-oldest order. If any covering range tombstone was found in an earlier memtable or SST, then it's definitely newer than any internal key we might find now or later in the lookup process, so we can give up early.
::ShouldDelete that checks against the deleted ranges, should only care about user key and ignore the sequence number. However it uses internal key comparator instead of user comparator:
I always have trouble remembering this one. Luckily @abhimadan documented it here: https://github.com/facebook/rocksdb/wiki/DeleteRange-Implementation#internal-key-range-tombstone-boundaries.
Thanks @ajkr for the clear explanation. It was very helpful. I update the issue description will the list of required changes that we have figured so far.
@ajkr One more question, I do not see MaxCoveringTombstoneSeqnum
taking into account the snapshot number with which we are doing the read. How does it know if the DeleteRange Tombstone is in the read snapshot (and not newer than that)?
@maysamyabandeh The FragmentedRangeTombstoneIterator
has some internal logic preventing it from considering range tombstones newer than this read_seq
: https://github.com/facebook/rocksdb/blob/89ab1381f8efe38002f43ecbeb56aa8451529436/db/memtable.cc#L436. That read_seq
appears to come from here in the case of Get()
: https://github.com/facebook/rocksdb/blob/89ab1381f8efe38002f43ecbeb56aa8451529436/db/db_impl.cc#L1363-L1392
Any updates on this?
We still need this in 2021 :-/
Any update on this? We still need this in 2023 :-/
Would love this for 2024 😢
Now that DeleteRange is performant, it would be nice to support it in transactions as well. Summary of the essential changes: