facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0
28.28k stars 6.28k forks source link

Support DeleteRange in transactions #4812

Open abhimadan opened 5 years ago

abhimadan commented 5 years ago

Now that DeleteRange is performant, it would be nice to support it in transactions as well. Summary of the essential changes:

  1. Finding the last visible RangeDelete tombstone needs to be updated to make use of ReadCallback::IsVisible same as the regular reads. https://github.com/facebook/rocksdb/blob/106a94af1552ba6eb89cfe08fd4f92c2078af463/db/range_tombstone_fragmenter.cc#L311
  2. WritePrepared tests assumed the key cannot be RangeDelete. The tests must be updated to add RangeDelete type anywhere Delete and SingleDelete is used: https://github.com/facebook/rocksdb/blob/master/utilities/transactions/write_prepared_transaction_test.cc#L513
  3. Finally the existing tests for range delete should be extended to include transactions (extended with ::DeleteRange) API as well.
maysamyabandeh commented 5 years ago

Since DeleteRange and WritePrepared Transactions are developed in parallel we need to make sure that they are consistent with each other. One difference to watch out for is that in WritePrepared the prepare and commit sequence numbers are separate: the prepare sequence number is stored in DB with the key/value and commit is stored outside. I went through the code to figure the potential places that needs careful consideration:

  1. There is a TODO here for duplicate key detection: https://github.com/facebook/rocksdb/blob/52e6404e0f72b6f994a442084415f518f9ec9f6c/db/write_batch.cc#L1435 to include the end key but I think it is fine since the end key is encoded in the value, thus not causing duplicate key issue.
  2. In Memtable::Get there is a comparison of seq to max_covering_tombstone_seq which is obtained from DeleteRange tombstone. https://github.com/facebook/rocksdb/blob/52e6404e0f72b6f994a442084415f518f9ec9f6c/db/memtable.cc#L648 I read the code a bit but could not figure what is the significance of this sequence number and whether it is consistent with separation of prepare and commit sequence number introduced in WritePrepared transactions. Same goes for GetContext: https://github.com/facebook/rocksdb/blob/52e6404e0f72b6f994a442084415f518f9ec9f6c/table/get_context.cc#L205
  3. WritePrepared tests assumed the key cannot be RangeDelete. The tests must be updated to add RangeDelete type anywhere Delete and SingleDelete is used: https://github.com/facebook/rocksdb/blob/master/utilities/transactions/write_prepared_transaction_test.cc#L513
  4. ::ShouldDelete that checks against the deleted ranges, should only care about user key and ignore the sequence number. However it uses internal key comparator instead of user comparator: https://github.com/facebook/rocksdb/blob/3c5eed5ebee8eb06425e8716e1a56fb9782ce9a9/db/range_del_aggregator.cc#L210 We need to clarify why so that the sequence number, that is interpreted differently in write prepared transactions, is not used meaningfully here.
  5. Finally the existing tests for range delete should be extended to include transactions (extended with ::DeleteRange) API as well. We can discuss these items here, see which one could be potential issue, and specify what exactly needs to be done to extend transactions with DeleteRange.

@ajkr What do you think?

ajkr commented 5 years ago

I read the code a bit but could not figure what is the significance of this sequence number and whether it is consistent with separation of prepare and commit sequence number introduced in WritePrepared transactions.

Unfortunately I don't know much about write-prepared to answer your question, but here's some more detail about that seqnum from the DeleteRange side.

max_covering_tombstone_seq is the seqnum of the newest range tombstone we've found so far that covers the user key Get() is looking up. (Side note: I suppose it could've also been called first_covering_tombstone_seq because Get() looks at memtables and SSTs from newest-to-oldest, so after it finds one covering tombstone it will not find another one with a higher seqnum.)

It is used in the places you mentioned to check whether an internal key found in a memtable or SST is deleted by a range tombstone. It is also used to short-circuit the lookup process, e.g., https://github.com/facebook/rocksdb/blob/3c5d1b16b129de040e1adaca89c6957c0fb3ace8/db/version_set.cc#L1224-L1228

The above logic exploits how we examine memtables/SSTs in newest-to-oldest order. If any covering range tombstone was found in an earlier memtable or SST, then it's definitely newer than any internal key we might find now or later in the lookup process, so we can give up early.

ajkr commented 5 years ago

::ShouldDelete that checks against the deleted ranges, should only care about user key and ignore the sequence number. However it uses internal key comparator instead of user comparator:

I always have trouble remembering this one. Luckily @abhimadan documented it here: https://github.com/facebook/rocksdb/wiki/DeleteRange-Implementation#internal-key-range-tombstone-boundaries.

maysamyabandeh commented 5 years ago

Thanks @ajkr for the clear explanation. It was very helpful. I update the issue description will the list of required changes that we have figured so far.

maysamyabandeh commented 5 years ago

@ajkr One more question, I do not see MaxCoveringTombstoneSeqnum taking into account the snapshot number with which we are doing the read. How does it know if the DeleteRange Tombstone is in the read snapshot (and not newer than that)?

ajkr commented 5 years ago

@maysamyabandeh The FragmentedRangeTombstoneIterator has some internal logic preventing it from considering range tombstones newer than this read_seq: https://github.com/facebook/rocksdb/blob/89ab1381f8efe38002f43ecbeb56aa8451529436/db/memtable.cc#L436. That read_seq appears to come from here in the case of Get(): https://github.com/facebook/rocksdb/blob/89ab1381f8efe38002f43ecbeb56aa8451529436/db/db_impl.cc#L1363-L1392

feliwir commented 4 years ago

Any updates on this?

feliwir commented 2 years ago

We still need this in 2021 :-/

chenyukang commented 10 months ago

Any update on this? We still need this in 2023 :-/

koudelka commented 8 months ago

Would love this for 2024 😢