cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30k stars 3.79k forks source link

kvserver: investigate `SingleDelete` for raft log truncation #8979

Open dt opened 8 years ago

dt commented 8 years ago

@tbg on 2023-04-10:

Re-using this issue to focus exclusively on SingleDelete for the raft log, see https://github.com/cockroachdb/cockroach/issues/8979#issuecomment-414015919.


On #4321, @simkha mentioned SingleDelete and its handling in compactions:

The following link points to a comment in CompactionIterator::NextFromInput(), the comment explains when presence of SingleDelete would compact out (i.e. clear) the stored key and the SingleDelete record. This iterator is used during meltable flush.

https://github.com/facebook/rocksdb/blob/2fc2fd92a94dba8fbad5679fce4522c4769aab27/db/compaction_iterator.cc#L227-L260

If you can adapt your code to meet the restrictions of SingleDelete, then I think, it could be useful.

https://github.com/facebook/rocksdb/wiki/Single-Delete

Jira issue: CRDB-6156

Epic CRDB-39898

tbg commented 8 years ago

See https://github.com/cockroachdb/cockroach/issues/4196#issuecomment-183003370

On Wed, Aug 31, 2016 at 10:45 AM David Taylor notifications@github.com wrote:

On #4321 https://github.com/cockroachdb/cockroach/issues/4321, @simkha https://github.com/simkha mentioned SingleDelete:

This info could be helpful for you. I was studying rocksdb sources and thought I could share this little info.

The following link points to a comment in CompactionIterator::NextFromInput(), the comment explains when presence of SingleDelete would compact out (i.e. clear) the stored key and the SingleDelete record. This iterator is used during meltable flush.

https://github.com/facebook/rocksdb/blob/2fc2fd92a94dba8fbad5679fce4522c4769aab27/db/compaction_iterator.cc#L227-L260

If you can adapt your code to meet the restrictions of SingleDelete, then I think, it could be useful.

https://github.com/facebook/rocksdb/wiki/Single-Delete

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cockroachdb/cockroach/issues/8979, or mute the thread https://github.com/notifications/unsubscribe-auth/AE135HPcrF3fzqoBMh68n2XHzck8DzB4ks5qlZOTgaJpZM4JxsgK .

-- Tobias

petermattis commented 8 years ago

I think we might be able to use SingleDelete for non-versioned keys such as the transaction keys. Currently, those keys are written multiple times (via Put) and then deleted a single time. But we can change that so that we call SingleDelete before overwriting. This would result in a series of Put, SingleDelete, Put, SingleDelete, Put, SingleDelete operations for the key which I believe is kosher.

ghost commented 8 years ago

Interestingly, according to CompactionIterator, currently, the keys deleted with DB::Delete() will only be deleted during compaction, they will not be deleted during memtable flush (unlike SingleDelete). You can tell this by compaction_ != nullptr check.

For completeness, here's the call stack leading to CompactionIterator during memtable flush:

- DBImpl::BackgroundFlush in db/db_impl.cc
- DBImpl::FlushMemTableToOutputFile
- FlushJob::Run -- db/flush_job.cc
- FlushJob::WriteLevel0Table
- BuildTable -- db/builder.cc
- CompactionIterator::Next
- CompactionIterator::NextFromInput -- db/compaction_iterator.cc
ghost commented 8 years ago

Note that DB::Delete()d keys are probably deleted somewhere else in the code because body of the kTypeDeletion case in CompactionIterator::NextFromInput only drops the deletion entry (by skipping over it), not the key-value itself.

ghost commented 7 years ago

The following text from here seems to be pointing to what is the real difference between SingleDelete and Delete:

if you never overwrite existing keys, you can try to use DB::SingleDelete() instead of Delete() to kill tombstones sooner. Tombstones will be dropped after it meets the original keys, rather than compacted to the last level.

nvanbenschoten commented 6 years ago

Expanding on @petermattis's comment https://github.com/cockroachdb/cockroach/issues/8979#issuecomment-243793107, we should explore whether SingleDelete can be used for Raft log entries during log truncations. The impetus for this is that Raft log entries should almost never get very deep into the LSM because log truncations will often follow in close succession. Following this logic, it would be ideal if the deletion tombstones laid down by log truncations also avoided getting very deep into the LSM.

At the moment, this is not the case because log truncations use Delete operations to clean up log entry keys. The effect of this is that even if log entries are removed from the LSM quickly, log entry tombstones stay in the LSM all the way until they hit the bottom. There are a few reasons why this is undersirable. One is that it contributes to write amplification and causes RocksDB to perform extra work. Another is that we've seen that if we're not careful, these tombstones can get in the way of other operations and cause significant slowdown. SingleDelete operations would allow us to avoid both of these problems to some degree.

In order to use SingleDelete operations for Raft log truncation, we'd have to ensure that no log entry is ever written to RocksDB twice without an intermediate SingleDelete. This seems reasonable, as we have all the necessary information (e.g. prevLastIndex) to enforce this invariant in Replica.append. In fact, we already have an optimization that depends on us making the distinction between whether a log entry is new or whether it's overwriting an existing entry:

https://github.com/cockroachdb/cockroach/blob/9f5ea5b34e00b4fd07da4f71c83e911bb0218097/pkg/storage/replica_raftstorage.go#L581-L585

We'd also have to be careful about how SingleDelete interacts with existing Delete and RangeDelete tombstones. For instance, at the moment it looks like SingleDelete will actually remove Delete tombstones, which is absolutely not what we want and something we'd have to be careful about if migrating to the use of SingleDelete.

It's also possible that recent changes to DeleteRange have made it the ideal choice for raft log truncation instead. We should experiment with both options.

nvanbenschoten commented 5 years ago

Another interesting use of this is in cleaning up MVCCMetadata keys that are removed when resolving intents. If we were careful about updating these keys then we could use SingleDelete operations to remove them.

sumeerbhola commented 3 years ago

The separated lock table uses SingleDelete for locks (MVCCMetadata) when possible. The remaining place where this could be used is for the raft log.

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication