dt opened 8 years ago
See https://github.com/cockroachdb/cockroach/issues/4196#issuecomment-183003370
On Wed, Aug 31, 2016 at 10:45 AM David Taylor notifications@github.com wrote:
On #4321 (https://github.com/cockroachdb/cockroach/issues/4321), @simkha mentioned `SingleDelete`:
This info could be helpful for you. I was studying the RocksDB sources and thought I could share it.
The following link points to a comment in CompactionIterator::NextFromInput(); the comment explains when the presence of a SingleDelete will compact out (i.e. clear) the stored key and the SingleDelete record. This iterator is used during memtable flush.
If you can adapt your code to meet the restrictions of SingleDelete, then I think it could be useful.
https://github.com/facebook/rocksdb/wiki/Single-Delete
-- Tobias
I think we might be able to use `SingleDelete` for non-versioned keys such as the transaction keys. Currently, those keys are written multiple times (via `Put`) and then deleted a single time. But we can change that so that we call `SingleDelete` before overwriting. This would result in a series of `Put, SingleDelete, Put, SingleDelete, Put, SingleDelete` operations for the key, which I believe is kosher.
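The alternating pattern above can be sketched with a toy model (an assumption-laden simulation, not RocksDB itself) showing why it is kosher: a `SingleDelete` cancels with exactly the single most recent `Put`, so if two `Put`s land without an intervening `SingleDelete`, the older value resurfaces.

```python
# Toy model of SingleDelete compaction semantics (a sketch, not RocksDB):
# entries for one key are listed newest-first; a SingleDelete annihilates
# with exactly the next older Put, unlike Delete, which shadows everything.

def compact(entries):
    """entries: list of ("PUT", value) / ("SDEL",) tuples, newest first.
    Returns the entries surviving one full compaction pass."""
    out, i = [], 0
    while i < len(entries):
        if entries[i][0] == "SDEL" and i + 1 < len(entries) and entries[i + 1][0] == "PUT":
            i += 2                      # the SingleDelete/Put pair annihilates
        else:
            out.append(entries[i])
            i += 1
    return out

# Supported pattern: every Put is SingleDelete'd before the next Put.
print(compact([("SDEL",), ("PUT", "v2"), ("SDEL",), ("PUT", "v1")]))  # []
# Violation: two Puts with no intermediate SingleDelete -> v1 resurfaces.
print(compact([("SDEL",), ("PUT", "v2"), ("PUT", "v1")]))  # [('PUT', 'v1')]
```

The second call illustrates the restriction from the wiki: overwriting an existing key without first issuing a `SingleDelete` exposes the older version once the newer pair compacts away.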
Interestingly, according to `CompactionIterator`, keys deleted with `DB::Delete()` will currently only be deleted during compaction; they will not be deleted during memtable flush (unlike `SingleDelete`). You can tell this by the `compaction_ != nullptr` check.
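The flush-time asymmetry can be sketched as a toy model (an illustrative simulation of the behavior described above, not RocksDB code): a `SingleDelete`/`Put` pair is dropped already at flush, while a plain `Delete` tombstone survives the flush and waits for compaction.

```python
# Toy memtable flush (a sketch, not RocksDB): mirrors the behavior implied
# by the compaction_ != nullptr check in CompactionIterator::NextFromInput.
# Entries are newest-first for a single key.

def flush(entries):
    out, i = [], 0
    while i < len(entries):
        if entries[i][0] == "SDEL" and i + 1 < len(entries) and entries[i + 1][0] == "PUT":
            i += 2                      # SingleDelete/Put pair dropped at flush
        else:
            out.append(entries[i])      # Delete tombstones and values survive
            i += 1
    return out

# A plain Delete passes through the flush untouched...
print(flush([("DEL",), ("PUT", "v1")]))   # [('DEL',), ('PUT', 'v1')]
# ...while a SingleDelete pair is already gone.
print(flush([("SDEL",), ("PUT", "v1")]))  # []
```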
For completeness, here's the call stack leading to `CompactionIterator` during memtable flush:
- `DBImpl::BackgroundFlush` -- db/db_impl.cc
- `DBImpl::FlushMemTableToOutputFile`
- `FlushJob::Run` -- db/flush_job.cc
- `FlushJob::WriteLevel0Table`
- `BuildTable` -- db/builder.cc
- `CompactionIterator::Next`
- `CompactionIterator::NextFromInput` -- db/compaction_iterator.cc
Note that `DB::Delete()`d keys are probably deleted somewhere else in the code, because the body of the `kTypeDeletion` case in `CompactionIterator::NextFromInput` only drops the deletion entry (by skipping over it), not the key-value itself.
The following text from here seems to point to the real difference between `SingleDelete` and `Delete`:

> if you never overwrite existing keys, you can try to use DB::SingleDelete() instead of Delete() to kill tombstones sooner. Tombstones will be dropped after it meets the original keys, rather than compacted to the last level.
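The difference in tombstone lifetime can be illustrated with a toy LSM sketch (assumed level counts and a hypothetical `tombstone_depth` helper, purely for illustration): a `Delete` tombstone must descend to the last level before it is safe to drop, while a `SingleDelete` is dropped as soon as it meets the `Put` it covers.

```python
# Toy LSM (a sketch under stated assumptions, not RocksDB): track how far
# down the level hierarchy a tombstone travels before it can be dropped.

def tombstone_depth(kind, levels, put_level):
    """kind: 'DEL' or 'SDEL'; levels: total number of LSM levels;
    put_level: the level holding the covered Put.
    Returns the level at which the tombstone is finally dropped."""
    for lvl in range(levels):
        if kind == "SDEL" and lvl >= put_level:
            return lvl              # met the original key: both are dropped
    return levels - 1               # a plain Delete must reach the last level

print(tombstone_depth("DEL", levels=7, put_level=2))   # 6
print(tombstone_depth("SDEL", levels=7, put_level=2))  # 2
```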
Expanding on @petermattis's comment https://github.com/cockroachdb/cockroach/issues/8979#issuecomment-243793107, we should explore whether `SingleDelete` can be used for Raft log entries during log truncations. The impetus for this is that Raft log entries should almost never get very deep into the LSM, because log truncations will often follow in close succession. Following this logic, it would be ideal if the deletion tombstones laid down by log truncations also avoided getting very deep into the LSM.

At the moment, this is not the case, because log truncations use `Delete` operations to clean up log entry keys. The effect of this is that even if log entries are removed from the LSM quickly, log entry tombstones stay in the LSM all the way until they hit the bottom. There are a few reasons why this is undesirable. One is that it contributes to write amplification and causes RocksDB to perform extra work. Another is that we've seen that if we're not careful, these tombstones can get in the way of other operations and cause significant slowdowns. `SingleDelete` operations would allow us to avoid both of these problems to some degree.
In order to use `SingleDelete` operations for Raft log truncation, we'd have to ensure that no log entry is ever written to RocksDB twice without an intermediate `SingleDelete`. This seems reasonable, as we have all the necessary information (e.g. `prevLastIndex`) to enforce this invariant in `Replica.append`. In fact, we already have an optimization that depends on us distinguishing whether a log entry is new or whether it's overwriting an existing entry.
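The invariant could be maintained along these lines (a hypothetical sketch, not CockroachDB's actual `Replica.append`; the `RaftLogBatch` class and its field names are invented for illustration): any index at or below the previous last index is being overwritten, so it gets a `SingleDelete` before the new `Put`.

```python
# Hypothetical sketch of the "no Put twice without an intermediate
# SingleDelete" invariant for raft log keys. Not CockroachDB code.

class RaftLogBatch:
    def __init__(self):
        self.ops = []                     # recorded (op, index) pairs

    def append(self, entries, first_index, prev_last_index):
        for i, _ in enumerate(entries, start=first_index):
            if i <= prev_last_index:      # overwriting an existing entry
                self.ops.append(("SDEL", i))
            self.ops.append(("PUT", i))   # fresh entries need no SingleDelete

    def truncate(self, old_first, new_first):
        for i in range(old_first, new_first):
            self.ops.append(("SDEL", i))  # safe: each index Put once since last SDEL

b = RaftLogBatch()
# Leadership change: index 5 is overwritten, index 6 is new.
b.append(["e5", "e6"], first_index=5, prev_last_index=5)
print(b.ops)  # [('SDEL', 5), ('PUT', 5), ('PUT', 6)]
```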
We'd also have to be careful about how `SingleDelete` interacts with existing `Delete` and `RangeDelete` tombstones. For instance, at the moment it looks like `SingleDelete` will actually remove `Delete` tombstones, which is absolutely not what we want and something we'd have to be careful about if migrating to the use of `SingleDelete`.
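The hazard can be made concrete with a toy model (an assumption that `SingleDelete` pairs with whatever entry it meets next, mirroring the concern above; this is not verified RocksDB behavior): if a `SingleDelete` consumes a `Delete` tombstone instead of a `Put`, the value the `Delete` was shadowing becomes visible again.

```python
# Toy illustration of the SingleDelete-meets-Delete hazard (a sketch of
# the worry described above, not verified RocksDB behavior).

def compact_naive(entries):
    """entries newest-first; a SDEL annihilates with the next older entry,
    regardless of whether that entry is a Put or a Delete tombstone."""
    out, i = [], 0
    while i < len(entries):
        if entries[i][0] == "SDEL" and i + 1 < len(entries):
            i += 2                        # consumes Delete and Put alike
        else:
            out.append(entries[i])
            i += 1
    return out

# The Delete was hiding v1 from readers; after the SingleDelete eats the
# Delete tombstone, v1 resurfaces.
print(compact_naive([("SDEL",), ("DEL",), ("PUT", "v1")]))  # [('PUT', 'v1')]
```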
It's also possible that recent changes to `DeleteRange` have made it the ideal choice for Raft log truncation instead. We should experiment with both options.
Another interesting use of this is in cleaning up `MVCCMetadata` keys that are removed when resolving intents. If we were careful about updating these keys, then we could use `SingleDelete` operations to remove them.
The separated lock table uses SingleDelete for locks (MVCCMetadata) when possible. The remaining place where this could be used is for the raft log.
cc @cockroachdb/replication
@tbg on 2023-04-10:
Re-using this issue to focus exclusively on SingleDelete for the raft log, see https://github.com/cockroachdb/cockroach/issues/8979#issuecomment-414015919.
On #4321, @simkha mentioned `SingleDelete` and its handling in compactions.

Jira issue: CRDB-6156
Epic CRDB-39898