apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0
14.39k stars 1.3k forks source link

GetKeyValuesReply changes to support slow realization of emptiness. #3618

Open satherton opened 4 years ago

satherton commented 4 years ago

GetKeyValuesReply has a more flag which, if set, indicates there may be more results after the last KV pair, and thus it requires at least one KV pair to be in the results set.

If, however, the underlying storage engine requires a slow scan of an empty range (matching up KV pairs with recently created tombstones, for example) it may be necessary to return a partial yet empty result.

One way to facilitate this would be to change bool more to Optional<KeyRef> continueKey

to decouple the concept of a partial result from a non-empty one.

This is probably only useful for range reads which can be split across transaction boundaries.

xumengpanda commented 4 years ago

@satherton could you please provide an example to illustrate the problem and the proposed solution, in particular, for this one return a partial yet empty result?

satherton commented 4 years ago

@xumengpanda Let's say you have a range (a, d) with millions of KV pairs

If every key in the range (a, c) is deleted over some short period of time, in some storage engine designs (an LSM, for example) this could result in creation of a tombstone for every key found in (a, c).

Until some sort of compaction / deferred cleanup is done, reads will see both the KV pairs and their tombstones which nullify them so although none of the deleted records will be returned to the user they will be scanned by the storage engine which takes time.

Currently, only non-empty partial results can be represented by GetKeyValuesReply, so progress of this scan cannot be checkpointed unless it reaches at least 1 record which still exists. In the example above, the first existing record which still exists would be at or after c, so if the client does not wait for the (a, d) scan to reach that point then no progress is checkpointed and a timeout / retry would start over again with a.

Note that RangeResultRef does support an empty result with a key to continue reading from, but the underlying GetKeyValuesReply does not. In practice, an empty RangeResultRef with a readThrough key set would only be generated when readThrough is a shard boundary key.

etschannen commented 4 years ago

This should be done when we are also implementing predicate pushdown