Open satherton opened 4 years ago
@satherton could you please provide an example to illustrate the problem and the proposed solution, in particular, for this one return a partial yet empty result
?
@xumengpanda
Let's say you have a range (a, d)
with millions of KV pairs
If every key in the range (a, c)
is deleted over some short period of time, in some storage engine designs (an LSM, for example) this could result in creation of a tombstone for every key found in (a, c)
.
Until some sort of compaction / deferred cleanup is done, reads will see both the KV pairs and their tombstones which nullify them so although none of the deleted records will be returned to the user they will be scanned by the storage engine which takes time.
Currently, only non-empty partial results can be represented by GetKeyValuesReply
, so progress of this scan cannot be checkpointed unless it reaches at least 1 record which still exists. In the example above, the first existing record which still exists would be at or after c
, so if the client does not wait for the (a, d)
scan to reach that point then no progress is checkpointed and a timeout / retry would start over again with a
.
Note that RangeResultRef
does support an empty result with a key to continue reading from, but the underlying GetKeyValuesReply
does not. In practice, an empty RangeResultRef
with a readThrough
key set would only be generated when readThrough
is a shard boundary key.
This should be done when we are also implementing predicate pushdown
GetKeyValuesReply has a
more
flag which, if set, indicates there may be more results after the last KV pair, and thus it requires at least one KV pair to be in the results set.If, however, the underlying storage engine requires a slow scan of an empty range (matching up KV pairs with recently created tombstones, for example) it may be necessary to return a partial yet empty result.
One way to facilitate this would be to change
bool more
toOptional<KeyRef> continueKey
to decouple the concept of a partial result from a non-empty one.
This is probably only useful for range reads which can be split across transaction boundaries.