scottfines opened this issue 3 years ago
Some more debugging: by putting some `TraceEvent` calls right after the `NativeAPI.actor.cpp` call mentioned in the description, I was able to generate the following log messages:
<Event Severity="10" Time="1613676359.255862" DateTime="2021-02-18T19:25:59Z" Type="[sfines]TransactionDebugGetRange" ID="0000000000000000" Req.limitBytes="256" More="1" Machine="127.0.0.1:32086" LogGroup="default" />
...
<Event Severity="10" Time="1613676359.257196" DateTime="2021-02-18T19:25:59Z" Type="[sfines]TransactionDebugGetRange" ID="0000000000000000" Req.limitBytes="1000" More="0" Machine="127.0.0.1:32086" LogGroup="default" />
So the first range request was issued with a byte limit of 256. Since the keys and values are 512 bytes each, that single row already exceeds the target byte limit. I speculate that the server will just automatically set `more=true` whenever the target byte limit is exceeded, even if the entire range result is contained in the response.
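This speculated behavior can be pictured with a minimal Python model (this is not FoundationDB's actual code; the function and its names are invented for illustration) of a byte-limited range read that stops as soon as the limit is reached, without peeking at the next key:

```python
def read_range(sorted_kvs, begin, end, byte_limit):
    """Return (results, more).

    Stops as soon as byte_limit is reached and reports more=True
    unconditionally, without peeking ahead -- so `more` can be True
    even when the range is actually exhausted.
    """
    results, used = [], 0
    for k, v in sorted_kvs:
        if not (begin <= k < end):
            continue
        results.append((k, v))
        used += len(k) + len(v)
        if used >= byte_limit:
            return results, True  # limit hit: more=True, no exhaustion check
    return results, False

# One row with a 512-byte key and a 512-byte value, read with the
# 256-byte first-iteration limit seen in the trace above:
kvs = [("k" * 512, "v" * 512)]
rows, more = read_range(kvs, "a", "z", 256)
# The single row comes back with more=True, so the client issues a
# second read starting just past it -- which returns nothing:
rows2, more2 = read_range(kvs, "k" * 512 + "\x00", "z", 1000)
```

Under this model the first response already contains the entire range, yet the `more` flag still forces the superfluous second round trip.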
I believe you are correct that the storage server will return once the limit is reached with `more` set to true, even if the range is in fact exhausted. This seems like a good opportunity for some optimization.
I think there may also be a case where you need to do an extra range read when your range spans multiple shards. If all of the keys for the range live in one shard and the portion of the range in the second shard is empty, you will probably still need to issue two reads. This is a rarer case, though.
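A toy model of this shard-spanning case (the shard layout and helper are invented for illustration): the client ends up issuing one sub-read per shard the range overlaps, because each storage server only answers for its own shard, even when the tail shard holds no keys in the range.

```python
import bisect

def shards_overlapped(begin, end, splits):
    """Number of shards a range [begin, end) touches, given sorted shard
    boundary keys; shards are (-inf, s0), [s0, s1), ..., [s_last, +inf)."""
    i = bisect.bisect_right(splits, begin)  # shard containing `begin`
    j = bisect.bisect_left(splits, end)     # shard containing `end`
    return j - i + 1

# All keys of [a, f) live left of the split at "g": one sub-read suffices.
one = shards_overlapped("a", "f", ["g"])
# [a, m) crosses the split, so a second sub-read is needed just to learn
# that [g, m) is empty -- the rarer extra-read case described above.
two = shards_overlapped("a", "m", ["g"])
```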
The storage engine (and therefore the storage server) will only know that the range is exhausted if it moves its cursor forward to get the next KV pair and compares its key to the range end after the byte limit has been hit.
If the client won't be requesting more data, this is unnecessary work and could result in unnecessary disk reads, particularly if the next KV pair is large. If the check confirms there is more data and the client does request it, it will likely request it from a different storage engine, although this is generally a problem with load balancing range reads across a storage team.
If an application does a lot of range reads of limit 1 to find the next key after/before some start key, without continuing to read records in that direction, then having the storage engine iterate forward one more record each time could be a significant relative increase in work.
Ideally, a read request would include some sort of flag to indicate its intentions, which would then be plumbed into the storage engine's readRange function so it can perform this check only when useful to the client.
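One way to picture such an intent flag is a small Python sketch (the flag name and the read loop are invented; this is not FoundationDB's actual API): the engine peeks one record past the byte limit only when the caller says it will keep iterating.

```python
def read_range(sorted_kvs, begin, end, byte_limit, caller_will_continue=False):
    """Byte-limited range read with a hypothetical client-intent flag.

    If the caller signals it will keep iterating, the engine peeks one
    record past the byte limit so `more` is accurate and a final empty
    round trip can be avoided.  Otherwise (e.g. limit-1 next-key probes)
    the peek -- and its potential extra disk read -- is skipped, and
    `more` is conservatively reported as true.
    """
    in_range = [(k, v) for k, v in sorted_kvs if begin <= k < end]
    results, used = [], 0
    for i, (k, v) in enumerate(in_range):
        results.append((k, v))
        used += len(k) + len(v)
        if used >= byte_limit:
            if not caller_will_continue:
                return results, True                  # no peek, no extra work
            return results, i + 1 < len(in_range)     # peek: accurate `more`
    return results, False

kvs = [("k" * 512, "v" * 512)]
# The single-row read from the trace: without the flag, more=True and a
# second empty round trip would follow; with it, more=False ends the scan.
_, more_lazy = read_range(kvs, "a", "z", 256)
_, more_accurate = read_range(kvs, "a", "z", 256, caller_will_continue=True)
```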
I'm not certain about this, but I think it may be the case that usages of byte limits are mostly artifacts of how range scanning works rather than explicit attempts to control byte counts. The C bindings are potentially an exception, as any user of them could specify their own byte limits. At least some of the high-level bindings don't allow you to specify byte limits explicitly at all.
In light of the above, it may be reasonable in the case of the byte limit to allow reading one extra entry. On the other hand, there are probably other changes to the iteration behavior (e.g. not starting with such a small limit) that could also make this problem much better.
It occurred to me that if the storage server goes to the next record to see whether it is in range, and the answer is yes, why not just return it? Sure, the byte limit has already been reached, but returning the additional record prevents it from being requested again immediately, potentially from a different StorageServer that didn't just read it into cache. But then of course the `more` flag would have to be set true, and if that additional record was in fact the last one, the next request is now unnecessary. So that would sort of just move the problem.
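Sketching that idea in Python (invented names, not actual FoundationDB code): once the peek has been paid for, return the peeked record too, but `more` must stay true because nothing is known about what follows it.

```python
def read_range_include_peek(sorted_kvs, begin, end, byte_limit):
    """When the byte limit is hit, peek at the next in-range record and,
    if one exists, return it as well rather than discarding the read.
    We still haven't looked past the peeked record, so `more` must stay
    True; if the peeked record happened to be the last one, the follow-up
    request is unnecessary -- the problem has just moved by one record.
    """
    in_range = [(k, v) for k, v in sorted_kvs if begin <= k < end]
    results, used = [], 0
    for i, (k, v) in enumerate(in_range):
        results.append((k, v))
        used += len(k) + len(v)
        if used >= byte_limit:
            if i + 1 < len(in_range):
                results.append(in_range[i + 1])  # hand back the peeked record
                return results, True             # unknown what follows it
            return results, False                # peek found nothing: done
    return results, False

# Two rows against a tiny limit: both come back in one response, but
# more=True, so a (now unnecessary) second request would still be issued.
rows, more = read_range_include_peek(
    [("a", "x" * 300), ("b", "y" * 300)], "a", "z", 256)
```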
It seems like this problem is most severe when one `getRange()` latency becomes two latencies but the first read actually did include all results. @scottfines Is this the situation you are mostly targeting, or was that just an intentionally simple example?
Two range read latencies becoming three is of course less of an increase, and with each additional sub-read the relative cost of any final empty read is less. When exactly does it cease to be a problem?
Perhaps for the first sub-read that a client issues as part of a larger range read, the next-record check is particularly important to avoid a latency doubling. So perhaps the client could have a `firstInSeries` flag to get a more accurate `more` flag in the response. Though it still seems a bit silly to move to the record but not return it, and a larger sub-read size limit would basically achieve the same thing.
Also, what about this situation? The first request comes back with `more` set to true, and the second with `more` set to false. The second request is not empty, but it is 2% the size of the first request, and it still increases the round trip count from 1 to 2. If the next-record check had been done here after the byte limit was reached, just including the additional record in the result would have saved a second round trip.

This post is basically a summary of my brainstorming on this issue. I'm tempted to conclude that
@sfc-gh-satherton I think what you say is true, and this is part of the argument for changing the iteration byte limit progression. There are many use cases that this progression is not helpful for, and I'm not sure how often it ends up being beneficial.
For example, if we just always used something like `WANT_ALL` (i.e. an 80KB limit), then we would likely return multiple results much more often, and the frequency of extra reads that return nothing would be diminished. There will always be cases we can imagine where this isn't perfect, such as your example of a range holding 51K bytes of data against a 50K limit. This is probably much less bad if it isn't happening to every range read, though, and in most cases it probably isn't. Of course, if each KV pair is larger than the limit, then we still have the same issue.
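To put rough numbers on this, here is a small Python model comparing round trips under a growing byte-limit schedule versus a large fixed limit. The 256 and 1000 values come from the trace above; the schedule beyond them, the 80KB `WANT_ALL` figure, and the model itself are illustrative assumptions, not FoundationDB's actual logic.

```python
import math

def round_trips(row_bytes, n_rows, limit_schedule, tail_limit=80_000):
    """Round trips needed to drain a range of n_rows equal-sized rows.

    Request i uses limit_schedule[i] (then tail_limit); the server returns
    whole rows until the byte limit is reached and reports more=True
    whenever it stopped because of the limit -- even if the range is now
    exhausted, which costs one final empty round trip.
    """
    trips, left, i = 0, n_rows, 0
    while True:
        limit = limit_schedule[i] if i < len(limit_schedule) else tail_limit
        cap = max(1, math.ceil(limit / row_bytes))  # rows before limit is hit
        trips += 1
        if left > cap:
            left -= cap            # limit hit, data remains: more=True
            i += 1
        elif left == cap:
            trips += 1             # limit hit exactly: more=True anyway,
            break                  # so one extra empty read follows
        else:
            break                  # finished below the limit: more=False
    return trips

# One 1024-byte row under the small first-iteration limits from the trace:
small_start = round_trips(1024, 1, [256, 1000])
# The same row when every request uses a large WANT_ALL-style limit:
big_start = round_trips(1024, 1, [])
```

In this model the small starting limit doubles the round trips for the single-row case, while the large fixed limit reads it in one.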
The case that does seem to be fairly common, though, is range reading a range that typically holds only one key. It seems like we should optimize for this case so that we don't have to do multiple range reads of such a range when there is only one key.
> The bindings should probably make the sub-read byte limit specifiable or at least influenceable
We do have the streaming modes that allow you to influence the byte limits, and this could be one way for users to optimize those range reads. As mentioned above, though, the default behavior here may not actually be a good default and changing the client library behavior could help.
Looking at running benchmarks, I began counting calls to `getRange_internal` in `FDBTransaction.java`, and noticed that, in a specific circumstance, it's possible to issue two native requests in order to fetch a single row. Specifically, consider the following simple test:
Running this test will fail because `internalCallCounter` will be 2, not 1. Some debugging and I found that the row is returned in the first requested chunk, but that `more` is true within the `AsyncRangeIterator`, so it issues another request. That second request is empty, and superfluous.

By putting some trace statements within `NativeAPI.actor.cpp`, I was able to track the `more` flag to the block which I believe is the actual network call to the server for `getKeyValues`, but I'm not 100% sure of that. At any rate, this call sets `_rep.more = true` on the first call.

If I am correct about the block above, this means that `more` is being returned by the server during the range call, even though there is no more data to be returned. In a perfect world, this additional network call would not be required.