apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

[Java] RangeQueries do additional unnecessary native Range calls #4357

Open scottfines opened 3 years ago

scottfines commented 3 years ago

While running benchmarks, I began counting calls to getRange_internal in FDBTransaction.java, and noticed that, in a specific circumstance, it's possible to issue two native requests in order to fetch a single row.

Specifically, consider the following simple test:

    @Test
    public void onlyDoesOneNativeFetchForLargeKey() throws Exception {
        FDB fdb = FDB.selectApiVersion(700);
        Random rand = new Random();
        byte[] key = new byte[512];
        byte[] value = new byte[512];
        rand.nextBytes(key);
        rand.nextBytes(value);

        try (Database db = fdb.open()) {
            db.run(tr -> {
                tr.set(key, value);
                return null;
            });
            db.run(tr -> {
                final AtomicLong internalCallCounter = new AtomicLong(0L);
                // getPtrDirect() is a temporary function put in FDBTransaction to make it possible to do this delegation easily
                FDBTransaction txn = new FDBTransaction(((FDBTransaction) tr).getPtrDirect(), db, db.getExecutor()) {
                    @Override
                    protected FutureResults getRange_internal(KeySelector begin, KeySelector end, int rowLimit,
                            int targetBytes, int streamingMode, int iteration, boolean isSnapshot, boolean reverse) {
                        internalCallCounter.incrementAndGet();
                        return super.getRange_internal(begin, end, rowLimit, targetBytes, streamingMode, iteration,
                                isSnapshot, reverse);
                    }
                };
                AsyncIterator<KeyValue> kvs = txn.getRange(Range.startsWith(new byte[] { key[0] })).iterator();

                Assert.assertTrue("Did not return a record!", kvs.hasNext());
                KeyValue n = kvs.next();
                Assert.assertArrayEquals("Did not return a key correctly!", key, n.getKey());
                Assert.assertArrayEquals("Did not return the correct value!", value, n.getValue());

                // now check the counter to see how many fetches we did
                Assert.assertEquals("performed too many getRange_internal fetches!", 1L, internalCallCounter.get());
                return null;
            });
        }
    }

Running this test fails because internalCallCounter ends up at 2, not 1. After some debugging, I found that the row is returned in the first requested chunk, but more is true within the AsyncRangeIterator, so it issues another request. That second request is empty and superfluous.
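
To make the pagination behavior concrete, here is a minimal, self-contained sketch of the pattern (this is not the actual RangeQuery/AsyncRangeIterator code; every name below is made up for illustration): the iterator keeps issuing fetches as long as the previous chunk reported more = true, even when that chunk already contained every row in the range.

import java.util.ArrayList;
import java.util.List;

public class PagedRangeSketch {
    // Stand-in for one getRange_internal() result.
    static final class Chunk {
        final List<String> rows;
        final boolean more; // server's hint that another fetch may be needed
        Chunk(List<String> rows, boolean more) { this.rows = rows; this.more = more; }
    }

    interface Fetcher { Chunk fetch(int iteration); }

    // Drains the range, counting native fetches; a trailing chunk with more == true
    // but no further data forces one extra, empty fetch.
    static int drain(Fetcher fetcher) {
        int fetches = 0;
        int iteration = 0;
        Chunk chunk;
        do {
            chunk = fetcher.fetch(++iteration);
            fetches++;
        } while (chunk.more);
        return fetches;
    }

    public static void main(String[] args) {
        // The first chunk holds the only row but still says more == true (byte limit hit),
        // so a second, empty fetch is issued, mirroring the 2-vs-1 count in the test above.
        Fetcher fetcher = iteration -> iteration == 1
                ? new Chunk(List.of("the-only-row"), true)
                : new Chunk(new ArrayList<>(), false);
        System.out.println("native fetches: " + drain(fetcher)); // prints 2
    }
}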

By putting some trace statements within NativeAPI.actor.cpp, I was able to trace the more flag back to this block:

GetKeyValuesReply _rep =
    wait(loadBalance(cx.getPtr(), beginServer.second, &StorageServerInterface::getKeyValues, req,
            TaskPriority::DefaultPromiseEndpoint, false,
            cx->enableLocalityLoadBalance ? &cx->queueModel : nullptr));

which I believe is the actual network call to the server for getKeyValues, though I'm not 100% sure. At any rate, this call sets _rep.more = true on the first request.

If I am correct about the block above, this means the server is returning more = true for this range call even though there is no more data to be returned.

In a perfect world, this additional network call would not be required.

scottfines commented 3 years ago

Some more debugging:

By putting some TraceEvent calls right after the NativeAPI.actor.cpp call mentioned in the description, I was able to generate the following log messages:

<Event Severity="10" Time="1613676359.255862" DateTime="2021-02-18T19:25:59Z" Type="[sfines]TransactionDebugGetRange" ID="0000000000000000" Req.limitBytes="256" More="1" Machine="127.0.0.1:32086" LogGroup="default" />
...
<Event Severity="10" Time="1613676359.257196" DateTime="2021-02-18T19:25:59Z" Type="[sfines]TransactionDebugGetRange" ID="0000000000000000" Req.limitBytes="1000" More="0" Machine="127.0.0.1:32086" LogGroup="default" />

So the first range request was issued with a byte limit of 256. Since the key and value are 512 bytes each, that single 1024-byte row exceeds the target byte limit.

I speculate that the server will just automatically set more=true whenever the target bytes limit is exceeded, even if the entire range result is contained in the response.
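
For what it's worth, here is a hedged sketch of that speculated behavior (written in Java purely for illustration; the real logic lives in the C++ storage server, and this only models my reading of it): rows are emitted until the byte budget is spent, at which point more is set to true without checking whether any rows actually remain.

import java.util.List;

public class ByteLimitSketch {
    record KV(byte[] key, byte[] value) {}
    record Reply(List<KV> data, boolean more) {}

    // Speculated behavior only: stop once the byte budget is exhausted and report
    // more = true, even if the rows returned already cover the whole range.
    static Reply readRange(List<KV> shardRows, int targetBytes) {
        int bytes = 0;
        int taken = 0;
        for (KV kv : shardRows) {
            bytes += kv.key().length + kv.value().length;
            taken++;
            if (bytes >= targetBytes) {
                return new Reply(shardRows.subList(0, taken), true);
            }
        }
        return new Reply(shardRows.subList(0, taken), false);
    }

    public static void main(String[] args) {
        // One 512+512 byte row against a 256-byte target: the row is returned,
        // but more = true, prompting the client to issue a second, empty read.
        Reply rep = readRange(List.of(new KV(new byte[512], new byte[512])), 256);
        System.out.println("rows=" + rep.data().size() + " more=" + rep.more());
    }
}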

sfc-gh-abeamon commented 3 years ago

I believe you are correct that the storage server will return once the limit is reached with more set to true, even if the range is in fact exhausted. This seems like a good opportunity for some optimization.

I think there may also be a case where you need to do an extra range read when your range spans multiple shards. If all of the keys for that range live in one shard and the portion of the range in the second shard is empty, you will probably still need to issue two reads. This is a rarer case, though.

sfc-gh-satherton commented 3 years ago

The storage engine (and therefore the storage server) will only know if the range is exhausted if it moves its cursor forward to get the next KV pair and compare the key to the range end after the byte limit has been hit.

If the client won't be requesting more data, this is unnecessary work and could result in unnecessary disk reads, particularly if the next KV pair is large. If the check confirms there is more data and the client does request it, it will likely request it from a different storage engine, although this is generally a problem with load balancing range reads across a storage team.

If an application does a lot of range reads of limit 1 to find the next key after/before some start key and without continuing to read records in that direction, then having the storage engine iterate forward 1 more record each time could be a significant relative work increase.
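
For reference, the limit-1 pattern described above looks roughly like this through the public Java binding API (a minimal sketch; the key values here are made-up examples):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.KeySelector;
import com.apple.foundationdb.KeyValue;
import com.apple.foundationdb.async.AsyncIterator;

public class NextKeyExample {
    public static void main(String[] args) {
        FDB fdb = FDB.selectApiVersion(700);
        try (Database db = fdb.open()) {
            byte[] someKey = new byte[]{0x01, 0x02};   // example start key
            byte[] rangeEnd = new byte[]{(byte) 0xFF}; // example end of the scanned range
            KeyValue next = db.run(tr -> {
                // Find the first key strictly after someKey; only one row is wanted,
                // so any work the storage engine does past that row is wasted.
                AsyncIterator<KeyValue> it = tr.getRange(
                        KeySelector.firstGreaterThan(someKey),
                        KeySelector.firstGreaterOrEqual(rangeEnd),
                        /* limit */ 1).iterator();
                return it.hasNext() ? it.next() : null;
            });
            System.out.println(next == null ? "no next key" : "found next key");
        }
    }
}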

Ideally, a read request would include some sort of flag to indicate its intentions, which would then be plumbed into the storage engine's readRange function so it can perform this check only when useful to the client.
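
As a purely hypothetical sketch of that idea (none of these names exist in FoundationDB today), the request would carry the client's intent so the server only pays for the next-record check when the answer will actually be used:

// Hypothetical request shape for illustration only; not an existing FoundationDB API.
final class RangeReadRequestSketch {
    byte[] begin;
    byte[] end;
    int targetBytes;
    boolean willContinueReading; // hypothetical intent flag set by the client

    // Server-side pseudologic: only peek past the byte limit when the client
    // will follow up, so a more accurate 'more' flag is worth the extra read.
    boolean shouldCheckForMoreData() {
        return willContinueReading;
    }
}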

sfc-gh-abeamon commented 3 years ago

I'm not certain about this, but I think it may be the case that usages of byte limits are mostly artifacts of how range scanning works rather than explicit attempts to control the byte counts. The C bindings are an exception to this potentially, as any user of them could specify their own byte limits. At least some of the high-level bindings don't allow you to specify byte limits explicitly at all.

In light of the above, it may be reasonable in the case of the byte limit to allow reading one extra entry. On the other hand, there are probably other changes to the iteration behavior (e.g. not starting with such a small limit) that could also greatly reduce this problem.

sfc-gh-satherton commented 3 years ago

It occurred to me that if the storage server goes to the next record to see if it is in range, and the answer is yes, why not just return it? Sure, the byte limit has been reached already, but returning the additional record prevents it from being requested again immediately, potentially from a different StorageServer that didn't just read it into cache. But then of course the "more" flag would have to be set to true, and if that additional record was in fact the last one, then the next request is again unnecessary. So that would sort of just move the problem.

It seems like this problem is most severe when one getRange() latency becomes two latencies but the first read actually did include all results. @scottfines Is this the situation you are mostly targeting, or was that just an intentionally simple example?

Two range read latencies becoming three is of course less of an increase, and with each additional sub-read the relative cost of any final empty read is less. When exactly does it cease to be a problem?

Perhaps the next-record check is particularly important for the first sub-read that a client issues as part of a larger range read, to avoid a latency doubling. So perhaps the client could have a firstInSeries flag to get a more accurate more flag in the response. Though it still seems a bit silly to move to the record but not return it, and a larger sub-read size limit would basically achieve the same thing.

Also, what about this situation?

This post is basically a summary of my brainstorming on this issue. I'm tempted to conclude that the bindings should probably make the sub-read byte limit specifiable, or at least influenceable.

sfc-gh-abeamon commented 3 years ago

@sfc-gh-satherton I think what you say is true, and this is part of the argument for changing the iteration byte limit progression. There are many use cases for which this progression is not helpful, and I'm not sure how often it ends up being beneficial.

For example, if we just always used something like WANT_ALL (i.e. an 80KB limit), then we would likely return multiple results much more often, and in most cases the frequency of extra reads that return nothing would be diminished.

There will always be cases we can imagine where this isn't perfect, such as your example with ranges having 51K bytes of data and a 50K limit. This is probably much less bad if it's not happening to every range read, though, and in most cases it probably isn't. Of course, if each kv-pair is larger than this limit, then we still have the same issue.

The case that does seem to be fairly common, though, is range reading a range that typically has only 1 key. It seems like we should optimize for this case so that we don't have to do multiple range reads of this range when there is only 1 key.

The bindings should probably make the sub-read byte limit specifiable or at least influenceable.

We do have the streaming modes that allow you to influence the byte limits, and this could be one way for users to optimize those range reads. As mentioned above, though, the default behavior here may not actually be a good default and changing the client library behavior could help.
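
For example, a range read can request large chunks up front by passing StreamingMode.WANT_ALL explicitly. A minimal sketch using the public Java binding API (the range prefix here is just an example value):

import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.KeyValue;
import com.apple.foundationdb.Range;
import com.apple.foundationdb.ReadTransaction;
import com.apple.foundationdb.StreamingMode;

public class WantAllExample {
    public static void main(String[] args) {
        FDB fdb = FDB.selectApiVersion(700);
        try (Database db = fdb.open()) {
            db.run(tr -> {
                Range range = Range.startsWith(new byte[]{0x01}); // example prefix
                // WANT_ALL asks for large chunks immediately, so a small range is
                // more likely to come back in a single native read than with the
                // default ITERATOR progression.
                for (KeyValue kv : tr.getRange(range,
                        ReadTransaction.ROW_LIMIT_UNLIMITED,
                        false /* reverse */,
                        StreamingMode.WANT_ALL)) {
                    // process kv ...
                }
                return null;
            });
        }
    }
}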