Closed rngcntr closed 3 years ago
Having such a config seems useful. Per JanusGraph instance makes sense to me because we generally don't want to fire super huge queries at the storage backends.
As I said, an option would be to make the limit depend on preceding `NoOpBarrierStep`s. They have a default limit of 2500, so we could add a global setting to either use this value for MultiQueries or ignore it (as is the case right now).
I will prepare a PR for TinkerPop to make the `maxBarrierSize` of `NoOpBarrierStep`s public so that we can use it here.
@rngcntr Should we close this issue? Looks like it was fixed by your PR.
MultiQueries are useful to group backend queries instead of sending them one by one and paying a round-trip penalty for each query. However, this can lead to a huge overhead for queries with a defined `limit()`. As an example, consider the following profile:

The profiled query performs three `in()` steps and checks whether the result contains at least 5000 vertices. The first `in()` step yields a result set of 973 adjacent vertices. The second `in()` step takes all of these 973 vertices and constructs a MultiQuery against the backend, which returns 17514 vertices. Lastly, the third `in()` step sends these to the backend in an extremely large MultiQuery, which returns 124000 vertices, even though only 5000 of those are actually needed.

Without MultiQueries, a Gremlin query like this would be able to terminate early without querying the backend for all adjacent vertices, but it would suffer a latency penalty for each small query. To combine the best of both worlds, a configurable batch size for MultiQueries would help to avoid loading far more entries than needed, while keeping the latency penalty low.
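To make the trade-off concrete, here is a small, self-contained Python simulation (this is not JanusGraph code, just a toy model; the fan-out of 18 is taken from the profile above, where 973 vertices expand to 17514 = 973 × 18):

```python
# Toy model comparing backend round trips for three strategies of
# expanding a frontier of parent vertices when only `limit` results
# are ultimately needed.

FANOUT = 18  # average neighbours per vertex, matching the profile (17514 / 973)

def expand(parents, limit, batch_size=None):
    """Fetch children of `parents` from a fake backend in batches,
    stopping as soon as `limit` children have been collected.
    Returns (children, round_trips)."""
    if batch_size is None:
        batch_size = len(parents)  # one giant MultiQuery (current behaviour)
    children, round_trips = [], 0
    for i in range(0, len(parents), batch_size):
        batch = parents[i:i + batch_size]
        round_trips += 1  # one backend round trip per batch
        children.extend(c for p in batch
                          for c in range(p * FANOUT, (p + 1) * FANOUT))
        if len(children) >= limit:  # early termination once the limit is hit
            break
    return children[:limit], round_trips

parents = list(range(973))  # result of the second in() step in the profile

# One-by-one queries: minimal overfetch, but one round trip per parent.
_, rt_single = expand(parents, limit=5000, batch_size=1)
# One giant MultiQuery: a single round trip, but all 17514 children fetched.
_, rt_multi = expand(parents, limit=5000)
# Capped batches: few round trips *and* early termination after the limit.
res, rt_capped = expand(parents, limit=5000, batch_size=100)

print(rt_single, rt_multi, rt_capped, len(res))
```

With a batch size of 100, the simulation stops after three round trips instead of fetching all 17514 children or paying 278 round trips, which is exactly the middle ground a configurable MultiQuery batch size would offer.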
The question remains whether the max size should be configured per JanusGraph instance, per query, or even per step. A nice solution would be to use the barrier size of preceding `NoOpBarrierStep`s to determine the batch size of the following `VertexStep`. Sadly, `NoOpBarrierStep` does not make its `maxBarrierSize` public.
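For illustration, TinkerPop already lets a traversal set a barrier size explicitly via the `barrier(maxBarrierSize)` step, which a strategy like the one proposed here could read as the batch-size hint. A hypothetical Gremlin sketch (the property key and value are made up):

```groovy
// Hypothetical: an explicit barrier(n) before a vertex step could serve as
// the batch-size hint for the MultiQuery issued by the following in().
g.V().has('name', 'root').
  in().barrier(100).   // hand at most 100 traversers to the next step at once
  in().limit(5000)
```

This is only a sketch of the idea; today JanusGraph ignores the barrier size when assembling a MultiQuery, which is precisely what this issue is about.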