Open rngcntr opened 4 years ago
Totally agree. @dk-github also proposed similar things at https://github.com/JanusGraph/janusgraph/issues/2048#issuecomment-601278988 and https://github.com/JanusGraph/janusgraph/issues/2032#issuecomment-596080054. The original author of Titan also mentioned https://github.com/JanusGraph/janusgraph/blob/051c56053c398a7b032838908afb4e7bee7c6c4f/janusgraph-core/src/main/java/org/janusgraph/graphdb/query/graph/GraphCentricQueryBuilder.java#L356
This sounds like the long-term and ultimate solution, but it requires a deep understanding of the codebase (e.g., how to derive/calculate/store/load selectivity information) and significant effort as well. Thus I chose to make some small incremental efforts at https://github.com/JanusGraph/janusgraph/pull/2040 and https://github.com/JanusGraph/janusgraph/pull/2055, but I would be happy to see someone start working on the long-term solution. If possible, we can try to break this epic down into several smaller stories.
From my understanding, having a complete picture of indexes' selectivity has three major benefits:
1. A much more powerful criterion for index selection in `GraphCentricQueryBuilder`. As you mentioned, it is not good enough to select based on the number of covered conditions (as well as whether the index is composite or mixed, whether it's an exact match, etc.).
2. Guidance on which index queries we should prioritize and by how much. The original implementation was very naive, and I tried to give a better estimation at https://github.com/JanusGraph/janusgraph/pull/2040/files#r410597752, where I try to evaluate the selectivity of an index on the fly, but whether that works well is still in doubt.
3. Guidance on which index queries we should execute in the first place. Streaming on the fly could be a choice; in-memory filtering (rather than actually executing the index query and computing the intersection) could also be a choice.
I guess the first step is to design a structure for index information/usage statistics and somehow persist it. Then we can make the most of this information in many places around index queries.
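To make this more concrete, here is a minimal sketch of what such a persistable per-index statistics record could look like. Everything here (class name, field set, the uniform-distribution assumption behind `equalitySelectivity`) is purely hypothetical and not an actual JanusGraph design:

```java
import java.io.Serializable;
import java.time.Instant;

/**
 * Hypothetical sketch of a persistable per-index statistics record that the
 * query planner could load. The field set is a guess, not an actual design.
 */
public class IndexStatistics implements Serializable {
    public final String indexName;
    public final long indexedElementCount; // total entries in the index
    public final long distinctValueCount;  // cardinality of indexed values
    public final Instant collectedAt;      // when the statistics were gathered

    public IndexStatistics(String indexName, long indexedElementCount,
                           long distinctValueCount, Instant collectedAt) {
        this.indexName = indexName;
        this.indexedElementCount = indexedElementCount;
        this.distinctValueCount = distinctValueCount;
        this.collectedAt = collectedAt;
    }

    /** Estimated selectivity of an equality condition, assuming uniformly distributed values. */
    public double equalitySelectivity() {
        return distinctValueCount == 0 ? 1.0 : 1.0 / distinctValueCount;
    }
}
```

Even such a coarse record (total count plus distinct-value count) would already let the planner prefer a high-cardinality index over a low-cardinality one; histograms would refine this further.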
https://github.com/JanusGraph/janusgraph/issues/270 might also be related, which requests a feature for users to define priority themselves.
Hi @li-boxuan, thanks for your comment! I totally agree with all of your ideas, let me just add a few thoughts:
Thank you @rngcntr and @li-boxuan for working on this! I have some thoughts to share.
> To achieve this, it is necessary to somehow track the selectivity of index queries. This could be done by keeping track of inserted elements by maintaining histograms.
As JanusGraph is a distributed database, we can't be sure how many elements each instance has inserted. Maybe the user uses server A for insertions and server B for reads. Perhaps we could synchronize the tracking of inserted elements using the same technique as vertex id assignment. We could track insertions per server and synchronize all statistics with some delay.
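The per-server counting part of this idea could be sketched as follows. All names here are made up for illustration, and the synchronization transport (e.g., piggybacking on the management log, as vertex id assignment does for id blocks) is deliberately out of scope:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical sketch: each JanusGraph instance counts its own insertions per
 * index locally and periodically publishes a snapshot; readers merge all
 * published snapshots to approximate cluster-wide statistics.
 */
public class PerServerInsertionStats {
    private final Map<String, LongAdder> localCounts = new ConcurrentHashMap<>();

    /** Called on this server whenever an element is added to an index. */
    public void recordInsertion(String indexName) {
        localCounts.computeIfAbsent(indexName, k -> new LongAdder()).increment();
    }

    /** Immutable snapshot of this server's counters, suitable for publishing. */
    public Map<String, Long> snapshot() {
        Map<String, Long> out = new HashMap<>();
        localCounts.forEach((k, v) -> out.put(k, v.sum()));
        return out;
    }

    /** Merge snapshots published by all servers into cluster-wide totals. */
    public static Map<String, Long> merge(List<Map<String, Long>> snapshots) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> s : snapshots) {
            s.forEach((idx, n) -> total.merge(idx, n, Long::sum));
        }
        return total;
    }
}
```

Since the merged totals are only refreshed with some delay, the statistics are necessarily approximate, which is acceptable for query planning.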
> Given that we are eventually able to predict the selectivity of an index access, #270 could even become obsolete, as there would be no point in selecting any other index than the "best" one. (Only theoretically speaking of course, I'm pretty sure there are exceptions to this).
Possibly, but it is hard to say which one is the "best". #270 would allow the user to choose a prioritized index, and maybe the user knows better which index is best than a prediction based on the last N queries or on insertion counts. Of course, if we are able to choose the truly "best" one, then #270 won't be useful anymore.
Also, it is interesting how much data "keeping track of inserted elements" would take. There is not much sense in knowing just the total number of indexed elements, so I guess we need to keep track of indexed elements per entity or per query. That could take a lot of data, so most likely we won't be able to keep that information in RAM.
For example, getting back to `g.V().hasLabel(car).has('name', 'Edison Model S').has('seats', 5).has('gears', 1)`: let's imagine that `g.V().hasLabel(car).has('name', 'A')` would return 5 entities but `g.V().hasLabel(car).has('name', 'B')` would return 1 million entities. Which index do we then need to choose for the next query, `g.V().hasLabel(car).has('name', 'C').has('seats', 5).has('gears', 1)`? It is hard to say, as we don't yet know the result count for `C`. We could, of course, use some prediction (like `(result size from A + result size from B) / 2`), maybe AI or something like that, to find out whether we should choose the index `carByModelName` or `carBySeatsAndGears` here.
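The "predict from previous result sizes" idea above could be prototyped as a simple moving average over the last N observed result sizes of an index. Class and method names are invented for illustration; this is not JanusGraph API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch: estimate the result size of the next query against an
 * index from the result sizes of the last N queries against the same index.
 */
public class IndexSelectivityEstimator {
    private final int window;
    private final Deque<Long> recentResultSizes = new ArrayDeque<>();
    private long sum = 0;

    public IndexSelectivityEstimator(int window) {
        this.window = window;
    }

    /** Record the observed result size of a completed index query. */
    public void record(long resultSize) {
        recentResultSizes.addLast(resultSize);
        sum += resultSize;
        if (recentResultSizes.size() > window) {
            sum -= recentResultSizes.removeFirst();
        }
    }

    /** Predict the next result size as the mean over the window. */
    public double estimate() {
        if (recentResultSizes.isEmpty()) {
            return Double.MAX_VALUE; // no statistics yet: assume the worst case
        }
        return (double) sum / recentResultSizes.size();
    }
}
```

With the `A`/`B` example above, recording 5 and 1,000,000 yields an estimate of roughly 500,000 for `C`, which illustrates exactly the weakness discussed here: a mean is easily dominated by outliers, so a skewed value distribution defeats it.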
This indeed looks like a long-term solution, and it would require proper research.
The first thing that comes to my mind as a way of "keeping track" would be in-memory histograms. One can classify inserted values into a number of buckets and use the bucket counter as an approximation (as in "hard limit") of how many matching values there are.
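A minimal version of such a bucket histogram could look like the following. This is an illustrative equi-width sketch with invented names, ignoring deletions, non-numeric properties, and persistence:

```java
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Hypothetical sketch of an in-memory equi-width histogram for a numeric
 * property. The bucket counter is an upper bound ("hard limit") on how many
 * indexed values can match an equality condition falling into that bucket.
 */
public class EquiWidthHistogram {
    private final double min, max;
    private final AtomicLongArray buckets;

    public EquiWidthHistogram(double min, double max, int bucketCount) {
        this.min = min;
        this.max = max;
        this.buckets = new AtomicLongArray(bucketCount);
    }

    private int bucketOf(double value) {
        if (value <= min) return 0;
        if (value >= max) return buckets.length() - 1;
        return (int) ((value - min) / (max - min) * buckets.length());
    }

    /** Called on insertion of an indexed property value. */
    public void record(double value) {
        buckets.incrementAndGet(bucketOf(value));
    }

    /** Upper bound on the number of elements equal to {@code value}. */
    public long estimateCount(double value) {
        return buckets.get(bucketOf(value));
    }
}
```

The memory footprint is a fixed `bucketCount` counters per indexed property, which addresses the RAM concern raised above at the cost of estimation precision.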
I'd like to continue the discussion on this topic because this is something I expect to have a massive impact on performance. I have recently studied the use of histograms in relational DBMSs, and it looks like most systems put the user in charge of keeping histograms up to date. Building such a histogram is a trivial but time-consuming task. Therefore, other systems only generate histograms on explicit request by the user. The histograms are then stored and not affected by future insertions and deletions. If the data distribution stays approximately the same over time (think of hash values as an extreme case), then taking this effort once is enough to guarantee valuable estimations. If the distribution changes, the user has to decide to refresh the histogram every now and then.
Following this approach would give us the following benefits:
@rngcntr Sounds interesting. Would you mind sharing some links that describe the use of histograms by relational DBMS? Did you also find anything similar in NoSQL databases?
I can recommend reading https://www.slideshare.net/SergeyPetrunya/histograms-in-mariadb-mysql-and-postgresql for a quick introduction.
> Therefore, other systems only generate histograms on explicit request by the user.
This seems to be an ideal use case for TinkerPop's new `call()` step. I'd imagine calls like `g.call('update-histograms').with('property', 'seats')`.
After gaining some insights into the `GraphCentricQueryBuilder` and `SubqueryIterator` internals thanks to @li-boxuan's PRs (#2040, #2055), I think we can improve the selection of indexes to answer a query even more. The current implementation tries to cover as many conditions as possible by consulting one or more indexes (see #2048). Concerning the selection of indexes, our only measure of how useful an index is, is the number of conditions it covers.

In the long term, we should focus on selecting indexes by (an estimation of) their selectivity. Consider the following example. Assume we are querying a database of cars and have two indexes:

- `carByModelName` covers the property `name`
- `carBySeatsAndGears` covers the properties `seats` and `gears`

For the query `g.V().hasLabel(car).has('name', 'Edison Model S').has('seats', 5).has('gears', 1)`, the current implementation would prefer the index `carBySeatsAndGears`, because it covers two conditions, in contrast to `carByModelName`, which only covers one. Obviously, it would be more efficient to consult `carByModelName`, which reduces the intermediate result size immensely.

To achieve this, it is necessary to somehow track the selectivity of index queries. This could be done by keeping track of inserted elements by maintaining histograms. An easier (and probably less effective) solution would be to keep track of how selective the last `N` queries to an index were and thereby estimate the selectivity of the `N+1`st query.

In theory, if we can improve the index selection to always find the most selective index, we can achieve another gain in performance by streaming the results of the first index query and matching all remaining conditions on the fly, instead of waiting for the (larger) result sets of other index queries and intersecting them.
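The streaming idea in the last paragraph can be sketched in a few lines: take the result stream of the most selective index and filter the uncovered conditions in memory instead of executing further index queries and intersecting. The `Vertex` model and method names below are simplified stand-ins, not JanusGraph types:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/**
 * Hypothetical sketch: instead of executing every candidate index query and
 * intersecting the result sets, stream the results of the most selective
 * index and check the remaining conditions on the fly.
 */
public class StreamingIndexQuery {
    /** Simplified vertex stand-in: just a property map. */
    record Vertex(Map<String, Object> properties) {}

    /**
     * Stream results of the most selective index (e.g. carByModelName) and
     * filter the equality conditions that index does not cover.
     */
    static List<Vertex> query(Stream<Vertex> mostSelectiveIndexResults,
                              Map<String, Object> remainingConditions) {
        return mostSelectiveIndexResults
                .filter(v -> remainingConditions.entrySet().stream()
                        .allMatch(c -> c.getValue().equals(v.properties().get(c.getKey()))))
                .collect(Collectors.toList());
    }
}
```

In the car example, this would stream the handful of `carByModelName` hits and check `seats` and `gears` per element, never materializing the huge `carBySeatsAndGears` result set.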