Open rngcntr opened 4 years ago
Totally agree. @dk-github also proposed similar things at https://github.com/JanusGraph/janusgraph/issues/2048#issuecomment-601278988 and https://github.com/JanusGraph/janusgraph/issues/2032#issuecomment-596080054. The original author of Titan also mentioned https://github.com/JanusGraph/janusgraph/blob/051c56053c398a7b032838908afb4e7bee7c6c4f/janusgraph-core/src/main/java/org/janusgraph/graphdb/query/graph/GraphCentricQueryBuilder.java#L356
This sounds like the long-term and ultimate solution, but it requires a deep understanding of the codebase (e.g., how to derive/calculate/store/load selectivity information) and significant effort as well. Thus I chose to make some small incremental efforts at https://github.com/JanusGraph/janusgraph/pull/2040 and https://github.com/JanusGraph/janusgraph/pull/2055, but I would be happy to see someone start working on the long-term solution. If possible, we can try to break this epic down into several smaller stories.
From my understanding, having a complete picture of indexes' selectivity has three major benefits:
1. A much more powerful criterion for index selection in `GraphCentricQueryBuilder`. As you mentioned, it is not good enough to select based on the number of covered conditions (as well as whether the index is composite or mixed, whether it's an exact match, etc.).
2. Guidance on which index queries we should prioritize and by how much. The original implementation was very naive, and I tried to give a better estimation at https://github.com/JanusGraph/janusgraph/pull/2040/files#r410597752, where I try to evaluate the selectivity of an index on the fly, but whether that works well is still in doubt.
3. Guidance on which index queries we should execute in the first place. Streaming on the fly could be a choice; in-memory filtering (rather than actually executing the index query and computing the intersection) could also be a choice.
I guess the first step is to design a structure for index information/usage statistics and somehow persist it. Then we can make the most of this information in many places around index queries.
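To make this more concrete, here is a minimal sketch of what such a persistable per-index statistics record could look like. Everything here (class name, field set, the uniform-distribution assumption behind `equalitySelectivity`) is purely hypothetical and not an actual JanusGraph design:

```java
import java.io.Serializable;
import java.time.Instant;

/**
 * Hypothetical sketch of a persistable per-index statistics record that the
 * query planner could load. The field set is a guess, not an actual design.
 */
public class IndexStatistics implements Serializable {
    public final String indexName;
    public final long indexedElementCount; // total entries in the index
    public final long distinctValueCount;  // cardinality of indexed values
    public final Instant collectedAt;      // when the statistics were gathered

    public IndexStatistics(String indexName, long indexedElementCount,
                           long distinctValueCount, Instant collectedAt) {
        this.indexName = indexName;
        this.indexedElementCount = indexedElementCount;
        this.distinctValueCount = distinctValueCount;
        this.collectedAt = collectedAt;
    }

    /** Estimated selectivity of an equality condition, assuming uniformly distributed values. */
    public double equalitySelectivity() {
        return distinctValueCount == 0 ? 1.0 : 1.0 / distinctValueCount;
    }
}
```

Even such a coarse record (total count plus distinct-value count) would already let the planner prefer a high-cardinality index over a low-cardinality one; histograms would refine this further.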
https://github.com/JanusGraph/janusgraph/issues/270 might also be related, which requests a feature for users to define priority themselves.
Hi @li-boxuan, thanks for your comment! I totally agree with all of your ideas, let me just add a few thoughts:
Thank you @rngcntr and @li-boxuan for working on this! I have some thoughts to share.
> To achieve this, it is necessary to somehow track the selectivity of index queries. This could be done by keeping track of inserted elements by maintaining histograms.
As JanusGraph is a distributed database, we can't be sure how many elements each instance has inserted. Maybe the user uses server A for insertions and server B for reads. Perhaps we could synchronize the tracking of inserted elements using the same technique as vertex id assignment. We could track insertions per server and synchronize all statistics with some delay.
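The per-server counting part of this idea could be sketched as follows. All names here are made up for illustration, and the synchronization transport (e.g., piggybacking on the management log, as vertex id assignment does for id blocks) is deliberately out of scope:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical sketch: each JanusGraph instance counts its own insertions per
 * index locally and periodically publishes a snapshot; readers merge all
 * published snapshots to approximate cluster-wide statistics.
 */
public class PerServerInsertionStats {
    private final Map<String, LongAdder> localCounts = new ConcurrentHashMap<>();

    /** Called on this server whenever an element is added to an index. */
    public void recordInsertion(String indexName) {
        localCounts.computeIfAbsent(indexName, k -> new LongAdder()).increment();
    }

    /** Immutable snapshot of this server's counters, suitable for publishing. */
    public Map<String, Long> snapshot() {
        Map<String, Long> out = new HashMap<>();
        localCounts.forEach((k, v) -> out.put(k, v.sum()));
        return out;
    }

    /** Merge snapshots published by all servers into cluster-wide totals. */
    public static Map<String, Long> merge(List<Map<String, Long>> snapshots) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> s : snapshots) {
            s.forEach((idx, n) -> total.merge(idx, n, Long::sum));
        }
        return total;
    }
}
```

Since the merged totals are only refreshed with some delay, the statistics are necessarily approximate, which is acceptable for query planning.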
> Given that we are eventually able to predict the selectivity of an index access, #270 could even become obsolete, as there would be no point in selecting any other index than the "best" one. (Only theoretically speaking of course, I'm pretty sure there are exceptions to this).
Possibly, but it is hard to say which one is the "best". #270 would allow the user to choose a prioritized index, and maybe the user knows better which index is best than a prediction based on the last N queries or on insertion counts. Of course, if we are able to choose the truly "best" one, then #270 won't be useful anymore.
Also, it is interesting how much data "keeping track of inserted elements" would take. There is not much sense in knowing just the total number of indexed elements, so I guess we need to keep track of indexed elements per entity or per query. That could take a lot of data, so most likely we won't be able to keep that information in RAM.
For example, getting back to `g.V().hasLabel(car).has('name', 'Edison Model S').has('seats', 5).has('gears', 1)`: let's imagine that `g.V().hasLabel(car).has('name', 'A')` would return 5 entities but `g.V().hasLabel(car).has('name', 'B')` would return 1 million entities. Which index do we then need to choose for the next query, `g.V().hasLabel(car).has('name', 'C').has('seats', 5).has('gears', 1)`? It is hard to say, as we don't yet know the result count for `C`. We could, of course, use some prediction (like `(result size from A + result size from B) / 2`), maybe AI or something like that, to find out whether we should choose the index `carByModelName` or `carBySeatsAndGears` here.
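The "predict from previous result sizes" idea above could be prototyped as a simple moving average over the last N observed result sizes of an index. Class and method names are invented for illustration; this is not JanusGraph API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch: estimate the result size of the next query against an
 * index from the result sizes of the last N queries against the same index.
 */
public class IndexSelectivityEstimator {
    private final int window;
    private final Deque<Long> recentResultSizes = new ArrayDeque<>();
    private long sum = 0;

    public IndexSelectivityEstimator(int window) {
        this.window = window;
    }

    /** Record the observed result size of a completed index query. */
    public void record(long resultSize) {
        recentResultSizes.addLast(resultSize);
        sum += resultSize;
        if (recentResultSizes.size() > window) {
            sum -= recentResultSizes.removeFirst();
        }
    }

    /** Predict the next result size as the mean over the window. */
    public double estimate() {
        if (recentResultSizes.isEmpty()) {
            return Double.MAX_VALUE; // no statistics yet: assume the worst case
        }
        return (double) sum / recentResultSizes.size();
    }
}
```

With the `A`/`B` example above, recording 5 and 1,000,000 yields an estimate of roughly 500,000 for `C`, which illustrates exactly the weakness discussed here: a mean is easily dominated by outliers, so a skewed value distribution defeats it.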
This indeed looks like a long-term solution, and it would require proper research.
The first thing that comes to my mind as a way of "keeping track" would be in-memory histograms. One can classify inserted values into a number of buckets and use the bucket counter as an approximation (as in "hard limit") of how many matching values there are.
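A minimal version of such a bucket histogram could look like the following. This is an illustrative equi-width sketch with invented names, ignoring deletions, non-numeric properties, and persistence:

```java
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Hypothetical sketch of an in-memory equi-width histogram for a numeric
 * property. The bucket counter is an upper bound ("hard limit") on how many
 * indexed values can match an equality condition falling into that bucket.
 */
public class EquiWidthHistogram {
    private final double min, max;
    private final AtomicLongArray buckets;

    public EquiWidthHistogram(double min, double max, int bucketCount) {
        this.min = min;
        this.max = max;
        this.buckets = new AtomicLongArray(bucketCount);
    }

    private int bucketOf(double value) {
        if (value <= min) return 0;
        if (value >= max) return buckets.length() - 1;
        return (int) ((value - min) / (max - min) * buckets.length());
    }

    /** Called on insertion of an indexed property value. */
    public void record(double value) {
        buckets.incrementAndGet(bucketOf(value));
    }

    /** Upper bound on the number of elements equal to {@code value}. */
    public long estimateCount(double value) {
        return buckets.get(bucketOf(value));
    }
}
```

The memory footprint is a fixed `bucketCount` counters per indexed property, which addresses the RAM concern raised above at the cost of estimation precision.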
I'd like to continue the discussion on this topic because this is something I expect to have a massive impact on performance. I have recently studied the use of histograms in relational DBMSs, and it looks like most systems put the user in charge of keeping histograms up to date. Building such a histogram is a trivial but time-consuming task. Therefore, other systems only generate histograms on explicit request by the user. The histograms are then stored and not affected by future insertions and deletions. If the data distribution stays approximately the same over time (think of hash values as an extreme case), then taking this effort once is enough to guarantee valuable estimations. If the distribution changes, the user has to decide to refresh the histogram every now and then.
Following this approach would give us the following benefits:
@rngcntr Sounds interesting. Would you mind sharing some links that describe the use of histograms by relational DBMS? Did you also find anything similar in NoSQL databases?
I can recommend reading https://www.slideshare.net/SergeyPetrunya/histograms-in-mariadb-mysql-and-postgresql for a quick introduction.
> Therefore, other systems only generate histograms on explicit request by the user.
This seems to be an ideal use case for TinkerPop's new `call()` step. I'd imagine calls like `g.call('update-histograms').with('property', 'seats')`.
After gaining some insights into the `GraphCentricQueryBuilder` and `SubqueryIterator` internals thanks to @li-boxuan's PRs (#2040, #2055), I think we can improve the selection of indexes to answer a query even more. The current implementation tries to cover as many conditions as possible by consulting one or more indexes (see #2048). Concerning the selection of indexes, our only measure of how useful an index is, is the number of conditions it covers.

In the long term, we should focus on selecting indexes by (an estimation of) their selectivity. Consider the following example. Assume we are querying a database of cars and have two indexes:

- `carByModelName` covers the property `name`
- `carBySeatsAndGears` covers the properties `seats` and `gears`

For the query `g.V().hasLabel(car).has('name', 'Edison Model S').has('seats', 5).has('gears', 1)`, the current implementation would prefer the index `carBySeatsAndGears`, because it covers two conditions, in contrast to `carByModelName`, which only covers one. Obviously, it would be more efficient to consult `carByModelName`, which reduces the intermediate result size immensely.

To achieve this, it is necessary to somehow track the selectivity of index queries. This could be done by keeping track of inserted elements by maintaining histograms. An easier (and probably less effective) solution would be to keep track of how selective the last `N` queries to an index were and thereby estimate the selectivity of the `N+1`st query.

In theory, if we can improve the index selection to always find the most selective index, we can achieve another gain in performance by streaming the results of the first index query and matching all remaining conditions on the fly, instead of waiting for the (larger) result sets of other index queries and intersecting them.
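The streaming idea in the last paragraph can be sketched in a few lines: take the result stream of the most selective index and filter the uncovered conditions in memory instead of executing further index queries and intersecting. The `Vertex` model and method names below are simplified stand-ins, not JanusGraph types:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/**
 * Hypothetical sketch: instead of executing every candidate index query and
 * intersecting the result sets, stream the results of the most selective
 * index and check the remaining conditions on the fly.
 */
public class StreamingIndexQuery {
    /** Simplified vertex stand-in: just a property map. */
    record Vertex(Map<String, Object> properties) {}

    /**
     * Stream results of the most selective index (e.g. carByModelName) and
     * filter the equality conditions that index does not cover.
     */
    static List<Vertex> query(Stream<Vertex> mostSelectiveIndexResults,
                              Map<String, Object> remainingConditions) {
        return mostSelectiveIndexResults
                .filter(v -> remainingConditions.entrySet().stream()
                        .allMatch(c -> c.getValue().equals(v.properties().get(c.getKey()))))
                .collect(Collectors.toList());
    }
}
```

In the car example, this would stream the handful of `carByModelName` hits and check `seats` and `gears` per element, never materializing the huge `carBySeatsAndGears` result set.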