MillenniumDB / WDBench

Benchmark resources
10 stars 5 forks source link

Queries with graph patterns that don't have a variable in common #4

Open hannahbast opened 8 months ago

hannahbast commented 8 months ago

We just stumbled upon this benchmark and we wondered:

Is it intentional that several of the queries decompose into parts that have no variables in common? For example, the query in line 4 of https://github.com/MillenniumDB/WDBench/blob/master/Queries/multiple_bgps.txt is:

?x1 <http://www.wikidata.org/prop/direct/P1010> ?x2 .
?x3 <http://www.wikidata.org/prop/direct/P18> ?x4 . 

That is, to compute the query, you need to compute the Cartesian product, which is huge (17,553 * 5,174,492 = 90,827,858,076 rows on the current version of the complete Wikidata). Of course, it's a valid SPARQL query, but it doesn't really make sense.

DomagojVrgoc commented 8 months ago

These were posted by the users. Notice that we sometimes prune some parts of the query (e.g. SERVICE), but this should not result in a cross product.

Regarding the utility of these queries; on the one hand, one could argue that when coupled with LIMIT these test that an engine does not materialize results when this is not needed.

On the other hand, we agree with your conclusion that such queries are artificially difficult. In fact, we are currently working on WDBench 1.1 which cleans up the query set to remove the cross product. If you check the github, there is a "remove_cross_product" branch which should not contain cross product queries (nor any disconnected queries which should force a cross product). For us, this branch is the most up to date version of the benchmark, but it is not published in a systematic way (it will be soon).

hannahbast commented 8 months ago

Thanks for the quick response. I just checked https://github.com/MillenniumDB/WDBench/blob/remove_cross_product/Queries/opts.txt and there is still a query, which requires the computation of the Cartesian product, namely in line 10:

?var1 <http://www.wikidata.org/prop/direct/P106> <http://www.wikidata.org/entity/Q82955> .
?var2 <http://www.wikidata.org/prop/direct/P106> <http://www.wikidata.org/entity/Q82955> .
OPTIONAL { ?var1 <http://www.wikidata.org/prop/direct/P22> ?var2 . } 

The query looks connected because both ?var1 and ?var2 occur in the OPTIONAL clause. But because it's OPTIONAL, you still need to compute the Cartesian product of the first two patterns.

There are also two queries with huge results for other reasons, namely the query in line 84 (1.5B rows) and the query in line 135 (0.9B rows). I think it's OK to include such queries, even if they are not particularly meaningful because users frequently formulate queries that are not particularly meaningful. However, the average is a misleading measure to report because few outliers can distort the average significantly. Percentiles are more robust, a standard would be 50%-ile (which you already report), 95%-ile, 99%-ile, max.

DomagojVrgoc commented 8 months ago

The query you mention is actually connected by our definition. Namely, we say that a BGP pattern is connected if when forming a graph that contains all the variables and constants appearing in subject or object position (and edges for them); the resulting graph is connected. Basically, for us Q82955 allows the query to be "connected", so no cross product is needed necessarily. I agree that this is somewhat debatable; particularly if one takes the variable only approach to query planning.

Regarding the large number of results, I'd like to point out that we recommend queries to be run with a 100k limit. When reporting results we only consider executing them in this manner. It does not come across in the patterns, since they do not contain full queries, so I understand that this can be easily missed. If you take a look at our scripts, you can see that they wrap the pattern inside of a SELECT * WHERE {pattern} LIMIT 100000 query.

I also agree that average is misleading. In the publications we also provide box plots, so some quartiles are visible. Adding 95% or 99% is actually a good idea.