ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0
424 stars 52 forks source link

incorrect empty path determination? #1628

Open pfps opened 1 week ago

pfps commented 1 week ago

The query

https://qlever.cs.uni-freiburg.de/wikidata-test/SBRpDD

SELECT DISTINCT ?c ?cLabel WHERE { { ?c wdt:P31/wdt:P279* ?c } OPTIONAL { ?c rdfs:label ?cLabel . FILTER ( lang(?cLabel) = 'en' ) } }

results in the message

This query might have to evaluate the empty path, which is currently not supported. In file "/home/local/qlever/qlever-code/src/engine/TransitivePathImpl.h " at line 158

But it doesn't appear that there is an empty path in this query.

hannahbast commented 6 days ago

@pfps Thanks for another monster query! This variant of the query does the job. It takes around 10 minutes (because it has to produce the complete wdt:P317/wdt:P279*, which has 3.7 B results and is expensive to compute), but it works and with little RAM at that:

SELECT * WHERE {
  { SELECT DISTINCT ?s WHERE { ?s wdt:P31/wdt:P279* ?o . FILTER (?s = ?o) } }
  OPTIONAL { s rdfs:label ?s_label FILTER (LANG(?s) = "en") }
}

https://qlever.cs.uni-freiburg.de/wikidata/saPuxK

Your ?c wdt:P31/wdt:P279* ?c should actually be treated just like ?c wdt:P31/wdt:P279* ?tmp . FILTER (?c = ?tmp). We will investigate why QLever does not do that.

pfps commented 6 days ago

I have a sequence of queries that try to do similar things, which I ran over the summer when looking for issues in the Wikidata ontology. These queries look for instance loops, i.e, a class that is an instance of itself when the intended meaning of P279 (instances of subclasses are also instances) is taken into account.

The desired query is:

SELECT DISTINCT ?c ?cLabel WHERE { ?c wdt:P31/wdt:P279* ?c . OPTIONAL { ?c rdfs:label ?cLabel . FILTER ( lang(?cLabel) = 'en' ) } }

but that triggers the empty path error.

A closely related query that sidesteps the empty path message is:

SELECT DISTINCT ?c ?cLabel WHERE { ?c wdt:P31/wdt:P279+ ?c . OPTIONAL { ?c rdfs:label ?cLabel . FILTER ( lang(?cLabel) = 'en' ) } }

This runs in 22 seconds but might not include all the results I want because it excludes the empty subclass path.

The query that I created to include the empty subclass path is:

EDIT: The query I had here was a slow one. The fast query is as shown now.

SELECT DISTINCT ?c ?cLabel WHERE { { ?c wdt:P31/wdt:P279+ ?c } UNION { ?c wdt:P31 ?c } OPTIONAL { ?c rdfs:label ?cLabel . FILTER ( lang(?cLabel) = 'en' ) } }

This query runs in 24 seconds, much faster than your version.

I'm going to try various variants of this query and see how fast they run.

pfps commented 6 days ago

There is a very large speed difference between

SELECT DISTINCT ?c WHERE { 
  ?c wdt:P31/wdt:P279+ ?c . 
}

https://qlever.cs.uni-freiburg.de/wikidata/64XbsS

and

SELECT DISTINCT ?c WHERE { 
  ?c wdt:P31/wdt:P279+ ?s . FILTER ( ?c = ?s )
}

https://qlever.cs.uni-freiburg.de/wikidata/69cPqa

I seem to remember that the second query used to be reasonably fast, but I may be remembering wrong.

hannahbast commented 4 days ago

@pfps If you execute the queries and click on "Analysis", you see the difference:

  1. The first query uses a multi-column join, which is reasonably fast in this case because both wdt:P31 and wdt:P279+ have a manageable size, namely around 120M rows each.

  2. The second query lazily produces the complete wdt:P31/wdt:P279+ result, which has over 3 billion rows, and then filters it.

Of course, QLever could recognize that these two queries are equivalent and pick the query plan that is faster to execute. This would be easy to fix for this particular kind of query, but it is an extremely hard problem in general.

It is important to note that these are very particular queries.