Open pfps opened 1 week ago
@pfps Thanks for another monster query! This variant of the query does the job. It takes around 10 minutes (because it has to produce the complete wdt:P317/wdt:P279*
, which has 3.7 B results and is expensive to compute), but it works and with little RAM at that:
SELECT * WHERE {
{ SELECT DISTINCT ?s WHERE { ?s wdt:P31/wdt:P279* ?o . FILTER (?s = ?o) } }
OPTIONAL { s rdfs:label ?s_label FILTER (LANG(?s) = "en") }
}
https://qlever.cs.uni-freiburg.de/wikidata/saPuxK
Your ?c wdt:P31/wdt:P279* ?c
should actually be treated just like ?c wdt:P31/wdt:P279* ?tmp . FILTER (?c = ?tmp)
. We will investigate why QLever does not do that.
I have a sequence of queries that try to do similar things, which I ran over the summer when looking for issues in the Wikidata ontology. These queries look for instance loops, i.e, a class that is an instance of itself when the intended meaning of P279 (instances of subclasses are also instances) is taken into account.
The desired query is:
SELECT DISTINCT ?c ?cLabel WHERE { ?c wdt:P31/wdt:P279* ?c . OPTIONAL { ?c rdfs:label ?cLabel . FILTER ( lang(?cLabel) = 'en' ) } }
but that triggers the empty path error.
A closely related query that sidesteps the empty path message is:
SELECT DISTINCT ?c ?cLabel WHERE { ?c wdt:P31/wdt:P279+ ?c . OPTIONAL { ?c rdfs:label ?cLabel . FILTER ( lang(?cLabel) = 'en' ) } }
This runs in 22 seconds but might not include all the results I want because it excludes the empty subclass path.
The query that I created to include the empty subclass path is:
EDIT: The query I had here was a slow one. The fast query is as shown now.
SELECT DISTINCT ?c ?cLabel WHERE { { ?c wdt:P31/wdt:P279+ ?c } UNION { ?c wdt:P31 ?c } OPTIONAL { ?c rdfs:label ?cLabel . FILTER ( lang(?cLabel) = 'en' ) } }
This query runs in 24 seconds, much faster than your version.
I'm going to try various variants of this query and see how fast they run.
There is a very large speed difference between
SELECT DISTINCT ?c WHERE {
?c wdt:P31/wdt:P279+ ?c .
}
https://qlever.cs.uni-freiburg.de/wikidata/64XbsS
and
SELECT DISTINCT ?c WHERE {
?c wdt:P31/wdt:P279+ ?s . FILTER ( ?c = ?s )
}
https://qlever.cs.uni-freiburg.de/wikidata/69cPqa
I seem to remember that the second query used to be reasonably fast, but I may be remembering wrong.
@pfps If you execute the queries and click on "Analysis", you see the difference:
The first query uses a multi-column join, which is reasonably fast in this case because both wdt:P31
and wdt:P279+
have a manageable size, namely around 120M rows each.
The second query lazily produces the complete wdt:P31/wdt:P279+
result, which has over 3 billion rows, and then filters it.
Of course, QLever could recognize that these two queries are equivalent and pick the query plan that is faster to execute. This would be easy to fix for this particular kind of query, but it is an extremely hard problem in general.
It is important to note that these are very particular queries.
The query
https://qlever.cs.uni-freiburg.de/wikidata-test/SBRpDD
SELECT DISTINCT ?c ?cLabel WHERE { { ?c wdt:P31/wdt:P279* ?c } OPTIONAL { ?c rdfs:label ?cLabel . FILTER ( lang(?cLabel) = 'en' ) } }
results in the message
This query might have to evaluate the empty path, which is currently not supported. In file "/home/local/qlever/qlever-code/src/engine/TransitivePathImpl.h " at line 158
But it doesn't appear that there is an empty path in this query.