Closed VladimirAlexiev closed 7 years ago
necessarily slow
The reason is that ?x prop* ?y
must return EVERY node:
every node is connected by a zero-length path of any prop
to itself.
Some more thoughts on implementation.
First compute path nature (POS, NON) (see prop paths):
iri
is POS^elt
is sameelt1 / elt2
is POS if either one is POS, else NONelt1 | elt2
is POS if both are POS, else NONelt*
is NONelt+
is sameelt?
is NON!whatever
is POS (it's a path of length one)(elt)
is sameThen the following rewrite rules are executed repeatedly.
pX
are POS paths, nX
are NON paths, qX
are either, ()
is the empty path (which is NON).
Shorthands
The 2010 prop path language has some extras that boil down to:
elt1 ^ elt2
is shorthand for elt1 / ^elt2
elt{n}
is shorthand for elt/.../elt
(n
times without ?
)
elt{,n}
is shorthand for elt?/.../elt?
(n
times with ?
)
elt{n,}
is shorthand for elt/.../elt / elt*
(n
times without ?
then one with *
)
elt{n,m}
is shorthand for elt/.../elt / elt?/.../elt?
(n
without ?
then m-n
with ?
)
Prioritize
Push POS paths to the left of alternatives (I assume they are evaluated left to right). This matters for LIMIT queries.
n|p -> p|n
Expand Alternatives
Push alternatives to the top level.
^(q1|q2) -> ^q1 | ^q2
q1/(q2|q3) -> q1/q2 | q1/q3
(q1|q2)/q3 -> q1/q3 | q2/q3
(q1|q2)* -> q1* | q2*
(q1|q2)+ -> q1+ | q2+
(q1|q2)? -> q1? | q2?
q* -> q+ | ()
q? -> q | ()
Simplify
q|...|q -> q|...
(where ...
is any number of alternatives)
q/() -> q
()/q -> q
Simplify iterations
The expansion could create some consecutive iteration symbols that should also be simplified. But because such are allowed in the original path, I guess there already is such simplification:
q** -> q*
q*+ -> q*
q+* -> q*
q++ -> q+
q*? -> q*
q?* -> q*
q+? -> q?
q?+ -> q?
q?? -> q?
I think that's all that's needed.
Eg if you try the rewrites on Use Case 5 above, you get this sequence
crm:P62_depicts/(rdf:rest*/rdf:first)?
expand
crm:P62_depicts/(rdf:rest+|())/rdf:first)?
expand
crm:P62_depicts | crm:P62_depicts/(rdf:rest+|())/rdf:first)
expand
crm:P62_depicts | crm:P62_depicts/(rdf:rest+/rdf:first | ()/rdf:first)
expand
crm:P62_depicts | crm:P62_depicts/rdf:rest+/rdf:first | crm:P62_depicts/()/rdf:first
simplify
crm:P62_depicts | crm:P62_depicts/rdf:rest+/rdf:first | crm:P62_depicts/rdf:first
This is correct and fast enough. But the most natural way to write it is the fastest, because it finds the shortest paths first (and returns elements in the desired order):
crm:P62_depicts | crm:P62_depicts/rdf:first | crm:P62_depicts/rdf:rest+/rdf:first
So I think we need these changes:
q|()
-> ()|q
n|p
-> p|n
only if n != ()
and then:
()|... -> ...
but only at top level
This means that a query ?x p* ?y
will be treated as ?x p+ ?y
and will NOT return every node. This may fail some conformance test, but is exactly the wanted behavior (whoever wants to find all nodes in a repo?)Possibly related is issue #689 , which turned out to be partly caused by a bug in binding retention, but also a related query optimization issue. We've improved query optimization for arbitrary-length paths in RDF4J 2.1.4 to the effect that the optimizer will almost always execute the zero-or-more part last (leading to most variables being bound already and therefore far less options to search through). I have not yet looked at your analysis in sufficient detail to figure out whether that corresponds with your findings, but in practical terms it might be good to try your slow queries with the new planner, and see if there's an improvement.
In addition to performance, it's important to return the shortest-path results first,
because that is the expected order (e.g. for list unrolling rdf:rest*/rdf:first
).
I confirmed the issue exists in RDF4J 2.1.3, 2.1.6, and 2.2 Using the dataset http://vocab.getty.edu/dataset/aat/full.zip I had results similar to the following on each. prefix gvp:http://vocab.getty.edu/ontology# ask { ?x gvp:broader|(gvp:broader/gvp:broaderGenericExtended) ?y } # ~25ms ask { ?x gvp:broader/gvp:broaderGenericExtended? ?y } # ~2500ms
It's been a while since I looked at this but we should investigate how the planner orders execution of the path elements and how it ensures that as much as possible, it executes paths with bound start or end points first.
As for output ordering: a SPARQL query result is unordered by nature, so any ordering that you rely on (other than an externally imposed one using an ORDER BY clause) is bound to give you problems in the future.
Also surprised that the issue still exists post 2.1.4 (and the fix for #689). Might be worth checking if that fix was somehow accidentally not included in later releases.
As for output ordering: a SPARQL query result is unordered by nature
It won't hurt to have a sensible order, esp for rdf:List
rdf:rest*/rdf:first
will get a nasty surprise if it's notlist:member
https://jena.apache.org/documentation/query/extension.html#property-functions returns them in order
(Created from OWLIM-1104).
Prop paths using optional constructs (
p?
andp*
) are extremely slow. These paths alone are necessarily slow because they must return all nodes. But nobody uses them alone, people always use them in combination with a non-optional path. So one has to use workarounds, eg rewriteq/p?
toq|q/p
, to get good perfomance.Use cases:
In Getty queries Vladimir Alexiev uses many rewritten patterns like this:
In OWLIM-2627 Barry Norton suggests a similar but wordier workaround for BM data:
SHACL uses rdf:List for many of its constructs (sometimes unnecessarily) and uses
rdf:rest*/rdf:first
to unroll the list, eg see ClosedConstraintComponent. Unless fixed, one can't use the SHACL SPARQL Definitions to implement SHACL in rdf4jAnother typical example is poor man's RDFS inference:
rdf:type/rdfs:subClassOf*
American Art Collaborative discusses using
rdf:List
for "people depicted, with order". In this case a natural query would becrm:P62_depicts/(rdf:rest*/rdf:first)?
but should be rewritten tocrm:P62_depicts | crm:P62_depicts/rdf:first | crm:P62_depicts/rdf:rest+/rdf:first
It seems to me that this can be fixed by internally rewriting to an alternative (union) path: one without the optional property and another with it.
Implementaton suggestion:
More examples:
This alone will be slow since it's non-negative (
()
is the bad part), but when combined with a positive path, it can be made fast again, eg: