optional property paths (*, ?) are very slow

VladimirAlexiev commented 7 years ago

(Created from OWLIM-1104).

Prop paths using optional constructs (p? and p*) are extremely slow. These paths alone are necessarily slow because they must return all nodes. But nobody uses them alone, people always use them in combination with a non-optional path. So one has to use workarounds, eg rewrite q/p? to q|q/p, to get good perfomance.

Use cases:

In Getty queries Vladimir Alexiev uses many rewritten patterns like this:

gvp:placeType/gvp:broaderGenericExtended? ->
gvp:placeType|(gvp:placeType/gvp:broaderGenericExtended)

In OWLIM-2627 Barry Norton suggests a similar but wordier workaround for BM data:
```
?x q/p? ?y ->
{?x q ?y} union {?x q ?z . ?z p ?y}
```
SHACL uses rdf:List for many of its constructs (sometimes unnecessarily) and uses rdf:rest*/rdf:first to unroll the list, eg see ClosedConstraintComponent. Unless fixed, one can't use the SHACL SPARQL Definitions to implement SHACL in rdf4j
Another typical example is poor man's RDFS inference: rdf:type/rdfs:subClassOf*
American Art Collaborative discusses using rdf:List for "people depicted, with order". In this case a natural query would be crm:P62_depicts/(rdf:rest*/rdf:first)? but should be rewritten to crm:P62_depicts | crm:P62_depicts/rdf:first | crm:P62_depicts/rdf:rest+/rdf:first

It seems to me that this can be fixed by internally rewriting to an alternative (union) path: one without the optional property and another with it.

p?/q -> q|p/q
p*/q -> q|p+/q
q/p? -> q|q/p
q/p* -> q|q/p+

Implementaton suggestion:

add a path characteristic "positive" (cannot be empty) vs "non-negative" (can be empty); the names are by analogy with integers.
design rewrite rules to produce alternatives where the positive path is always present, but the non-negative path alternates

More examples:

p?/q? -> ()|p|q|p/q

This alone will be slow since it's non-negative (() is the bad part), but when combined with a positive path, it can be made fast again, eg:

p?/q?/r ->
  r | (()|p|q|p/q)/r ->
  r | r|p/r|q/r|p/q/r ->
  r|p/r|q/r|p/q/r

VladimirAlexiev commented 7 years ago

necessarily slow

The reason is that ?x prop* ?y must return EVERY node: every node is connected by a zero-length path of any prop to itself.

VladimirAlexiev commented 7 years ago

Some more thoughts on implementation.

First compute path nature (POS, NON) (see prop paths):

iri is POS
^elt is same
elt1 / elt2 is POS if either one is POS, else NON
elt1 | elt2 is POS if both are POS, else NON
elt* is NON
elt+ is same
elt? is NON
!whatever is POS (it's a path of length one)
(elt) is same

Then the following rewrite rules are executed repeatedly. pX are POS paths, nX are NON paths, qX are either, () is the empty path (which is NON).

Shorthands The 2010 prop path language has some extras that boil down to: elt1 ^ elt2 is shorthand for elt1 / ^elt2 elt{n} is shorthand for elt/.../elt (n times without ?) elt{,n} is shorthand for elt?/.../elt? (n times with ?) elt{n,} is shorthand for elt/.../elt / elt* (n times without ? then one with *) elt{n,m} is shorthand for elt/.../elt / elt?/.../elt? (n without ? then m-n with ?)
Prioritize Push POS paths to the left of alternatives (I assume they are evaluated left to right). This matters for LIMIT queries. n|p -> p|n
Expand Alternatives Push alternatives to the top level. ^(q1|q2) -> ^q1 | ^q2 q1/(q2|q3) -> q1/q2 | q1/q3 (q1|q2)/q3 -> q1/q3 | q2/q3 (q1|q2)* -> q1* | q2* (q1|q2)+ -> q1+ | q2+ (q1|q2)? -> q1? | q2? q* -> q+ | () q? -> q | ()
Simplify q|...|q -> q|... (where ... is any number of alternatives) q/() -> q ()/q -> q
Simplify iterations The expansion could create some consecutive iteration symbols that should also be simplified. But because such are allowed in the original path, I guess there already is such simplification: q** -> q* q*+ -> q* q+* -> q* q++ -> q+ q*? -> q* q?* -> q* q+? -> q? q?+ -> q? q?? -> q?

I think that's all that's needed.

VladimirAlexiev commented 7 years ago

Eg if you try the rewrites on Use Case 5 above, you get this sequence crm:P62_depicts/(rdf:rest*/rdf:first)? expand crm:P62_depicts/(rdf:rest+|())/rdf:first)? expand crm:P62_depicts | crm:P62_depicts/(rdf:rest+|())/rdf:first) expand crm:P62_depicts | crm:P62_depicts/(rdf:rest+/rdf:first | ()/rdf:first) expand crm:P62_depicts | crm:P62_depicts/rdf:rest+/rdf:first | crm:P62_depicts/()/rdf:first simplify crm:P62_depicts | crm:P62_depicts/rdf:rest+/rdf:first | crm:P62_depicts/rdf:first

This is correct and fast enough. But the most natural way to write it is the fastest, because it finds the shortest paths first (and returns elements in the desired order): crm:P62_depicts | crm:P62_depicts/rdf:first | crm:P62_depicts/rdf:rest+/rdf:first

So I think we need these changes:

Prioritize q|() -> ()|q n|p -> p|n only if n != ()

and then:

Eliminate empties ()|... -> ... but only at top level This means that a query ?x p* ?y will be treated as ?x p+ ?y and will NOT return every node. This may fail some conformance test, but is exactly the wanted behavior (whoever wants to find all nodes in a repo?)

abrokenjester commented 7 years ago

Possibly related is issue #689 , which turned out to be partly caused by a bug in binding retention, but also a related query optimization issue. We've improved query optimization for arbitrary-length paths in RDF4J 2.1.4 to the effect that the optimizer will almost always execute the zero-or-more part last (leading to most variables being bound already and therefore far less options to search through). I have not yet looked at your analysis in sufficient detail to figure out whether that corresponds with your findings, but in practical terms it might be good to try your slow queries with the new planner, and see if there's an improvement.

VladimirAlexiev commented 7 years ago

In addition to performance, it's important to return the shortest-path results first, because that is the expected order (e.g. for list unrolling rdf:rest*/rdf:first).

catch-point commented 7 years ago

I confirmed the issue exists in RDF4J 2.1.3, 2.1.6, and 2.2 Using the dataset http://vocab.getty.edu/dataset/aat/full.zip I had results similar to the following on each. prefix gvp:http://vocab.getty.edu/ontology# ask { ?x gvp:broader|(gvp:broader/gvp:broaderGenericExtended) ?y } # ~25ms ask { ?x gvp:broader/gvp:broaderGenericExtended? ?y } # ~2500ms

abrokenjester commented 7 years ago

It's been a while since I looked at this but we should investigate how the planner orders execution of the path elements and how it ensures that as much as possible, it executes paths with bound start or end points first.

As for output ordering: a SPARQL query result is unordered by nature, so any ordering that you rely on (other than an externally imposed one using an ORDER BY clause) is bound to give you problems in the future.

abrokenjester commented 7 years ago

Also surprised that the issue still exists post 2.1.4 (and the fix for #689). Might be worth checking if that fix was somehow accidentally not included in later releases.

VladimirAlexiev commented 7 years ago

As for output ordering: a SPARQL query result is unordered by nature

It won't hurt to have a sensible order, esp for rdf:List

people who use rdf:rest*/rdf:first will get a nasty surprise if it's not
I would guess that Jena's list:member https://jena.apache.org/documentation/query/extension.html#property-functions returns them in order

eclipse-rdf4j / rdf4j

optional property paths (*, ?) are very slow #695