apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0
1.12k stars 653 forks source link

LATERAL join #1615

Closed afs closed 1 year ago

afs commented 2 years ago

LATERAL join

Proposed experimental feature.

Work-in-progress: the issue description is being edited in-place.

A LATERAL join is like a foreach loop, looping on the results from the left-hand side (LHS), the pattern before the LATERAL keyword, and executing the right-hand side (RHS) query pattern once for each row, with the variables from the RHS in-scope during each RHS evaluation.

A regular join only executes the RHS once, and the variables from the LHS are only used for the join condition after evaluation of the left and right sub-patterns.

Another way to think of a lateral join is as a flatmap.

Examples:

## Get exactly one label for each subject in a row.
SELECT * {
   ?s ?p ?o
   LATERAL {
     SELECT * { ?s rdfs:label ?label } LIMIT 1
   }
}
## Get zero or one labels for each subject.
SELECT * {
   ?s ?p ?o
   LATERAL { OPTIONAL { SELECT * { ?s rdfs:label ?label } LIMIT 1} }
}

{ OPTIONAL ... is the same as writing { {} OPTIONAL .... { } evaluates to the join identity, a table of one row of zero columns.

Syntax

The LATERAL keyword which has the graph pattern so far (from the { starting the current block) and a { } block afterwards.

Possible addition: LATERAL ( ?var1 ?var2 ...) to specify certain variables to expose to the RHS. Other variables would be (inner)joined as usual. This may be an unnecessary feature.

Scope

A sub-select may have variables of the same name that are not lateral-joined to a variable of the same name from the LHS.

SELECT * {
   ?s ?p ?o
   LATERAL {
     SELECT ?label { ?s rdfs:label ?label } LIMIT 1
   }
}

The inner ?s in the SELECT ?label is not the outer ?s because the SELECT ?label does not pass out ?s. As a sub-query the ?s could be any name except ?label for the same results.

This is the same situation as a sub-query in other situations.

There needs to be a new syntax restriction: there can no variable introduced by AS (BIND, or sub-query) or VALUES in-scope at the top level of the LATERAL RHS, that is the same name as any in-scope variable from the LHS.

## ** Illegal **
SELECT * {
   ?s ?p ?o
   LATERAL { BIND( 123 AS ?o) }
}

See SPARQL Grammar note 12.

In ARQ, LET would work. LET for a variable that is bound acts like a filter.

Evaluation

Substituting variables from the LHS into the RHS (with the same restrictions), then executing the pattern, gives the evaluation of LATERAL

Notes

There is a similarity to filter NOT EXISTS/EXISTS expressed as the not-legal FILTER ( ASK { pattern } ) where the variables of the row being filtered are available to "pattern". This is similar to ab SQL correlated subquery.

Elsewhere

Spec updates

Syntax

LATERAL is added to the SPARQL grammar at rule [[56] GraphPatternNotTriples](https://www.w3.org/TR/sparql11-query/#rGraphPatternNotTriples). As a syntax form, it is similar to OPTIONAL.

[56]    GraphPatternNotTriples    ::=   GroupOrUnionGraphPattern | OptionalGraphPattern | LateralGraphPattern | ...
[57]    OptionalGraphPattern      ::=   'OPTIONAL' GroupGraphPattern
[  ]    LateralGraphPattern   ::=   'LATERAL' GroupGraphPattern

Algebra

The new algebra is operator is lateral which takes two expressions

  SELECT * {
    ?s  ?p  ?o
    LATERAL
      { ?a  ?b  ?c }
}

is translated to:

  (lateral
    (bgp (triple ?s ?p ?o))
    (bgp (triple ?a ?b ?c)))

Evaluation

To evaluate lateral:

Outline:

Definition: Lateral

Let Ω be a multiset of solution mappings. We define:

Lateral(Ω, P) = { μ | union of Ω1 where 
           foreach μ1 in Ω:
               pattern2 = inject(pattern, μ1)
               Ω1 = eval(D(G), pattern2)
           result Ω1
       }

where inject is the corrected substitute operation.

An alternative style is to define Lateral more like "evaluate P such that μ is in-scope" in some way, rather than rely on inject which is a mechanism.

Definition: Evaluation of Lateral

eval(D(G), Lateral(P1, P2) = Lateral(eval(D(G), P1), P2)
kinow commented 2 years ago

Interesting! I learned about lateral joins in a project some time ago, from a developer that was familiar with that concept in Postgres. He explained the same way, that it's similar to a for-loop: https://medium.com/kkempin/postgresqls-lateral-join-bfd6bd0199df

Looking forward to trying it out in Jena.

Tpt commented 2 years ago

Thank you for moving forward with this idea! It would be amazing if we could come up with a compatible feature in Jena and Oxigraph. It might be a good baseline if the SPARQL 1.2 project somehow restarts.

Thank you for the restriction on introduced variables. I have not thought about it.

For sub-selects, I like the idea of bindings only the variables in the SELECT clause and failing if a variable is overridden using AS. Thank you! It looks much better than Oxigraph OX_LATERAL(?s) {SELECT ?s {}}.

A question: I am a bit not at ease with the LATERAL { <pattern> } syntax because LATERAL is modifying pattern <pattern> evaluation, opposite to e.g. OPTIONAL, UNION or MINUS that just explain who the inner pattern should be joint with the previous one. Having e.g. LATERAL OPTIONAL or LATERAL GRAPH would make maybe more sense. What do you think about it?

A side question: how to you plan to represents LATERALs in SPARQL S-Expressions? I am trying to keep Oxigraph as compatible as possible with Jena and Ruby-RDF.

LorenzBuehmann commented 2 years ago

This is a feature many people are missing in SPARQL for a long time, so happy to see it in a long term in the standard and in near time in Jena - we're already making use of the enhanced SERVICE features contributed by @Aklakan in our projects and daily work, so seeing that feature would make SPARQL way more convenient. Who did not try to use SPARQL for something like "give me top X for each Y" like "give me 5 largest cities per country" ...

Minor: In Scope section

A sub-select may have variables of the same name that are not lateral-joined to a variable of the same name from the RHS.

Do you really mean RHS or is it LHS? @afs

afs commented 2 years ago

Do you really mean RHS or is it LHS?

Fixed.

afs commented 2 years ago

Description updated with an outline of integration into section 18 : Definition of SPARQL and section 19 : SPARQL Grammar.

afs commented 2 years ago

@tpt - thanks for the comments.

In the OX_LATERAL ?v1 ?v2 those are inputs to the evaluation/ What appens if the evaluation retuns say ?x and ?x is also in the LHS? Does it do an inner join at that point?

how to you plan to represents LATERALs in SPARQL S-Expressions?

The current design is to have a operator in the algebra lateral. I'll PR a working branch (it's in my development clone ATM).

While it is related to sequence of one or two elements, my preference is to have a separate operator which can have it's own definition. sequence is specific to the ARQ optimizer and implies certain conditions on its arguments.

LATERAL is modifying pattern evaluation, opposite to e.g. OPTIONAL, UNION or MINUS

An existing example that is not so different is FILTER ( NOT EXISTS { ... } ).

At the moment, I'm looking to what are the fundamental operations. Is there one algebra operator needed? For syntax, it may be worth while having syntax forms for that translate to this/these funamental operations.

May be LATERAL should be FOREACH, EACH, or LOOP. A more direct statement of what it does in SPARQL++ rather than the technical sense of "variables flowing laterally into an pattern evaluation".

FOR { } DO { } breaks up the query appearance and the sense that it is adding variable bindings to previous pattern. c.f. BIND and OPTIONAL. It is not the style (prefix function) elsewhere and if nested the outer is the last not the first part.

In PostgreSQL, LATERAL seems to be a modifier to one of the join operations. To some, people, it's not then a join operator but a web search finds "lateral join" such usage. It is still a "combining operator".

In that style, there could be OPTIONAL LATERAL, c.f. LEFT JOIN LATERAL, or reversed which IMO reads better in SPARQL, but wouldn't here still be LATERAL {... because in SPARQL INNER JOIN is the empty string :-).

It is not a strict functional from (but again, this is not the first in SPARQL - as well as EXISTS, there are several expressions that are not strict functions; IF, COALESCE, BOUND and some others).

From the point of view of fundamental operations, LATERAL OPTIONAL looks to be the same as LATERAL { OPTIONAL { ... } } but its early days and more investigation necessary.

Do you have some examples we can explore?

If this is the case, and these are common, these can be syntax forms that translate to the same lateral algebra operator.

Digression: It would be nice to have "SELECT-less" (sub)queries: { ?s ?p ?o } LIMIT 1 But one thing at a time.

Aklakan commented 2 years ago

Possible addition: LATERAL ( ?var1 ?var2 ...)

For interoperability, it would be good if there was agreement for whether variable lists are unsupported, optional or mandatory. They are not strictly necessary, but someone may find this useful e.g. for extra control when copy/pasting a long graph pattern into the rhs of a lateral join - or maybe even just for clarity. Conversely, optional variable lists would give more brevity especially when dealing with simple graph patterns.

To me it seems an 'optional' variable list would be most convenient to use.

Also, proper lateral join support would solve an issue with the service enhancer which @LorenzBuehmann found out:

So proper lateral is very welcome!

Tpt commented 2 years ago

The current design is to have a operator in the algebra lateral. I'll PR a working branch (it's in my development clone ATM). While it is related to sequence of one or two elements, my preference is to have a separate operator which can have it's own definition. sequence is specific to the ARQ optimizer and implies certain conditions on its arguments.

Thank you! It definitely make sense.

LATERAL is modifying pattern evaluation, opposite to e.g. OPTIONAL, UNION or MINUS An existing example that is not so different is FILTER ( NOT EXISTS { ... } ).

That's a good point! Indeed, we have to already introduced the substitution operation for FILTER NOT EXISTS so having it inside of the "lateral" operator definition make sense. This gives a good definition of LATERAL { and, so, it make sense to have it as a "standalone" operator and not as a modifier of "optional", "graph"... This is the piece I was missing. I am now convinced by your proposal and it looks like the best design I have seen at the moment.

Digression: It would be nice to have "SELECT-less" (sub)queries: { ?s ?p ?o } LIMIT 1

Yes, to avoid the SELECT * but it's minor syntactic sugar imho.

afs commented 2 years ago
  • OPTIONAL { SERVICE { X } } becomes OpConditional which results in an exception during algebra-to-syntax reversal in OpAsQuery.

Unrelated.

Whether perfect reversal is possible isn't clear - some kind of reversal isn't looking too hard but there's an interaction with LeftJoin conditions (expressions) so query equality round-trip is unlikely to be a simple matter to provide.

You'd maybe better working with unoptimized algebra. Ultimately: optimized => less likely OpAsQuery can reverse an algebra expression.

rubensworks commented 2 years ago

So if I understand correctly, LATERAL is like an (inner) join, but where top-down execution (LHS->RHS) is enforced, as opposed to the default bottom-up evaluation when working with sub-queries.

In PostgreSQL, LATERAL seems to be a modifier to one of the join operations. To some, people, it's not then a join operator but a web search finds "lateral join" such usage. It is still a "combining operator".

In that style, there could be OPTIONAL LATERAL, c.f. LEFT JOIN LATERAL, or reversed which IMO reads better in SPARQL, but wouldn't here still be LATERAL {... because in SPARQL INNER JOIN is the empty string :-).

Applying this approach on different logical join types definitely makes sense to me. So considering LATERAL a join modifier seems useful.

namedgraph commented 2 years ago

Does it mean that "SPARQL is evaluated bottom-up" would not be true anymore?

Aklakan commented 2 years ago

Does it mean that "SPARQL is evaluated bottom-up" would not be true anymore?

Yes, lateral adds the feature "For each binding on the left hand side substitute on the right-hand-side and only then evaluate it". So it adds "left-to-right" evaluation. To clarify: the lhs is first evaluated bottom up as usual, then for each lhs-result the rhs is substituted accordingly, and then the substituted rhs is also evaluated bottom up as usual - so it's NOT about overthrowing the semantics of sparql.

afs commented 2 years ago

Does it mean that "SPARQL is evaluated bottom-up" would not be true anymore?

Have to be careful here.

It's the operator that determines the evaluation, there isn't some policy for the whole expression. Just most current algebra operators are depth first evaluation (AKA functions) and we all say "evaluated bottom-up".

(join A B) is bottom up because join says "evaluate the arguments separately then join the resulting tables". It is the "evaluate the arguments then ..." which makes it a well-behaved function; it is doing a depth first walk "bottom-up".

(lateral A B) says "eval A, then loop on its rows, evaluating B such that the row from A is available for variables". Defining the "row being available to A" as careful injection means eval A[row] is normal SPARQL evaluation, not a special case for inside LATERAL.

The proposal is that the row is injected (by the corrected substitute operation) - the variable name is still there but it's binding is fixed by having a BIND just before it. There are places that require a variable work e.g. SELECT ?var or FILTER(bound(?var)) where replacing a variable by it's value fails.


There is a discussion point about whether "eval B with row from A" should or should not use the in-scope rules for variables:

SELECT * {
   ?s ?p ?o
   LATERAL {
     SELECT ?label { ?s rdfs:label ?label } LIMIT 1
   }
}

Does the ?s in { ?s rdfs:label ?label } connect to the ?s before the LATERAL?

From a SPARQL POV, that sub-query can otherwise be SELECT ?label { ?z rdfs:label ?label } LIMIT 1 or SELECT ?label { [] rdfs:label ?label } LIMIT 1 for the same results. ?label is unrelated to the LHS triple because ?s isn't in the SELECT.

SELECT ?s ?label { ?s rdfs:label ?label } LIMIT 1 does make the ?s see the ?s of ?s ?p ?o. Ditto SELECT * { ?s rdfs:label ?label } LIMIT 1.

Just for LATERAL, it could be "no scope rules" and the inner ?s does see the LHS ?s.

At the moment, I'm more inclined to the scoping version so that there isn't a eval special case of "inside LATERAL" and making developing big queries piece-by-piece more predicable (arguably), but it does cause a "surprise" case. Another reason is that special cases tend to have complicated consequences.

When/if we have query template and parameterization, unconditionally replacing ?s by an RDF term makes sense and easier for users to comprehend.

Aklakan commented 2 years ago

Does the ?s in { ?s rdfs:label ?label } connect to the ?s before the LATERAL?

I'd say "no scope rules" (i.e. unconditional substitution) makes more sense when there is just LATERAL because otherwise aggregations would be complex to write. For example, I think that the query below should give for each class the capped count of its instances:

#Q1:
SELECT * {
  ?type a owl:Class
  LATERAL { SELECT (COUNT(*) AS ?cappedInstanceCount) { SELECT * { ?i a ?type } LIMIT 10000 } }
}

# Optional SELECT would be nice here too:
# LATERAL { SELECT (COUNT(*) AS ?cappedInstanceCount) { { ?i a ?type } LIMIT 10000 } }

What good would do LATERAL if that wasn't the case?

Otherwise, if scoping rules were applied then one would have to weave in the lateral-joining variables into the aggregations - which imho is needlessly cumbersome:

#Q2:
SELECT * {
  ?type a owl:Class
  LATERAL {
    SELECT ?type (COUNT(*) AS ?cappedInstanceCount) { SELECT * { ?i a ?type } LIMIT 10000 } GROUP BY ?type
  }
}
afs commented 2 years ago

Those two queries don't do the same thing :-) The only addition is the ?type in SELECT ?type (... as you would need that if its a plain join. c.f. OxiGraph's LATERAL(vars). The GROUP BY is orthogonal.

The variable isn't called ?type - it's called ?/type - it will have been renamed to a safe global name to avoid it picking up the outer one. If inside another subquery it's ?//type. c.f programming: Global variables vs function parameters.

What happens if ?type is bound in some rows of the LHS and not others?

For users, copying in a standalone query fragment isn't necessary going to work for them because LATERAL can change the results. Lack of predictability limits other extensions. e.g. Named tables - results are calculated outside the LATERAL, so results are with-scoping, but put that query text inside the LATERAL (user intuition - named tables are a way to avoid repeated copies o query patterns ) and you can get different results.

There isn't a perfect answer here. Making LATERAL change the deep scoping rules, and is not just the evaluation of new operator will get complicated, will affect transformations and optimizations and will have surprises.

Aklakan commented 2 years ago

The variable isn't called ?type - it's called ?/type

Well yes, the good thing with the way the Jena implementation handles variable scopes with / makes it easy to define a originalName function that maps e.g. /type and ///type to type.

This way, one can add variable scope alignment as an extra step in the lateral definition - something along the lines of

[[Lateral(L, R)]]_D := { μ | union of Ω1 where 
           visibleLhsVars := visibleVars(L)
           varScopeAlignedPattern := alignVarScopes(R, visibleLhsVars)
           Ω =  [[L]]_D
           foreach μ1 in Ω:
               pattern2 = inject(varAlignedPattern, μ1)
               Ω1 = { μ1 U μ2 | μ2 in [[pattern2]]_D }
           result Ω1
       }

with 
  alignVarScopes(R, varsToAlign) := return a new pattern by susbstituting
                                    each variable v in pattern R with alignVar(v)
where
  alignVar(v, varsToAlign) := if exists w in varsToAlign with originalName(w) = originalName(v)
                              then w; otherwise v

But translating this into something compatible with the sparql spec which only has a notion of in-scope seems painful :/

What happens if ?type is bound in some rows of the LHS and not others?

From a substitution perspective I'd say it is simply not substituted - so in the example below everything would be fetched if ?o was unbound. A binding cannot contain a variable with a "null" value - so the binding on the lhs without ?o and that of the rhs with ?o bound to something would be compatible.

?s a :Foo
OPTIONAL { ?s :relatedTo ?o }
LATERAL { ?o ?x ?y } // If ?o is unbound on the lhs then it's not substituted here - so ?o remains ?o.

Of course one could also argue that the rhs is simply not evaluated for such a binding in that case - but to me that seems more like a special rule then.

afs commented 2 years ago

the Jena implementation handles variable scopes with / makes it easy to define a originalName function that maps e.g. /type and ///type to type.

Not quite. The LATERAL may itself be sub-query nested so it's ?var has been renamed ?/var.

Aklakan commented 2 years ago

The LATERAL may itself be sub-query nested so it's ?var has been renamed ?/var.

Yes, but the variable scope alignment always aligns the rhs with what is present on the lhs based on the original name. If a LATERAL's lhs has ?///s and its rhs has both ?////s and ?///////s then alignment on the rhs substitutes both latter with the former - because the original name is ?s in all cases.

afs commented 1 year ago

https://github.com/apache/jena/issues/1615#issuecomment-1312732122

which results in an exception during algebra-to-syntax

See https://github.com/apache/jena/issues/1628