Open JervenBolleman opened 1 year ago
Ok, I got it to parse. But it has many more changes to the JavaCC generated code than I wanted to have. Still ok for a prototype. Next step evaluation :)
Just because I am a bit curious:
Is this something that could also be achieved by using the bind
operator after the left-hand side to force the evaluation?
I am also interested in controlling the evaluation order because we are developing integrations with time-series databases where we first query structural data from an RDF store and then related time-series data: https://github.com/linkedfactory/specification
@kenwenzel I don't think you can use bind to get to the same behavior.
Anyway this is to point of having failing test cases. Everyone is welcome to poke at this and have a go at improving it.
@frensjan could you have a look at this. You seem to have a good grasp of the grammar and how it is supposed to work.
It's on my 'long-list' of interesting features, but I have some other fish to fry at the moment.
I've fried some other fish 😂 We've been implementing a SPARQL layer over our existing data-platform (with statement like data in HBase and entity-centric documents in Elasticsearch). I already had an interest in LATERAL
since we have quite some situations where we'd like to query data for e.g. each hit in a search.
So I'm going to investigate with my team whether we can pull this off.
@JervenBolleman, was there any progress from your side? Any pointers that would be relevant here? @VladimirAlexiev, as the initiator of the issue in sparql-dev, was there anything done for this in GraphDB?
@abrokenjester, do you see proverbial bears on the road? I understood that there have been issues with scopes in the past. The LATERAL
operator changes some things here.
@hmottestad, also looping you in here.
Would you guys welcome an implementation of LATERAL in RDF4J? (under the condition that it's correct of course) Should we target develop (v4) or main (v3) for such a contribution?
@frensjan Not since I did this first pass. So the code is wired up but I need to revisit why I did not get the semantics right.
@JervenBolleman, could you help me with a more thorough understanding of SEP-0007 and how it relates to the introduction of LATERAL
in RDF4J?
I get the basic gist of inject versus substitute. However, I doubt if this is actually an issue with RDF4J. I think that RDF4J already applies the semantics proposed, or at least to a large extent.
Note that if RDF4J already behaves according to SEP-0007, which I think it does, this would really simplify the implementation of LATERAL
query evaluation in RDF4J!
The three parts of the proposal and how I think these relate to the RDF4J implementation:
Place the current row mapping variables to values (the RDF terms) so that the variables always get their values from the current row. This is the replacement for syntactic substitution in the original definition.
To me, this seems to corresponds to how RDF4J performs evaluation steps given an initial binding set.
Rename inner scope variables so that variables that are only used within a sub-query are not affected by the current row. This reflects the fact that in SPARQL such variables are not present in solution mappings outside their sub-query.
I see this behaviour in RDF4J, but I'm not 100% as to why this is the case under all circumstances though. I think this is due to the the QueryJoinOptimizer
which re-orders sub-selects before other parts of a graph pattern and the fact that any joins with sub-selects and/or scope changes are evaluated with a hash-join.
As an aside, I think this leaves some performance improvements on the table; e.g., for queries such as:
prefix : <http://example/>
select * where {
?s :label "x"
{
?s :label "y"
}
}
Disallow syntactic forms that set variables that may already be present in the current row. SPARQL solution mappings can only have one binding for a variable, and the current row provides that binding.
This is probably reflected in various parts RDF4J query evaluation. But at least in tests covered by this assertion in QueryBindingSet
:
assert !bindings.containsKey(name) : "variable already bound: " + name;
Hi @frensjan! Thanks for working on this!
was there anything done for this in GraphDB?
Yes, we do that for https://platform.ontotext.com/semantic-objects/index.html, which implements GraphQL by transpiling to SPARQL. LATERAL is absolutely necessary to implement GraphQL correctly, as motivated in https://github.com/w3c/sparql-dev/issues/100.
It uses an empty VALUES
keyword.
It was described at https://www.ontotext.com/blog/graphdb-users-ask-does-graphdb-support-reusing-values-in-sub-selects/ but is now hidden since it's non-standard SPARQL.
Would you guys welcome an implementation of LATERAL in RDF4J? (under the condition that it's correct of course)
Sure!
Should we target develop (v4) or main (v3)
Of course v4. Current GraphDB uses RDF4J 4.3.8 (or am I misunderstanding something?)
- Only selected variables in sub-selects correlate
No problem; it's of direct interest to us for our SPARQL over Elastic + HBase implementation!
We've got a working implementation; hacked into RDF4J 4.3.8 by rewriting a SERVICE <...:lateral> {}
clause as LATERAL
because we didn't want to fork, but it was a blocker for a particular use case involving push-down of ORDER BY
into Elastic. It's in production already 🎉
The current implementation is very similar to the JoinIterator
in RDF4J. A concurrent and pipelined implementation is on the roadmap (we need to mask the latency of HBase and Elasticsearch). But for the initial use cases it's fast enough.
We're having some issues with RDF 4.3.9 though; because of how filter scopes have been changed in https://github.com/eclipse-rdf4j/rdf4j/issues/4769. https://github.com/eclipse-rdf4j/rdf4j/issues/4872 is also linked to this issue.
Should we target develop (v4) or main (v3)
Of course v4. Current GraphDB uses RDF4J 4.3.8 (or am I misunderstanding something?)
Sorry, I meant v4 (main) or v5 (develop). I get the impression that v5 is quite the change from v4 and I would like to steer away from any deprecation, back-porting, etc. issues. I understood from @hmottestad that a v5 release isn't that far away.
- Only selected variables in sub-selects correlate
Good to have this discussion. I agree that it's not super intuitive. I also haven't found al that clear arguments against just binding all variables on the RHS to values from the LHS.
Just finishing up JSON-LD 1.1 support, which will be the final feature of RDF4J 5.0.0. So getting very close. Hopefully a final milestone build in the new couple of weeks.
Would very much welcome support for LATERAL!
@frensjan I can see clearly how LATERAL can be useful in cross-storage (RDF+Elastic) querying scenarios.
GraphDB can issue Elastic queries from SPARQL and can use Elastic specific ordering and limits; then you can join with RDF for further filtering (joining). But you can never be sure whether the Elastic limit will be "enough" to return enough solutions after applying the RDF join.
Definitely!
A bit off topic, and sorry for the lengthy reply, but perhaps interesting to know what our angle is.
Our lateral usage:
LIMIT:
LATERAL is useful, but definitely not the only use case for ES + SPARQL for us. We have indices with 'entity centric documents' with (for all practical purposes) all 'properties' and 'relationships' per entity. HBase contains tall tables with data similar to statements per row, but with a 'source ID' per statement. Note that this database arrangement pre-dates our SPARQL efforts; we're layering SPARQL on top to allow more advanced queries
We integrate HBase and Elasticsearch 'storage' (mostly) transparently for our users. During optimization of the SPARQL algebra we select which statement patterns can be evaluated using the 'entity index' and merge such patterns that share the same subject.
LATERAL is currently used for a few things:
Kind of obvious, but we have a lot of use cases where we want to show information for a certain (maximum) number of entities. So we apply LIMIT
to a query to select entities and then use LATERAL
to query information on them.
# Select title(s) and category/ies for 10 posts that match a certain query:
SELECT * {
{
SELECT DISTINCT ?post {
?post a :post .
?post search:matches [
search:query "SPARQL is cool"
]
}
LIMIT 10
}
LATERAL {
?post :title ?title .
?post :category ?category .
}
}
Sometimes we want to select only the last value of some property using the source (~provenance) information in HBase. E.g. as with:
# Select the last content for post.
SELECT * {
?post ...
LATERAL {
SELECT ?post ?content {
<< ?post :content ?content >> :source ?source .
?source :timestamp ?timestamp .
}
ORDER BY DESC( ?timestamp )
LIMIT 1
}
}
This is obviously only one particular aggregation and others can be useful.
We currently only push down ordering when a (sub) query maps to a entity document query. E.g.:
# Select the ten most liked posts.
SELECT DISTINCT ?post {
?post a :post .
?post :likes ?likes .
}
ORDER BY DESC( ?likes )
LIMIT 10
We use LATERAL
when we want to do filtering based on variables from another query. E.g.:
# Select the ten most liked posts in the same thread as some `?start` post.
SELECT * {
BIND( <...> as ?start )
?start :thread ?thread .
LATERAL {
SELECT DISTINCT ?thread ?post {
?post a :post .
?post :thread ?thread .
?post :likes ?likes .
}
ORDER BY DESC( ?likes )
LIMIT 10
}
}
You might expect a post to be in only one thread, but as we don't have any cardinality constraints (yet) e.g. from OWL or SHACL, the order push down optimizer wouldn't be able to guarantee that so for this query the ordering can't be pushed into Elastic.
# Example query we don't do order by push-down.
SELECT DISTINCT ?post {
BIND( <...> as ?start )
?start :thread ?thread .
# unable to prove that the above part of the graph pattern yields only one solution
?post a :post .
?post :thread ?thread .
?post :likes ?likes .
}
ORDER BY DESC( ?likes )
LIMIT 10
Perhaps some things can be done here. But the query with LATERAL
is a lot easier to optimize and hopefully also easier to comprehend by our users.
We currently don't push down LIMIT
. Our access path for the entity indices is using paging with pretty large pages, so currently we 'over fetch' most of the time as most queries have much lower LIMIT
than what's included in single page.
But you can never be sure whether the Elastic limit will be "enough" to return enough solutions after applying the RDF join.
We are considering this. E.g. for a query such as:
SELECT * {
?post a :post ;
:thread <thread> ;
:title ?title ;
:created ?created ;
:content ?content .
}
ORDER BY DESC( ?created )
LIMIT 10
would be transformed into an Elasticsearch query that would request entity documents which have exists queries for thread
, title
and content
. So any document would yield at least one solution (or more).
That said, most queries would look like:
SELECT * {
{
SELECT DISTINCT ?post {
?post a :post ;
:thread <thread> ;
:title ?title ;
:created ?created ;
:content ?content .
}
ORDER BY DESC( ?created )
LIMIT 10
}
LATERAL {
?post :title ?title ;
:created ?created ;
:content ?content .
}
}
In this case the LIMIT
would correspond exactly with the number of documents to query from Elasticsearch.
Problem description
I would like to investigate how LATERAL could be implemented in RDF4j.
Preferred solution
No response
Are you interested in contributing a solution yourself?
Yes
Alternatives you've considered
No response
Anything else?
No response