eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License

Investigate SPARQL CG Lateral proposal-006 #4315

Open JervenBolleman opened 1 year ago

JervenBolleman commented 1 year ago

Problem description

I would like to investigate how LATERAL could be implemented in RDF4J.

Preferred solution

No response

Are you interested in contributing a solution yourself?

Yes

Alternatives you've considered

No response

Anything else?

No response

JervenBolleman commented 1 year ago

OK, I got it to parse, but it required many more changes to the JavaCC-generated code than I wanted. Still fine for a prototype. Next step: evaluation :)

kenwenzel commented 1 year ago

Just because I am a bit curious: Is this something that could also be achieved by using the bind operator after the left-hand side to force the evaluation?

I am also interested in controlling the evaluation order because we are developing integrations with time-series databases where we first query structural data from an RDF store and then related time-series data: https://github.com/linkedfactory/specification

JervenBolleman commented 1 year ago

@kenwenzel I don't think you can use bind to get to the same behavior.

Anyway, this is to the point of having failing test cases. Everyone is welcome to poke at this and have a go at improving it.

JervenBolleman commented 1 year ago

@frensjan could you have a look at this. You seem to have a good grasp of the grammar and how it is supposed to work.

frensjan commented 1 year ago

It's on my 'long-list' of interesting features, but I have some other fish to fry at the moment.

frensjan commented 9 months ago

I've fried some other fish 😂 We've been implementing a SPARQL layer over our existing data platform (with statement-like data in HBase and entity-centric documents in Elasticsearch). I already had an interest in LATERAL, since we have quite a few situations where we'd like to query data for, e.g., each hit in a search.

So I'm going to investigate with my team whether we can pull this off.

@JervenBolleman, was there any progress from your side? Any pointers that would be relevant here? @VladimirAlexiev, as the initiator of the issue in sparql-dev, was there anything done for this in GraphDB?

@abrokenjester, do you see proverbial bears on the road? I understood that there have been issues with scopes in the past. The LATERAL operator changes some things here.

@hmottestad, also looping you in here.

Would you guys welcome an implementation of LATERAL in RDF4J? (under the condition that it's correct of course) Should we target develop (v4) or main (v3) for such a contribution?

JervenBolleman commented 9 months ago

@frensjan Not since I did this first pass. So the code is wired up but I need to revisit why I did not get the semantics right.

frensjan commented 9 months ago

@JervenBolleman, could you help me with a more thorough understanding of SEP-0007 and how it relates to the introduction of LATERAL in RDF4J?

I get the basic gist of inject versus substitute. However, I doubt whether this is actually an issue for RDF4J. I think that RDF4J already applies the proposed semantics, or at least to a large extent.

Note that if RDF4J already behaves according to SEP-0007, which I think it does, this would really simplify the implementation of LATERAL query evaluation in RDF4J!


The three parts of the proposal and how I think these relate to the RDF4J implementation:

1. Current row as 'initial binding set'

Place the current row mapping variables to values (the RDF terms) so that the variables always get their values from the current row. This is the replacement for syntactic substitution in the original definition.

To me, this seems to correspond to how RDF4J performs evaluation given an initial binding set.
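For illustration, a minimal sketch of what "evaluation given an initial binding set" looks like at the RDF4J API level. The in-memory repository, the ex: prefix, the example IRI and the query are made up for this sketch; the calls themselves (prepareTupleQuery, setBinding, evaluate) are standard RDF4J.

import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.query.TupleQuery;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.memory.MemoryStore;

public class InitialBindingSketch {
    public static void main(String[] args) {
        ValueFactory vf = SimpleValueFactory.getInstance();
        SailRepository repo = new SailRepository(new MemoryStore());
        try (RepositoryConnection con = repo.getConnection()) {
            // Hypothetical right-hand side of a LATERAL; ?post is the correlated variable.
            TupleQuery rhs = con.prepareTupleQuery(
                    "PREFIX ex: <http://example/> SELECT ?title WHERE { ?post ex:title ?title }");
            // The pre-bound variable plays the role of the "current row": ?post takes its
            // value from the binding supplied here, not from the pattern itself.
            rhs.setBinding("post", vf.createIRI("http://example/post1"));
            try (TupleQueryResult result = rhs.evaluate()) {
                while (result.hasNext()) {
                    System.out.println(result.next().getValue("title"));
                }
            }
        } finally {
            repo.shutDown();
        }
    }
}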

2. Only selected variables in sub-selects correlate

Rename inner scope variables so that variables that are only used within a sub-query are not affected by the current row. This reflects the fact that in SPARQL such variables are not present in solution mappings outside their sub-query.

I see this behaviour in RDF4J, but I'm not 100% sure why this is the case under all circumstances. I think it is due to the QueryJoinOptimizer, which re-orders sub-selects before other parts of a graph pattern, and the fact that any joins with sub-selects and/or scope changes are evaluated with a hash join.
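For intuition, a simplified sketch of such a hash join, with solution mappings modelled as plain maps (this is not RDF4J's actual HashJoinIteration, and HashJoinSketch is a made-up name): the join key is built only from the variables that are in scope on both sides, which is exactly why variables used purely inside a sub-select cannot correlate with the outer pattern.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HashJoinSketch {

    // Joins two sets of solutions on the given shared variables (inner-join semantics,
    // assuming the shared variables are bound on both sides).
    public static List<Map<String, String>> hashJoin(
            List<Map<String, String>> left,
            List<Map<String, String>> right,
            Set<String> sharedVariables) {

        // Build phase: index the right-hand solutions by their shared-variable values.
        Map<List<String>, List<Map<String, String>>> index = new HashMap<>();
        for (Map<String, String> solution : right) {
            index.computeIfAbsent(key(solution, sharedVariables), k -> new ArrayList<>()).add(solution);
        }

        // Probe phase: look up each left-hand solution and merge it with the matches.
        List<Map<String, String>> joined = new ArrayList<>();
        for (Map<String, String> leftSolution : left) {
            for (Map<String, String> match : index.getOrDefault(key(leftSolution, sharedVariables), List.of())) {
                Map<String, String> merged = new LinkedHashMap<>(leftSolution);
                merged.putAll(match);
                joined.add(merged);
            }
        }
        return joined;
    }

    private static List<String> key(Map<String, String> solution, Set<String> sharedVariables) {
        List<String> key = new ArrayList<>();
        for (String name : sharedVariables) {
            key.add(solution.get(name));
        }
        return key;
    }
}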

As an aside, I think this leaves some performance improvements on the table; e.g., for queries such as:

prefix : <http://example/>
select * where {
    ?s :label "x"
    {
        ?s :label "y"
    }
}

3. Disallow re-binding

Disallow syntactic forms that set variables that may already be present in the current row. SPARQL solution mappings can only have one binding for a variable, and the current row provides that binding.

This is probably reflected in various parts of RDF4J query evaluation, but it is at least covered in tests by this assertion in QueryBindingSet:

assert !bindings.containsKey(name) : "variable already bound: " + name;
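As a small illustration of the kind of re-binding this guards against, a sketch assuming assertions are enabled with -ea (the IRIs are made up, and it is an assumption here that addBinding is the code path guarded by the quoted check):

import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.query.algebra.evaluation.QueryBindingSet;

public class RebindingSketch {
    public static void main(String[] args) {
        ValueFactory vf = SimpleValueFactory.getInstance();
        QueryBindingSet bindings = new QueryBindingSet();
        bindings.addBinding("post", vf.createIRI("http://example/post1"));
        // A solution mapping can hold only one binding per variable; when run with -ea,
        // adding a second binding for the same variable trips the quoted assertion.
        bindings.addBinding("post", vf.createIRI("http://example/post2"));
    }
}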

VladimirAlexiev commented 9 months ago

Hi @frensjan! Thanks for working on this!

was there anything done for this in GraphDB?

Yes, we do that for https://platform.ontotext.com/semantic-objects/index.html, which implements GraphQL by transpiling to SPARQL. LATERAL is absolutely necessary to implement GraphQL correctly, as motivated in https://github.com/w3c/sparql-dev/issues/100.

It uses an empty VALUES keyword. It was described at https://www.ontotext.com/blog/graphdb-users-ask-does-graphdb-support-reusing-values-in-sub-selects/ but is now hidden since it's non-standard SPARQL.

Would you guys welcome an implementation of LATERAL in RDF4J? (under the condition that it's correct of course)

Sure!

Should we target develop (v4) or main (v3)

Of course v4. Current GraphDB uses RDF4J 4.3.8 (or am I misunderstanding something?)

  2. Only selected variables in sub-selects correlate

I raised https://github.com/w3c/sparql-dev/issues/194

frensjan commented 9 months ago

No problem; it's of direct interest to us for our SPARQL over Elastic + HBase implementation!

We've got a working implementation, hacked into RDF4J 4.3.8 by rewriting a SERVICE <...:lateral> {} clause as LATERAL: we didn't want to fork, but LATERAL was a blocker for a particular use case involving push-down of ORDER BY into Elastic. It's in production already 🎉

The current implementation is very similar to the JoinIterator in RDF4J. A concurrent and pipelined implementation is on the roadmap (we need to mask the latency of HBase and Elasticsearch). But for the initial use cases it's fast enough.
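For reference, a minimal sketch of that loop shape, with solutions modelled as plain maps rather than BindingSets (so not RDF4J's actual JoinIterator, and LateralJoinSketch is a made-up name): for each left-hand solution, the right-hand side is re-evaluated with that solution as its input, and each right-hand result is merged with it.

import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.Function;

public class LateralJoinSketch implements Iterator<Map<String, String>> {

    private final Iterator<Map<String, String>> left;
    // Evaluates the right-hand pattern given the current left-hand solution.
    private final Function<Map<String, String>, Iterator<Map<String, String>>> rightEvaluator;

    private Map<String, String> currentLeft;
    private Iterator<Map<String, String>> currentRight = Collections.emptyIterator();

    public LateralJoinSketch(Iterator<Map<String, String>> left,
            Function<Map<String, String>, Iterator<Map<String, String>>> rightEvaluator) {
        this.left = left;
        this.rightEvaluator = rightEvaluator;
    }

    @Override
    public boolean hasNext() {
        while (!currentRight.hasNext()) {
            if (!left.hasNext()) {
                return false;
            }
            currentLeft = left.next();
            // The correlation: the right-hand side sees the current left-hand solution.
            currentRight = rightEvaluator.apply(currentLeft);
        }
        return true;
    }

    @Override
    public Map<String, String> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Map<String, String> merged = new LinkedHashMap<>(currentLeft);
        merged.putAll(currentRight.next());
        return merged;
    }
}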

We're having some issues with RDF4J 4.3.9 though, because of how filter scopes were changed in https://github.com/eclipse-rdf4j/rdf4j/issues/4769. https://github.com/eclipse-rdf4j/rdf4j/issues/4872 is also linked to this issue.


Should we target develop (v4) or main (v3)

Of course v4. Current GraphDB uses RDF4J 4.3.8 (or am I misunderstanding something?)

Sorry, I meant v4 (main) or v5 (develop). I get the impression that v5 is quite the change from v4 and I would like to steer away from any deprecation, back-porting, etc. issues. I understood from @hmottestad that a v5 release isn't that far away.


  2. Only selected variables in sub-selects correlate

I raised https://github.com/w3c/sparql-dev/issues/194

Good to have this discussion. I agree that it's not super intuitive. I also haven't found very clear arguments against just binding all variables on the RHS to values from the LHS.

hmottestad commented 9 months ago

Just finishing up JSON-LD 1.1 support, which will be the final feature of RDF4J 5.0.0. So getting very close. Hopefully there will be a final milestone build in the next couple of weeks.

Would very much welcome support for LATERAL!

VladimirAlexiev commented 9 months ago

@frensjan I can see clearly how LATERAL can be useful in cross-storage (RDF+Elastic) querying scenarios.

GraphDB can issue Elastic queries from SPARQL and can use Elastic-specific ordering and limits; you can then join with RDF for further filtering (joining). But you can never be sure whether the Elastic limit will be "enough" to return enough solutions after applying the RDF join.

frensjan commented 9 months ago

Definitely!

A bit off topic, and sorry for the lengthy reply, but perhaps interesting to know what our angle is.

TL;DR: our use of SPARQL, how we use LATERAL, and our take on LIMIT push-down are covered below.

Our use of SPARQL

LATERAL is useful, but definitely not the only use case for ES + SPARQL for us. We have indices with 'entity-centric documents' containing (for all practical purposes) all 'properties' and 'relationships' per entity. HBase contains tall tables with data similar to statements per row, but with a 'source ID' per statement. Note that this database arrangement pre-dates our SPARQL efforts; we're layering SPARQL on top to allow more advanced queries.

We integrate HBase and Elasticsearch 'storage' (mostly) transparently for our users. During optimization of the SPARQL algebra we select which statement patterns can be evaluated using the 'entity index' and merge such patterns that share the same subject.
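Roughly, the selection step boils down to something like the following sketch. This is not our actual optimizer code (SubjectGroupingSketch is a made-up name, and the merge into entity-index lookups is omitted); it only groups an algebra tree's statement patterns by subject variable using RDF4J's visitor API.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.eclipse.rdf4j.query.algebra.StatementPattern;
import org.eclipse.rdf4j.query.algebra.TupleExpr;
import org.eclipse.rdf4j.query.algebra.helpers.AbstractQueryModelVisitor;

public class SubjectGroupingSketch {

    // Groups the statement patterns of an algebra tree by the name of their subject
    // variable, as a starting point for merging same-subject patterns into a single
    // entity-index lookup.
    public static Map<String, List<StatementPattern>> groupBySubject(TupleExpr expr) {
        Map<String, List<StatementPattern>> groups = new LinkedHashMap<>();
        expr.visit(new AbstractQueryModelVisitor<RuntimeException>() {
            @Override
            public void meet(StatementPattern pattern) {
                String subject = pattern.getSubjectVar().getName();
                groups.computeIfAbsent(subject, k -> new ArrayList<>()).add(pattern);
            }
        });
        return groups;
    }
}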

Lateral usage

LATERAL is currently used for a few things:

To apply limits at the entity level

Kind of obvious, but we have a lot of use cases where we want to show information for a certain (maximum) number of entities. So we apply LIMIT to a query to select entities and then use LATERAL to query information on them.

# Select title(s) and category/ies for 10 posts that match a certain query:

SELECT * {
  {
    SELECT DISTINCT ?post {
       ?post a :post .
       ?post search:matches [
         search:query "SPARQL is cool"
       ]
    }
    LIMIT 10
  }

  LATERAL {
    ?post :title ?title .
    ?post :category ?category .
  }
}

LIMIT and aggregations

Sometimes we want to select only the last value of some property, using the source (~provenance) information in HBase. For example:

# Select the last content for post.

SELECT * {
    ?post ... 
    LATERAL {
        SELECT ?post ?content {
            << ?post :content ?content >> :source ?source .
            ?source :timestamp ?timestamp .
        }
        ORDER BY DESC( ?timestamp )
        LIMIT 1
    }
}

This is obviously only one particular aggregation and others can be useful.

To allow a particular optimization for pushing down ordering

We currently only push down ordering when a (sub)query maps to an entity document query. E.g.:

# Select the ten most liked posts.

SELECT DISTINCT ?post {
   ?post a :post .
   ?post :likes ?likes .
}
ORDER BY DESC( ?likes )
LIMIT 10

We use LATERAL when we want to do filtering based on variables from another query. E.g.:

# Select the ten most liked posts in the same thread as some `?start` post. 

SELECT * {
  BIND( <...> as ?start )
  ?start :thread ?thread .

  LATERAL {
    SELECT DISTINCT ?thread ?post {
       ?post a :post .
       ?post :thread ?thread .
       ?post :likes ?likes .
    }
    ORDER BY DESC( ?likes )
    LIMIT 10
  }
}

You might expect a post to be in only one thread, but as we don't have any cardinality constraints (yet), e.g. from OWL or SHACL, the order push-down optimizer wouldn't be able to guarantee that, so for the following query the ordering can't be pushed into Elastic:

# Example query where we don't do ORDER BY push-down.

SELECT DISTINCT ?post {
  BIND( <...> as ?start )
  ?start :thread ?thread .
  # unable to prove that the above part of the graph pattern yields only one solution

  ?post a :post .
  ?post :thread ?thread .
  ?post :likes ?likes .
}
ORDER BY DESC( ?likes )
LIMIT 10

Perhaps some things can be done here. But the query with LATERAL is a lot easier to optimize and hopefully also easier for our users to comprehend.

LIMIT

We currently don't push down LIMIT. Our access path for the entity indices uses paging with pretty large pages, so currently we 'over-fetch' most of the time, as most queries have a much lower LIMIT than what's included in a single page.

But you can never be sure whether the Elastic limit will be "enough" to return enough solutions after applying the RDF join.

We are considering this. E.g., a query such as:

SELECT * {
    ?post a :post ;
          :thread <thread> ;
          :title ?title ;
          :created ?created ;
          :content ?content .
}
ORDER BY DESC( ?created )
LIMIT 10

would be transformed into an Elasticsearch query requesting entity documents, with exists queries for thread, title and content. So any returned document would yield at least one solution (or more).

That said, most queries would look like:

SELECT * {
  {
    SELECT DISTINCT ?post {
        ?post a :post ;
              :thread <thread> ;
              :title ?title ;
              :created ?created ;
              :content ?content .
    }
    ORDER BY DESC( ?created )
    LIMIT 10
  }

  LATERAL {
    ?post :title ?title ;
          :created ?created ;
          :content ?content .
  }
}

In this case the LIMIT would correspond exactly with the number of documents to query from Elasticsearch.