Claudenw / jena-on-cassandra

An implementation of the Jena storage layer on the Cassandra storage engine.

OpExecutor implementation needed #2

Open Claudenw opened 7 years ago

Claudenw commented 7 years ago

Implementing OpExecutor should make SPARQL queries execute faster.

Subclass OpExecutor and override protected QueryIterator execute(OpBGP opBGP, QueryIterator input).

Implement Joins and OpBGP operations.

Notes: Factors that drive scale include being able to do merge joins over streaming results from Cassandra.

Otherwise, a parallel hash join means multiple CQL statements can be active at one time, which is more demanding of client-side resources.
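To make the merge join idea concrete, here is a minimal sketch in plain Java. It assumes both result streams arrive sorted ascending on the join key and contain no duplicates or nulls; the class and method names are hypothetical and nothing here is Jena or Cassandra API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: merge join over two sorted, duplicate-free key streams.
final class MergeJoinSketch {

    /** Returns the keys present in both streams, reading each stream once. */
    static List<String> mergeJoin(Iterator<String> left, Iterator<String> right) {
        List<String> out = new ArrayList<>();
        String l = left.hasNext() ? left.next() : null;
        String r = right.hasNext() ? right.next() : null;
        while (l != null && r != null) {
            int cmp = l.compareTo(r);
            if (cmp == 0) {
                // Keys match: emit and advance both sides.
                out.add(l);
                l = left.hasNext() ? left.next() : null;
                r = right.hasNext() ? right.next() : null;
            } else if (cmp < 0) {
                // Left key is smaller, so it cannot match anything later on the right.
                l = left.hasNext() ? left.next() : null;
            } else {
                r = right.hasNext() ? right.next() : null;
            }
        }
        return out;
    }
}
```

Only one element per side is held at a time, which is why this scales better on the client than a hash join that has to buffer one input.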

Cassandra does not seem to do joins (no "or" clause). The expectation is that the client will do the joins or that the data will be stored so that joins are not needed.

I have taken to thinking about Cassandra as a massive collection of named graphs, with the ability to extract subgraphs from that collection that are then queried for the solution.

So in the OpExecutor design, start by breaking the BGP up into groups based on the subject.
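The grouping step could look roughly like the sketch below. It is plain Java rather than Jena code: each triple pattern is represented as a hypothetical three-element String array {subject, predicate, object}, where a real implementation would presumably work on the triples of the OpBGP.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: split a BGP into groups of patterns sharing a subject.
final class SubjectGroupingSketch {

    /** Groups patterns by their subject term, preserving first-seen order. */
    static Map<String, List<String[]>> groupBySubject(List<String[]> bgp) {
        Map<String, List<String[]>> groups = new LinkedHashMap<>();
        for (String[] pattern : bgp) {
            // pattern[0] is the subject; create the group on first use.
            groups.computeIfAbsent(pattern[0], k -> new ArrayList<>()).add(pattern);
        }
        return groups;
    }
}
```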

Find the subject group that has the most qualified statement (i.e. the statement with the fewest unknowns) and start resolving the subject based on that. Basically, pull back the equivalent of CONSTRUCT { ?s ?p ?o } WHERE { BIND( <subject> AS ?s ) ?s ?p ?o } and place the result into a temporary local graph (in memory, or a small TDB or SDB). Then iterate through the groups from the BGP, performing the resolution (and adding bindings from the temporary graph) and adding the results to the temporary graph as we go.
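Picking the starting group could be sketched as follows. This assumes the same hypothetical String-array representation of a triple pattern, treats any term starting with "?" as an unknown, and breaks ties in favour of the group seen first; none of these choices come from the issue itself.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: choose the subject group holding the most qualified statement.
final class MostQualifiedSketch {

    /** Counts the variable terms (those starting with "?") in one pattern. */
    static int unknowns(String[] pattern) {
        int count = 0;
        for (String term : pattern) {
            if (term.startsWith("?")) {
                count++;
            }
        }
        return count;
    }

    /** Returns the subject whose group contains the pattern with the fewest unknowns. */
    static String startingSubject(Map<String, List<String[]>> groups) {
        String best = null;
        int bestUnknowns = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String[]>> entry : groups.entrySet()) {
            for (String[] pattern : entry.getValue()) {
                int u = unknowns(pattern);
                if (u < bestUnknowns) { // strict comparison: the first group wins ties
                    bestUnknowns = u;
                    best = entry.getKey();
                }
            }
        }
        return best;
    }
}
```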

Finally perform a query against the local graph to return all the results properly.

Claudenw commented 7 years ago

TDB keeps a singleton mapping for StoreConnections: a distinguishing key (usually a string, the canonicalized directory name) maps to a single StoreConnection per location.

The assembler builds the lookup key, not the connection.
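A comparable pattern for this project might look like the sketch below. It is an illustration only: Connection is a hypothetical stand-in rather than TDB's StoreConnection, and the key is a normalized absolute path, which approximates a canonical directory name but does not resolve symbolic links.

```java
import java.nio.file.Paths;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one shared connection object per distinguishing key.
final class ConnectionRegistrySketch {

    /** Stand-in for a store connection; not a Jena or TDB class. */
    static final class Connection {
        final String key;

        Connection(String key) {
            this.key = key;
        }
    }

    private static final Map<String, Connection> CACHE = new ConcurrentHashMap<>();

    /** Builds only the lookup key, as the assembler would; opens nothing. */
    static String keyFor(String directory) {
        return Paths.get(directory).toAbsolutePath().normalize().toString();
    }

    /** Returns the single connection for the key, creating it on first use. */
    static Connection connect(String key) {
        return CACHE.computeIfAbsent(key, Connection::new);
    }
}
```

Keeping key construction separate from connection creation means two differently spelled paths to the same location resolve to the same connection instance.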