
Apache Jena
https://jena.apache.org/
Apache License 2.0

optional streaming construct? #1633

Open SimonBin opened 1 year ago

SimonBin commented 1 year ago

Version

4.7.0-SNAPSHOT

Feature

As far as I can tell, neither tdbquery nor fuseki allows streaming of CONSTRUCT results, even though the API exists. Of course there are concerns about duplicate triples etc., but it could be a nice, heap-space-conserving optional feature

Are you interested in contributing a solution yourself?

No response

rvesse commented 1 year ago

So, having done this for a previous employer's CLI tools for their Graph Database, which used Jena for the user-facing pieces, I can say that this is non-trivial to achieve.

That's not to say that it isn't possible, merely to highlight that there are a few things to be aware of if someone wanted to attempt this:

  1. You likely want to make this an opt-in behaviour, NOT change the existing default behaviour
    • A streaming construct won't suppress duplicate triples, so you could get much larger output than expected
    • If the consumer of the output doesn't cope with duplicate triples properly, this can break larger data pipelines
  2. If a user opts into this behaviour, you need to validate that their selected output format is compatible with streaming.
    • Jena has streaming writers for some languages but not all (including some that could in theory have a streaming writer, but it would be horrendously verbose, e.g. RDF/XML)
      • See WriterStreamRDFPlain (for NTriples), WriterStreamRDFBlocks (for Turtle with limited syntactic sugar), StreamRDF2Thrift and StreamRDF2Protobuf
    • Also worth noting that streaming writers will inherently produce less compressed output, i.e. they can't use all the syntactic sugar of their languages (e.g. Turtle predicate-object lists, collection shorthands, etc.), because those require multiple passes over the full data to work out whether they are usable
    • I don't remember if there is a registry for streaming writers (I remember having to hardcode an if structure for this at the time, but that was ~8 years ago); there might be one now (@afs does that exist now?) or it may need introducing
    • You'll need to propagate the query namespace prefixes to the streaming writer somehow, since you'll be operating with an Iterator<Triple> that won't have any prefixes available, unlike the Model you get from a normal construct evaluation
  3. Then, depending on whether you can use a streaming writer or not, invoke the relevant execConstruct() vs execConstructTriples() method and handle the result accordingly (a rough sketch of this follows below)
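
A minimal, illustrative sketch of that last point for a local TDB2 dataset, assuming N-Triples output (which does have a streaming writer). The Jena classes and methods used here exist, but the format checking, option plumbing and error handling are deliberately left out, and the class and helper names are made up:

import java.io.OutputStream;
import java.util.Iterator;

import org.apache.jena.graph.Triple;
import org.apache.jena.query.*;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;
import org.apache.jena.tdb2.TDB2Factory;

public class StreamingConstructSketch {

    // Hypothetical helper: evaluate a CONSTRUCT query and stream its triples
    // to the output instead of materialising them in a Model first.
    static void streamConstruct(Dataset dataset, String queryString, OutputStream out) {
        Query query = QueryFactory.create(queryString);
        // N-Triples has a streaming writer; other formats would need checking first.
        StreamRDF writer = StreamRDFWriter.getWriterStream(out, Lang.NTRIPLES);
        dataset.begin(ReadWrite.READ);
        try (QueryExecution qexec = QueryExecutionFactory.create(query, dataset)) {
            writer.start();
            // Point 2: propagate the query's prefixes; the Iterator<Triple>
            // itself carries none (only relevant for prefix-capable formats).
            query.getPrefixMapping().getNsPrefixMap().forEach(writer::prefix);
            // Points 1 and 3: unlike execConstruct(), this does not suppress duplicates.
            Iterator<Triple> triples = qexec.execConstructTriples();
            triples.forEachRemaining(writer::triple);
            writer.finish();
        } finally {
            dataset.end();
        }
    }

    public static void main(String[] args) {
        Dataset ds = TDB2Factory.connectDataset(args[0]); // path to a TDB2 location
        streamConstruct(ds, "CONSTRUCT WHERE { ?s ?p ?o }", System.out);
    }
}
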
afs commented 1 year ago

I don't remember if there is a registry for streaming writers

There is. StreamRDFWriter.

opt-in behaviour

Yes.

It could be a new (custom) service delivered as a Fuseki module. Simplest case: a server that calls constructTriples and streams back N-Triples or one of the Turtle formats that is streaming.

This can be done as a split between a SELECT query stream returning the WHERE clause bindings and client-side processing to apply the template.

That gives the caller a way to control the potentially very large stream that "disappears" in the set semantics of CONSTRUCT.

If they don't care about getting everything, just the streaming, there is SELECT REDUCED (or, with LATERAL, limiting per result). There are options here, so pushing all the work to a fixed algorithm in Fuseki may not be that helpful.

The stream could be chunked - or results could be returned to the application in a certain order, such as same subject together - via a combination of a SELECT query and chunking of results in the client-side processing.
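
As a rough illustration of that split (this is not an existing Jena or Fuseki feature), a client could stream SELECT REDUCED bindings from an endpoint and apply the template itself, emitting one or more output triples per row. The endpoint URL, query and property below are made up:

import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.query.*;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;

public class SelectTemplateClient {
    public static void main(String[] args) {
        String endpoint = "http://localhost:3030/ds/sparql";  // assumed Fuseki endpoint
        // The WHERE clause of the original CONSTRUCT, sent as SELECT REDUCED.
        String select = "SELECT REDUCED ?s ?o WHERE { ?s <http://example/p> ?o }";
        StreamRDF out = StreamRDFWriter.getWriterStream(System.out, Lang.NTRIPLES);
        out.start();
        try (QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, select)) {
            ResultSet rows = qexec.execSelect();
            while (rows.hasNext()) {
                QuerySolution row = rows.next();
                // Client-side "template": emit the triple(s) for this binding.
                out.triple(Triple.create(
                        row.get("s").asNode(),
                        NodeFactory.createURI("http://example/p"),
                        row.get("o").asNode()));
            }
        }
        out.finish();
    }
}

Chunking, or ordering such as ORDER BY ?s to keep same-subject triples together, would be additions on the SELECT side; the client loop stays the same.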

LorenzBuehmann commented 1 year ago

Thanks @rvesse and @afs for advice. We stumbled upon this need when trying to export a larger subset of loaded data. Some facts:

Dataset: 257 288 501 triples loaded into TDB2, consuming 52 GB of disk space
Size of subset: 196 423 885 triples, resulting in a 26 GB N-Triples file

Using tdb2.tdbquery with 32 GB of heap, we got an OOM after 22 min:

JVM_ARGS="-Xmx32G" tdb2.tdbquery --loc tdb2/siren --query subset.rq --results=N-Triples > subset.nt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.jena.mem.HashedBunchMap.newKeyArray(HashedBunchMap.java:39)
        at org.apache.jena.mem.HashedBunchMap.grow(HashedBunchMap.java:99)
        at org.apache.jena.mem.HashedBunchMap.put$(HashedBunchMap.java:90)
        at org.apache.jena.mem.HashedBunchMap.put(HashedBunchMap.java:70)
        at org.apache.jena.mem.NodeToTriplesMapMem.add(NodeToTriplesMapMem.java:51)
        at org.apache.jena.mem.GraphTripleStoreBase.add(GraphTripleStoreBase.java:60)
        at org.apache.jena.mem.GraphMem.performAdd(GraphMem.java:42)
        at org.apache.jena.graph.impl.GraphBase.add(GraphBase.java:169)
        at org.apache.jena.sparql.graph.GraphOps.addAll(GraphOps.java:75)
        at org.apache.jena.sparql.exec.QueryExecDataset.construct(QueryExecDataset.java:187)
        at org.apache.jena.sparql.exec.QueryExec.construct(QueryExec.java:111)
        at org.apache.jena.sparql.exec.QueryExecutionAdapter.execConstruct(QueryExecutionAdapter.java:122)
        at org.apache.jena.sparql.exec.QueryExecutionCompat.execConstruct(QueryExecutionCompat.java:105)
        at org.apache.jena.sparql.util.QueryExecUtils.doConstructQuery(QueryExecUtils.java:197)
        at org.apache.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:113)
        at arq.query.lambda$queryExec$0(query.java:237)
        at arq.query$$Lambda$188/0x00007fb183cfd168.run(Unknown Source)
        at org.apache.jena.system.Txn.exec(Txn.java:77)
        at org.apache.jena.system.Txn.executeRead(Txn.java:115)
        at arq.query.queryExec(query.java:234)
        at arq.query.exec(query.java:157)
        at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:87)
        at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:56)
        at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:43)
        at tdb2.tdbquery.main(tdbquery.java:30)

With 64 GB assigned, it worked in 26 min.

Taking the advice from Andy into account, I combined SELECT REDUCED with TARQL:

tdb2.tdbquery --loc tdb2/siren --query subset_select.rq --results=CSV | ../ukch/tarql-1.2/bin/tarql --ntriples --stdin subset_template.tarql subset.csv > tarql_dump.nt

That works without increasing the memory and produces a 31 GB N-Triples file containing 235 632 534 triples, with a runtime of 19 min, i.e. there are lots of duplicates. So, for TARQL you can basically reuse the CONSTRUCT template, but you have to keep in mind to recreate the IRIs and bind them to new variables. But it works, and it would be the only option on my laptop, for example.

afs commented 1 year ago

You could use TSV output and use sed to put a . on the end of each line.

TSV uses RDF syntax for terms.

LorenzBuehmann commented 1 year ago

Nice option, but this would only work for templates producing a single triple pattern I think. In cases like

CONSTRUCT {
  ?s :p1 ?o1 ;
     :p2 ?o2 .
  ?o1 a :A .
} WHERE {
  ....
}

we have to cope with bindings with more than 3 variables and/or with the fixed properties missing. But TARQL is fine; it can read from a stream.

afs commented 1 year ago

Not really - put a UNION branch for each s/p/o triple to generate, and use LATERAL.
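
For example (a sketch only, reusing the template above with the variables renamed so ?s/?p/?o are free for the output; the exact BIND/LATERAL formulation may need adjusting, and LATERAL requires a Jena version that supports it): each template triple becomes one UNION branch evaluated per row of the WHERE clause, so every result row is a single ?s ?p ?o triple.

import org.apache.jena.query.*;

public class UnionLateralRewrite {
    public static void main(String[] args) {
        // One UNION branch per template triple, joined to the original WHERE
        // clause via LATERAL; prefix : and the pattern are illustrative.
        String rewritten = """
            PREFIX :    <http://example/>
            PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
            SELECT REDUCED ?s ?p ?o WHERE {
              { ?x :p1 ?v1 ; :p2 ?v2 }          # the original WHERE clause
              LATERAL {
                  { BIND(?x  AS ?s) BIND(:p1      AS ?p) BIND(?v1 AS ?o) }
                UNION
                  { BIND(?x  AS ?s) BIND(:p2      AS ?p) BIND(?v2 AS ?o) }
                UNION
                  { BIND(?v1 AS ?s) BIND(rdf:type AS ?p) BIND(:A  AS ?o) }
              }
            }
            """;
        Dataset ds = DatasetFactory.create(); // stand-in for the real TDB2 dataset
        try (QueryExecution qexec = QueryExecutionFactory.create(rewritten, ds)) {
            // Each output row is one term triple, ready for streaming or post-processing.
            ResultSetFormatter.outputAsTSV(System.out, qexec.execSelect());
        }
    }
}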

SimonBin commented 1 year ago

AFAIK the problem with TSV is multiline literals (?) - you cannot just add a . to the end of each line...

afs commented 1 year ago

(?) - did you check :grey_question:

SimonBin commented 1 year ago

I just tried it on a simple example and Jena does not output multiline Turtle literals by default; it uses "...\n", so I guess TSV should be fine

afs commented 1 year ago

Jena doesn't, and it's not even an option. It would break the TSV format.