SimonBin opened 1 year ago
So, having done this for a previous employer's CLI tools for their graph database, which used Jena for the user-facing pieces, I can say that this is non-trivial to achieve.
That's not to say that it isn't possible, merely to highlight that there are a few things to be aware of if someone wanted to attempt this:
- Jena has streaming writers: WriterStreamRDFPlain (for NTriples), WriterStreamRDFBlocks (for Turtle with limited syntactic sugar), StreamRDF2Thrift and StreamRDF2Protobuf
- I don't remember if there was a registry structure for this at the time (but that was ~8 years ago now); there might be one now (@afs does that exist now?) or it may need introducing
- A streaming CONSTRUCT gives you an Iterator<Triple> that won't have any prefixes available, unlike the Model you get from a normal construct evaluation
- The tools have to choose between the execConstruct() vs execConstructTriples() methods and handle the result accordingly

> I don't remember if there is a registry for streaming writers

There is. StreamRDFWriter.
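For reference, the streaming pieces mentioned above can be wired together via execConstructTriples() plus StreamRDFWriter. A minimal sketch, assuming a TDB2 dataset at a placeholder path and an inline stand-in query (not the thread's actual subset.rq); note that, unlike execConstruct(), duplicate triples are not removed:

```java
import java.util.Iterator;

import org.apache.jena.graph.Triple;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;
import org.apache.jena.tdb2.TDB2Factory;

public class StreamingConstruct {
    public static void main(String[] args) {
        // Placeholder TDB2 location and query, for illustration only.
        Dataset ds = TDB2Factory.connectDataset("tdb2/example");
        String query = "CONSTRUCT WHERE { ?s ?p ?o }";
        ds.begin(ReadWrite.READ);
        try (QueryExecution qexec = QueryExecutionFactory.create(query, ds)) {
            // N-Triples is line-oriented and prefix-free, so it can be
            // written triple-by-triple without materialising a Model.
            StreamRDF out = StreamRDFWriter.getWriterStream(System.out, Lang.NTRIPLES);
            out.start();
            Iterator<Triple> triples = qexec.execConstructTriples();
            triples.forEachRemaining(out::triple);
            out.finish();
        } finally {
            ds.end();
        }
    }
}
```

Memory use stays flat regardless of result size, because no triple is retained after it is written.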
> opt-in behaviour

Yes.
It could be a new (custom) service delivered as a Fuseki module. Simplest case: a server that calls constructTriples and streams back N-Triples or one of the Turtle formats that is streaming.
This can be done as a split between a SELECT query stream returning the WHERE clause and a client side processing to apply the template.
That gives the caller a way to control the potentially very large stream that "disappears" in the set semantics of CONSTRUCT.
If they don't care about everything, just the streaming, SELECT REDUCED (or, with LATERAL, limited per result). There are options here, so pushing all the work to a fixed algorithm in Fuseki may not be that helpful.
The stream could be chunked, or return results to the application in certain orders (e.g. same subject), via a combination of SELECT query and chunking the results in the client-side processing.
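Why SELECT REDUCED helps here: REDUCED permits, but does not require, duplicate elimination, so a streaming engine may drop just adjacent duplicates in O(1) memory, whereas DISTINCT must remember every row it has seen. A toy illustration of the adjacent-duplicate idea in plain Java (not Jena's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class ReducedStream {
    // REDUCED-style filter: drop *adjacent* duplicates only.
    // Needs to remember just the previous row, not the whole result set.
    static List<String> reduced(List<String> rows) {
        List<String> out = new ArrayList<>();
        String prev = null;
        for (String r : rows) {
            if (!r.equals(prev)) {
                out.add(r);
            }
            prev = r;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a", "a", "b", "a", "b", "b");
        // Only runs of identical rows collapse: [a, b, a, b]
        System.out.println(reduced(rows));
    }
}
```

With sorted input (e.g. ORDER BY over the projected terms) this removes all duplicates; unsorted, it still trims repeats without the memory cost of DISTINCT.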
Thanks @rvesse and @afs for the advice. We stumbled upon this need when trying to export a larger subset of loaded data. Some facts:

- Dataset: 257 288 501 triples loaded into TDB2, consuming 52GB of disk space
- Size of subset: 196 423 885 triples, resulting in a 26GB N-Triples file
Using tdb2.tdbquery with 32GB we got an OOM after 22min:
JVM_ARGS="-Xmx32G" tdb2.tdbquery --loc tdb2/siren --query subset.rq --results=N-Triples > subset.nt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.jena.mem.HashedBunchMap.newKeyArray(HashedBunchMap.java:39)
at org.apache.jena.mem.HashedBunchMap.grow(HashedBunchMap.java:99)
at org.apache.jena.mem.HashedBunchMap.put$(HashedBunchMap.java:90)
at org.apache.jena.mem.HashedBunchMap.put(HashedBunchMap.java:70)
at org.apache.jena.mem.NodeToTriplesMapMem.add(NodeToTriplesMapMem.java:51)
at org.apache.jena.mem.GraphTripleStoreBase.add(GraphTripleStoreBase.java:60)
at org.apache.jena.mem.GraphMem.performAdd(GraphMem.java:42)
at org.apache.jena.graph.impl.GraphBase.add(GraphBase.java:169)
at org.apache.jena.sparql.graph.GraphOps.addAll(GraphOps.java:75)
at org.apache.jena.sparql.exec.QueryExecDataset.construct(QueryExecDataset.java:187)
at org.apache.jena.sparql.exec.QueryExec.construct(QueryExec.java:111)
at org.apache.jena.sparql.exec.QueryExecutionAdapter.execConstruct(QueryExecutionAdapter.java:122)
at org.apache.jena.sparql.exec.QueryExecutionCompat.execConstruct(QueryExecutionCompat.java:105)
at org.apache.jena.sparql.util.QueryExecUtils.doConstructQuery(QueryExecUtils.java:197)
at org.apache.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:113)
at arq.query.lambda$queryExec$0(query.java:237)
at arq.query$$Lambda$188/0x00007fb183cfd168.run(Unknown Source)
at org.apache.jena.system.Txn.exec(Txn.java:77)
at org.apache.jena.system.Txn.executeRead(Txn.java:115)
at arq.query.queryExec(query.java:234)
at arq.query.exec(query.java:157)
at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:87)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:56)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:43)
at tdb2.tdbquery.main(tdbquery.java:30)
With 64GB assigned it worked in 26min.
Taking the advice from Andy into account, I combined SELECT REDUCED with TARQL:
tdb2.tdbquery --loc tdb2/siren --query subset_select.rq --results=CSV | ../ukch/tarql-1.2/bin/tarql --ntriples --stdin subset_template.tarql subset.csv > tarql_dump.nt
That works without increasing the memory and produces a 31GB N-Triples file containing 235 632 534 triples with a runtime of 19min, i.e. there are lots of duplicates. So, for TARQL you can basically reuse the CONSTRUCT template, but you have to keep in mind to recreate the IRIs and bind them to new variables. Still, it works, and it would be the only option on my laptop, for example.
You could use TSV and use sed to put . on the end of each line.
TSV uses RDF syntax for terms.
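The sed suggestion works because SPARQL TSV serialises each binding in RDF term syntax, so turning a three-column row into N-Triples is just replacing tabs with spaces and appending " .". A toy sketch of that transformation in Java (names illustrative), assuming the SELECT projects exactly ?s ?p ?o in triple order and literals contain no raw newlines:

```java
import java.util.List;
import java.util.stream.Collectors;

public class TsvToNt {
    // One SPARQL TSV result row (terms already in RDF syntax) -> one N-Triples line.
    static String toNtriple(String tsvLine) {
        return tsvLine.replace('\t', ' ') + " .";
    }

    public static void main(String[] args) {
        List<String> tsv = List.of(
            "?s\t?p\t?o",                               // TSV header row
            "<http://ex/s>\t<http://ex/p>\t\"a\\nb\""); // newline escaped as \n in the term
        String nt = tsv.stream()
                       .skip(1)                         // drop the header
                       .map(TsvToNt::toNtriple)
                       .collect(Collectors.joining("\n"));
        System.out.println(nt);
        // prints: <http://ex/s> <http://ex/p> "a\nb" .
    }
}
```

This is exactly where the multiline-literal concern raised below matters: the trick is only safe if the writer escapes newlines inside literals rather than emitting them raw.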
Nice option, but this would only work for templates producing a single triple pattern, I think. In cases like
CONSTRUCT {
?s :p1 ?o1 ;
:p2 ?o2 .
?o1 a :A .
} WHERE {
....
}
we have to cope with bindings with more than 3 variables and/or with the fixed properties missing. But TARQL is fine; it can read from a stream.
Not really - put a UNION for each s/p/o to generate and use LATERAL.
AFAIK the problem with TSV is multiline literals (?); with those you cannot just add . to the end of each line...
> (?)

Did you check? :grey_question:
I just tried it on a simple example and Jena does not output multiline Turtle by default; it uses "...\n", so I guess TSV should be fine.
Jena doesn't, not even an option. It would break the TSV format.
Version
4.7.0-SNAPSHOT
Feature
As far as I can tell, neither tdbquery nor Fuseki allows streaming the CONSTRUCT results, even though the API exists. Of course there are concerns about duplicate triples etc., but it might be a nice and heap-space-conserving optional function.
Are you interested in contributing a solution yourself?
No response