SmartDataAnalytics / RdfProcessingToolkit

Command line interface based RDF processing toolkit to run sequences of SPARQL statements ad-hoc on RDF datasets, streams of bindings and streams of named graphs with support for processing JSON, CSV and XML using function extensions
https://smartdataanalytics.github.io/RdfProcessingToolkit/

java.lang.OutOfMemoryError with integrate #41

Closed TBoonX closed 1 year ago

TBoonX commented 1 year ago

I was loading a 29 GB ttl file into a tdb2 database and tried to write it back out to a ttl file. I noticed that the output file was still empty at least 10 minutes before the error occurred.

> java  -Xmx6g -jar ~/Documents/rpt.jar integrate --loc /home/kjunghanns/Documents/Coypu/data-sources/baci/db.db --db-engine=tdb2 output/coytradegraph.all.rpt.ttl spo.rq -o=output/coytradegraph.all.rpt.small.ttl

09:16:31 [INFO] [o.a.s.c.m.SparqlIntegrateCmdImpls:196] - Inferred output format from output/coytradegraph.all.rpt.small.ttl: Turtle/pretty
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:270] - Interpreting argument #1: 'output/coytradegraph.all.rpt.ttl'
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:440] - Detected data format: text/turtle
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:467] - A total of 84 prefixes known after processing output/coytradegraph.all.rpt.ttl
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:270] - Interpreting argument #2: 'spo.rq'
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:438] - Argument does not appear to be (RDF) data because content type probing yeld no result
09:16:31 [INFO] [o.a.j.r.s.SparqlScriptProcessor:372] - Preparing SPARQL statement at line 1, column 1
09:16:31 [INFO] [o.a.c.d.RdfDataEngineFactoryTdb2:75] - Created new directory (its content will deleted when done): /home/kjunghanns/Documents/Coypu/data-sources/baci/db.db
09:16:31 [INFO] [o.a.c.d.RdfDataEngineFactoryTdb2:109] - Connecting to TDB2 database in folder /home/kjunghanns/Documents/Coypu/data-sources/baci/db.db
09:16:31 [INFO] [o.a.s.c.m.SparqlIntegrateCmdImpls:622] - Processing output/coytradegraph.all.rpt.ttl
10:40:21 [INFO] [o.a.s.c.m.SparqlIntegrateCmdImpls:622] - Processing spo.rq:1:1

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "HttpClient-1-SelectorManager"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "HttpClient-2-SelectorManager"

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "pool-3-thread-1"
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
10:45:58 [INFO] [o.a.c.d.RdfDataEngineFactoryTdb2:77] - Deleting created directory: /home/kjunghanns/Documents/Coypu/data-sources/baci/db.db

Aklakan commented 1 year ago

Are there many blank nodes in the output? If so, I think they might be what fills up the memory.

Aklakan commented 1 year ago

Ah, and maybe the output format defaults to turtle/pretty, which first collects all data in memory in order to arrange it for pretty printing.

Does it help if you specify --out-format turtle/blocks?
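
To illustrate why a pretty-printing format can exhaust the heap while a block-based one does not, here is a minimal conceptual sketch in Java. It is not the writer code used by Jena or RPT; the class `WriterSketch`, the `Triple` record, the method names and the block size are all made up for illustration, and it assumes a block size of at least 1.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: this is NOT the writer code used by Jena or RPT.
class WriterSketch {
    // Hypothetical minimal triple type; terms are plain strings.
    record Triple(String s, String p, String o) {}

    // Writes the given triples grouped by subject in a Turtle-like layout.
    static void writeGrouped(List<Triple> triples, PrintStream out) {
        Map<String, List<Triple>> bySubject = new LinkedHashMap<>();
        for (Triple t : triples) {
            bySubject.computeIfAbsent(t.s(), k -> new ArrayList<>()).add(t);
        }
        for (Map.Entry<String, List<Triple>> e : bySubject.entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey());
            String sep = " ";
            for (Triple t : e.getValue()) {
                sb.append(sep).append(t.p()).append(' ').append(t.o());
                sep = " ;\n    ";
            }
            out.println(sb.append(" ."));
        }
    }

    // "Pretty" style: holds every triple before anything is written.
    // Returns the peak number of triples held in memory.
    static int writePretty(Iterator<Triple> in, PrintStream out) {
        List<Triple> all = new ArrayList<>();
        in.forEachRemaining(all::add);
        writeGrouped(all, out);
        return all.size();
    }

    // "Blocks" style: holds at most blockSize triples, writes them,
    // then reuses the buffer. Assumes blockSize >= 1.
    static int writeBlocks(Iterator<Triple> in, PrintStream out, int blockSize) {
        List<Triple> block = new ArrayList<>(blockSize);
        int peak = 0;
        while (in.hasNext()) {
            block.add(in.next());
            peak = Math.max(peak, block.size());
            if (block.size() == blockSize) {
                writeGrouped(block, out);
                block.clear();
            }
        }
        writeGrouped(block, out); // remaining partial block, if any
        return peak;
    }

    public static void main(String[] args) {
        List<Triple> data = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            data.add(new Triple("<urn:s" + (i % 2) + ">", "<urn:p>", "\"" + i + "\""));
        }
        int prettyPeak = writePretty(data.iterator(), System.out);
        int blocksPeak = writeBlocks(data.iterator(), System.out, 2);
        System.out.println("# pretty peak: " + prettyPeak + ", blocks peak: " + blocksPeak);
    }
}
```

In this sketch the block-based variant may repeat a subject across blocks, so its output is less compact, but the number of triples held at once is bounded by the block size instead of growing with the dataset.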

Aklakan commented 1 year ago

You most likely also want to use --db-keep while testing so that the fully loaded database is not deleted when the run finishes. Once it works, you can remove the option.

TBoonX commented 1 year ago

Specifying --out-format turtle/blocks resolved the issue. There are no blank nodes in the output.