SemanticComputing / fuseki-docker

Apache Jena Fuseki with SeCo extensions
MIT License

Docker build hangs trying to ingest JSON-LD file #17

Closed stuchalk closed 1 year ago

stuchalk commented 1 year ago

I have over 100,000 JSON-LD files and am trying to load them into TDB. Following the congress-legislators example Dockerfile, I tried to load one of the JSON-LD files as a test, and the loader hung. What am I doing wrong?

My Dockerfile:

FROM secoresearch/fuseki
ADD jsonld/ijt_221221.gz /tmp/jsonld/
RUN $TDBLOADER /tmp/jsonld/ijt_v1/s10765-019-2537-x_2.jsonld \
&& $TEXTINDEXER \
&& $TDBSTATS --graph urn:x-arq:UnionGraph > /tmp/stats.opt \
&& mv /tmp/stats.opt /fuseki-base/databases/tdb/

Docker build output:

tbd hung

Thanks. Stuart

PS Any suggestions about loading such a large set of JSON-LD files?

yoge1 commented 1 year ago

On my machine (linux/amd64), I'm able to load the s10765-019-2537-x_2.jsonld file with slight modifications to the Dockerfile you provided.

Dockerfile:

FROM secoresearch/fuseki
ADD --chown=9008 jsonld/ijt_221221.zip /tmp/jsonld/
RUN unzip /tmp/jsonld/ijt_221221.zip -d /tmp/jsonld
RUN $TDBLOADER /tmp/jsonld/ijt_v1/s10765-019-2537-x_2.jsonld \
&& $TEXTINDEXER \
&& $TDBSTATS --graph urn:x-arq:UnionGraph > /tmp/stats.opt \
&& mv /tmp/stats.opt /fuseki-base/databases/tdb/

Before that, to get the JSON-LD data, I ran: git clone https://github.com/chalklab/Dataset-NIST-TRC-JSONLD.git .

Are you sure the gz file jsonld/ijt_221221.gz contains the JSON-LD file s10765-019-2537-x_2.jsonld? Also, don't you need to decompress the gz file first if you are trying to load a file named s10765-019-2537-x_2.jsonld?
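
One way to check both points is to list what ADD actually left in the image before the loader runs. This is only a sketch: it assumes the archive ends up unpacked somewhere under /tmp/jsonld/ and that `find` and `head` are available in the image.

```dockerfile
FROM secoresearch/fuseki
ADD jsonld/ijt_221221.gz /tmp/jsonld/
# Debug step (not part of the final image recipe): print the first few
# .jsonld paths that exist after ADD, so the path passed to $TDBLOADER
# can be compared against them.
RUN find /tmp/jsonld -name '*.jsonld' | head -n 5
```

With BuildKit the output of RUN steps is collapsed by default; building with `docker build --progress=plain .` prints it in full.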

For bulkloading very large datasets you might want to check Jena's TDB xloader (which isn't currently available as a command in this secoresearch/fuseki container image, but could be easily added): https://jena.apache.org/documentation/tdb/tdb-xloader.html
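
Short of xloader, here is a rough sketch of loading a whole directory with the loader that is already in the image. It assumes that all archives unpack under /tmp/jsonld/, that the command behind $TDBLOADER accepts several file arguments per call (as Jena's tdbloader does), and that the image's `find` and `xargs` support `-print0` and `-0`. I haven't tested it at this scale.

```dockerfile
FROM secoresearch/fuseki
ADD jsonld/ijt_221221.gz /tmp/jsonld/
# Hand the files to the loader in batches via xargs; a plain *.jsonld glob
# could exceed the argument-length limit with 100,000+ files.
RUN find /tmp/jsonld -name '*.jsonld' -print0 | xargs -0 $TDBLOADER \
 && $TEXTINDEXER \
 && $TDBSTATS --graph urn:x-arq:UnionGraph > /tmp/stats.opt \
 && mv /tmp/stats.opt /fuseki-base/databases/tdb/
```

Note that xargs may start the loader several times, each run adding to the same TDB database, which is likely slower than a single bulk load.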

stuchalk commented 1 year ago

Thanks very much for your reply. I have played around with your code above but have not had time to fully test. Yes, the zip files need to get decompressed before the individual files can be accessed, but that is done automatically if the file in the ADD command is .gz (and others). See https://docs.docker.com/engine/reference/builder/#add.

yoge1 commented 1 year ago

Thanks. I didn't know about the automatic decompression of identity, gzip, bzip2 or xz files when the file is copied with the ADD command, thanks for the info!

Now that your Git repo also contains the .gz files, your original Dockerfile builds on my machine in less than 20 seconds. I just had to correct the path in the TDBLOADER command from /tmp/jsonld/ijt_v1/s10765-019-2537-x_2.jsonld to /tmp/jsonld/s10765-019-2537-x_2.jsonld.
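
For reference, this is the original Dockerfile with only that path changed, put together from the snippets in this thread:

```dockerfile
FROM secoresearch/fuseki
# ADD unpacks the archive, so the .jsonld files land directly in /tmp/jsonld/
ADD jsonld/ijt_221221.gz /tmp/jsonld/
# Path corrected: no ijt_v1/ subdirectory
RUN $TDBLOADER /tmp/jsonld/s10765-019-2537-x_2.jsonld \
 && $TEXTINDEXER \
 && $TDBSTATS --graph urn:x-arq:UnionGraph > /tmp/stats.opt \
 && mv /tmp/stats.opt /fuseki-base/databases/tdb/
```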

I'm closing the issue – please do re-open this issue if the problem persists.

stuchalk commented 1 year ago

Thanks for the comment and feedback that it works if I change the path. Really appreciate that Semantic Computing is making this image available, especially the recent update to the latest version of Fuseki.