apache / jena

Apache Jena
https://jena.apache.org/
Apache License 2.0

How to import .trig files with named graphs to fuseki-geosparql #1791

Closed navarral closed 1 year ago

navarral commented 1 year ago

Version

4.7.0

Question

Hi, I am trying to load a .trig file that has a named graph into fuseki-geosparql, but the file does not load correctly, resulting in an empty dataset. However, when I load the file in fuseki-server it works.

Do you have any advice on what to include in the command to make it work for fuseki-geosparql?

Please find below the necessary information to reproduce the error,

I have downloaded jena-fuseki-geosparql-4.7.0.jar from https://repo1.maven.org/maven2/org/apache/jena/jena-fuseki-geosparql/4.7.0/ .

I am using the following command: java -jar jena-fuseki-geosparql-4.7.0.jar -rf "test-geosparql.trig" -i

Error I get is:

[2023-02-27 16:57:47] DatasetOperations INFO Reading RDF - Started - File: test-geosparql.trig, Graph Name: , RDF Format: Turtle/pretty
[2023-02-27 16:57:47] riot WARN Only triples or default graph data expected : named graph data ignored
[2023-02-27 16:57:47] DatasetOperations INFO Reading RDF - Completed - File: test-geosparql.trig, Graph Name: , RDF Format: Turtle/pretty
[2023-02-27 16:57:48] GeoSPARQLOperations INFO Applying GeoSPARQL Schema - Started
[2023-02-27 16:57:48] GeoSPARQLOperations INFO GeoSPARQL schema not applied to empty graph: default
[2023-02-27 16:57:48] GeoSPARQLOperations INFO Applying GeoSPARQL Schema - Completed
[2023-02-27 16:57:48] DatasetOperations WARN Dataset empty. Spatial Index not constructed. Server will require restarting after adding data and any updates to build Spatial Index.

The test-geosparql.trig is the following (I have removed most of the triples just to leave a minimal example):

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix qb:   <http://purl.org/linked-data/cube#> .

<https://example.org/graph-A> {
<https://example.org/dataset-1>
    a           qb:DataSet ;
    dct:title   "Test Dataset"@en ;
.
}

rvesse commented 1 year ago

This looks like a limitation of the code, hopefully @galbiston can confirm this.

The code looks to assume that each RDF file is read into a specific model and doesn't properly support reading an entire dataset from a TriG file AFAICT - https://github.com/apache/jena/blob/64f96d76324c615a7a98ec96d6f5d97168141359/jena-fuseki2/jena-fuseki-geosparql/src/main/java/org/apache/jena/fuseki/geosparql/DatasetOperations.java#L147-L162

A workaround would be to create a TDB dataset with other Jena commands first and then use the --tdb option to point to that instead

rvesse commented 1 year ago

Also seems like the code assumes a default format of Turtle regardless of the file format unless you explicitly declare a format (but even then the above noted code would still cause this use case not to work)

https://github.com/apache/jena/blob/64f96d76324c615a7a98ec96d6f5d97168141359/jena-fuseki2/jena-fuseki-geosparql/src/main/java/org/apache/jena/fuseki/geosparql/cli/RDFFileParameter.java#L53-L59

galbiston commented 1 year ago

As @rvesse has noted, the file reading relies on the graph name being specified on the command line rather than read from the file.

It should be possible to set the graph name using the arg:

-rf test-geosparql.trig#https://example.org/graph-A&trig

There is also this warning from RIOT though:

[2023-02-27 16:57:47] riot WARN Only triples or default graph data expected : named graph data ignored

Setting up the data in TDB first is another option and then launching with -t arg to use the TDB dataset: -t <my_tdb> for TDB or -t <my_tdb> -t2 for TDB2. There are several options that may be wanted on first use to add additional GeoSPARQL triples.

navarral commented 1 year ago

@rvesse @galbiston thanks for looking into it!

I tried the following, but it does not load the test-geosparql.trig file correctly:

  1. I loaded a sample RDF file that I know works, to set up a dataset in "TestTDB":

java -jar jena-fuseki-geosparql-4.7.0.jar -rf "EU-nuts-rdf-geosparql.ttl" -t "TestTDB" -i

  2. I tested a GeoSPARQL query and it works just fine. Then I tried to load the .trig file following your guidance:

java -jar jena-fuseki-geosparql-4.7.0.jar -rf "test-geosparql.trig#https://example.org/graph-A&trig" -t "TestTDB" -i

but the data does not load. I ran a query and it only finds the data from the first file (step 1). I have also opened http://localhost:3030/ds to export all the data, and the .trig data is not there.

[2023-03-09 10:32:26] Main INFO Arguments Received: [-rf, test-geosparql.trig#https://example.org/graph-A&trig, -t, TestTDB]
[2023-03-09 10:32:26] DatasetOperations INFO Server Configuration: port=3030, datsetName=ds, loopbackOnly=true, updateAllowed=false, inference=false, applyDefaultGeometry=false, validateGeometryLiteral=false, convertGeoPredicates=false, removeGeoPredicates=false, queryRewrite=true, tdbFile=TestTDB, fileGraphFormats=[FileGraphFormat{rdfFile=test-geosparql.trig, graphName=https://example.org/graph-A&trig, rdfFormat=Turtle/pretty}], fileGraphDelimiters=[], indexEnabled=true, indexSizes=[-1, -1, -1], indexExpiries=[5000, 5000, 5000], spatialIndexFile=null, tdb2=false, transformGeometry=true, help=false
[2023-03-09 10:32:26] DatasetOperations INFO TDB Dataset: TestTDB, TDB2: false
[2023-03-09 10:32:26] system WARN The “SIS_DATA” environment variable is not set.
[2023-03-09 10:32:27] DatasetOperations INFO Reading RDF - Started - File: test-geosparql.trig, Graph Name: https://example.org/graph-A&trig, RDF Format: Turtle/pretty
[2023-03-09 10:32:27] riot WARN Only triples or default graph data expected : named graph data ignored
[2023-03-09 10:32:27] DatasetOperations INFO Reading RDF - Completed - File: test-geosparql.trig, Graph Name: https://example.org/graph-A&trig, RDF Format: Turtle/pretty

  3. I tried specifying the TriG format with ">trig", but I got the same result:

java -jar jena-fuseki-geosparql-4.7.0.jar -rf "test-geosparql.trig#https://example.org/graph-A&trig>trig" -t "TestTDB" -i

[2023-03-09 10:31:31] Main INFO Arguments Received: [-rf, test-geosparql.trig#https://example.org/graph-A&trig>trig, -t, TestTDB]
[2023-03-09 10:31:31] DatasetOperations INFO Server Configuration: port=3030, datsetName=ds, loopbackOnly=true, updateAllowed=false, inference=false, applyDefaultGeometry=false, validateGeometryLiteral=false, convertGeoPredicates=false, removeGeoPredicates=false, queryRewrite=true, tdbFile=TestTDB, fileGraphFormats=[FileGraphFormat{rdfFile=test-geosparql.trig, graphName=https://example.org/graph-A&trig, rdfFormat=TriG/pretty}], fileGraphDelimiters=[], indexEnabled=true, indexSizes=[-1, -1, -1], indexExpiries=[5000, 5000, 5000], spatialIndexFile=null, tdb2=false, transformGeometry=true, help=false
[2023-03-09 10:31:31] DatasetOperations INFO TDB Dataset: TestTDB, TDB2: false
[2023-03-09 10:31:31] system WARN The “SIS_DATA” environment variable is not set.
[2023-03-09 10:31:31] DatasetOperations INFO Reading RDF - Started - File: test-geosparql.trig, Graph Name: https://example.org/graph-A&trig, RDF Format: TriG/pretty
[2023-03-09 10:31:31] riot WARN Only triples or default graph data expected : named graph data ignored
[2023-03-09 10:31:31] DatasetOperations INFO Reading RDF - Completed - File: test-geosparql.trig, Graph Name: https://example.org/graph-A&trig, RDF Format: TriG/pretty

Do you have any further workarounds or command options that I could try?

Many thanks,
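Comparing the two logged configurations above: with `&trig` the graph name is logged as `https://example.org/graph-A&trig` and the format stays `Turtle/pretty`, while adding `>trig` switches the format to `TriG/pretty` but leaves `&trig` inside the graph name. A minimal Python sketch of a splitting rule that reproduces those logged values is shown below. It is a hypothetical model inferred only from these logs, not the actual Jena parsing code, and `split_rdf_file_arg` is an illustrative name.

```python
from typing import Optional, Tuple


def split_rdf_file_arg(arg: str) -> Tuple[str, str, Optional[str]]:
    """Hypothetical model of how an -rf value seems to be split.

    Assumption (from the logs only): '#' separates file and graph name,
    '>' separates the RDF format, and '&' is not treated as a separator.
    """
    rest, rdf_format = arg, None
    if ">" in rest:
        # Everything after the last '>' is taken as the format token.
        rest, rdf_format = rest.rsplit(">", 1)
    path, graph_name = rest, ""
    if "#" in rest:
        # Everything after the first '#' is taken as the graph name.
        path, graph_name = rest.split("#", 1)
    return path, graph_name, rdf_format


# The two arguments tried in this comment:
print(split_rdf_file_arg("test-geosparql.trig#https://example.org/graph-A&trig"))
print(split_rdf_file_arg("test-geosparql.trig#https://example.org/graph-A&trig>trig"))
```

A format of `None` here stands for the fallback that the log reports as `Turtle/pretty`. Under this model, dropping `&trig` would give a clean graph name, but the RIOT warning in the second log suggests that named-graph data inside the file is still ignored even when the format is TriG.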

galbiston commented 1 year ago

Sorry, by suggesting setting up the TDB first I meant using tdbloader to import the *.trig file and create the TDB dataset, then running GeoSPARQL Fuseki pointing at that new dataset. This should give better support for file loading.
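The two-step workflow described here can be sketched as a small Python helper that builds both command lines from one database location, so the loader and the server are guaranteed to point at the same directory. The command names and flags (`bin/tdbloader --loc`, `java -jar jena-fuseki-geosparql-4.7.0.jar -t`) are the ones used elsewhere in this thread; the helper name `build_commands` and the example paths are illustrative assumptions.

```python
import shlex
from typing import List, Tuple


def build_commands(tdb_dir: str, trig_file: str,
                   jar: str = "jena-fuseki-geosparql-4.7.0.jar") -> Tuple[List[str], List[str]]:
    """Return (load, serve) argument lists that share one TDB location."""
    # Step 1: bulk-load the TriG file (including named graphs) into TDB.
    load = ["bin/tdbloader", "--loc", tdb_dir, trig_file]
    # Step 2: start GeoSPARQL Fuseki on the dataset created in step 1.
    serve = ["java", "-jar", jar, "-t", tdb_dir]
    return load, serve


# Print the commands instead of running them; paths are placeholders.
for command in build_commands("/path/for/database/TestTDB", "test-geosparql.trig"):
    print(shlex.join(command))
```

Passing an absolute path for the database location avoids any dependence on the directory from which each command is started.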

navarral commented 1 year ago

Apologies but I am new to Jena and I might have missed something when following the tdbloader documentation. I have tried the following options:

  1. Set up a dataset with geosparql-fuseki and then load with tdbloader:

bin/tdbloader --loc /path/for/database/TestTDB test-geosparql.trig

11:13:13 INFO loader :: -- Start triples data phase
11:13:13 INFO loader :: Load into triples table with existing data
11:13:13 INFO loader :: -- Start quads data phase
11:13:13 INFO loader :: Load into quads table with existing data
11:13:13 INFO loader :: Load: test-geosparql.trig -- 2023/03/09 11:13:13 GMT
11:13:13 INFO loader :: -- Finish triples data phase
11:13:13 INFO loader :: -- Finish quads data phase
11:13:13 INFO loader :: Data: 2 quads loaded in 0.25 seconds [Rate: 7.97 per second]
11:13:13 INFO loader :: -- Start quads index phase
11:13:13 INFO loader :: -- Finish quads index phase
11:13:13 INFO loader :: -- Finish triples load
11:13:13 INFO loader :: -- Finish quads load
11:13:13 INFO loader :: Completed: 2 quads loaded in 0.26 seconds [Rate: 7.78 per second]

But it does not seem to load correctly, as I cannot see the data when I start the server with java -jar jena-fuseki-geosparql-4.7.0.jar -t "TestTDB"

  2. Set up a TDB dataset with bin/tdb1.xloader --loc /path/for/database/TestTDB2 test-geosparql.trig

11:18:49 INFO -- TDB1 Bulk Loader Start
11:18:49 INFO Data Load Phase
11:18:49 INFO Got 1 data files to load
11:18:49 INFO Data file 1: /path/for/file/test-geosparql.trig
11:18:50 INFO loader :: Total: 2 tuples : 0.18 seconds : 10.87 tuples/sec [2023/03/09 11:18:50 GMT]
11:18:50 INFO Data Load Phase Completed
11:18:50 INFO Index Building Phase
11:18:50 INFO Creating Index GSPO
11:18:50 INFO Sort GSPO
11:18:50 INFO Sort GSPO Completed
11:18:50 INFO Build GSPO
11:18:51 INFO Build GSPO Completed
11:18:51 INFO Creating Index GPOS
11:18:51 INFO Sort GPOS
11:18:51 INFO Sort GPOS Completed
11:18:51 INFO Build GPOS
11:18:52 INFO Build GPOS Completed
11:18:52 INFO Creating Index GOSP
11:18:52 INFO Sort GOSP
11:18:52 INFO Sort GOSP Completed
11:18:52 INFO Build GOSP
11:18:53 INFO Build GOSP Completed
11:18:53 INFO Creating Index SPOG
11:18:53 INFO Sort SPOG
11:18:53 INFO Sort SPOG Completed
11:18:53 INFO Build SPOG
11:18:54 INFO Build SPOG Completed
11:18:54 INFO Creating Index POSG
11:18:54 INFO Sort POSG
11:18:54 INFO Sort POSG Completed
11:18:54 INFO Build POSG
11:18:54 INFO Build POSG Completed
11:18:54 INFO Creating Index OSPG
11:18:54 INFO Sort OSPG
11:18:54 INFO Sort OSPG Completed
11:18:54 INFO Build OSPG
11:18:55 INFO Build OSPG Completed
11:18:55 INFO Index Building Phase Completed
11:18:55 INFO -- TDB1 Bulk Loader Finish
11:18:55 INFO -- 6 seconds

But then, when I run java -jar jena-fuseki-geosparql-4.7.0.jar -t "TestTDB2" I get the following error

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
[2023-03-09 11:26:55] Main INFO Arguments Received: [-t, TestTDB2]
[2023-03-09 11:26:55] DatasetOperations INFO Server Configuration: port=3030, datsetName=ds, loopbackOnly=true, updateAllowed=false, inference=false, applyDefaultGeometry=false, validateGeometryLiteral=false, convertGeoPredicates=false, removeGeoPredicates=false, queryRewrite=true, tdbFile=TestTDB2, fileGraphFormats=[], fileGraphDelimiters=[], indexEnabled=true, indexSizes=[-1, -1, -1], indexExpiries=[5000, 5000, 5000], spatialIndexFile=null, tdb2=false, transformGeometry=true, help=false
[2023-03-09 11:26:55] DatasetOperations INFO TDB Dataset: TestTDB2, TDB2: false
[2023-03-09 11:26:56] system WARN The “SIS_DATA” environment variable is not set.
[2023-03-09 11:26:56] GeoSPARQLOperations INFO Find Mode SRS - Started
[2023-03-09 11:26:56] GeoSPARQLOperations INFO Find Mode SRS - Completed
[2023-03-09 11:26:56] Main ERROR GeoSPARQL Server: Exiting - No SRS found. Check 'http://www.opengis.net/ont/geosparql#hasSerialization' or 'http://www.w3.org/2003/01/geo/wgs84_pos#lat'/'http://www.w3.org/2003/01/geo/wgs84_pos#lon' predicates are present in the source data. Hint: Inferencing with GeoSPARQL schema may be required.: ds

  3. I have tried to import the sample RDF file that works together with the .trig file, but I get the same result as in 1.

Am I missing something? Could you help me with the list of commands I should be using to create a TDB dataset and to be used in geosparql-fuseki?

Thanks again for your time

galbiston commented 1 year ago

Just to confirm: is the working directory when running the jena-fuseki-geosparql-4.7.0.jar command the /path/for/database folder? Otherwise you would need to include the path as part of the arg -t "TestTDB2".

Looking at your minimal sample, there are no GeoSPARQL prefixes. The error Main ERROR GeoSPARQL Server: Exiting - No SRS found. is saying that the server cannot find any recognised geospatial information to complete setup and is stopping. You will need to expand your minimal sample to include geospatial information.
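The by-eye check described above (does the file mention any geospatial vocabulary at all?) can be approximated with a short Python sketch. It only looks for the two namespaces named in the "No SRS found" error message and is a rough textual heuristic, not how the server itself decides; the function name `mentions_geo_namespace` is an illustrative assumption.

```python
# Namespaces taken from the predicates listed in the error message.
GEO_NAMESPACES = (
    "http://www.opengis.net/ont/geosparql#",
    "http://www.w3.org/2003/01/geo/wgs84_pos#",
)


def mentions_geo_namespace(rdf_text: str) -> bool:
    """Return True if the raw RDF text mentions a known geospatial namespace.

    A declared prefix is not proof that geometry triples exist, so a True
    result is only a hint; a False result matches the situation above.
    """
    return any(namespace in rdf_text for namespace in GEO_NAMESPACES)


minimal_sample = """
@prefix dct: <http://purl.org/dc/terms/> .
@prefix qb:  <http://purl.org/linked-data/cube#> .
<https://example.org/graph-A> {
  <https://example.org/dataset-1> a qb:DataSet ; dct:title "Test Dataset"@en .
}
"""
print(mentions_geo_namespace(minimal_sample))
```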

SimonBin commented 1 year ago

Everything said here is right. But if you just want to quickly play with geosparql, you can also try @Aklakan 's RdfProcessingToolkit powered by Apache Jena

java -jar rpt-1.9.3-rc3.jar integrate NGraph_GeohiveData.trig  --geoindex --server

then run a geo query:

PREFIX uom: <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX spatial: <http://jena.apache.org/spatial#>
select distinct ?g ?feature ?geo ?geoLabel ?myPoint ("red" as ?myPointColor) {
  values ?myPoint {
    "Point(-6.22675717146472 53.3681724340328)"^^geo:wktLiteral
  }
  graph ?g {
    ?feature spatial:nearbyGeom(?myPoint 3 uom:kilometre) .
    ?feature rdfs:label ?geoLabel ;
             geo:hasGeometry/geo:asWKT ?geo .
  }
}

you can use https://yasgui.triply.cc/ to visualise the shapes

navarral commented 1 year ago

I expanded the minimal example to include geosparql information but it still did not work.

@prefix rdf:            <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:           <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:            <http://www.w3.org/2001/XMLSchema#> .
@prefix owl:            <http://www.w3.org/2002/07/owl#> .
@prefix geo:            <http://www.opengis.net/ont/geosparql#> .

@prefix dct:            <http://purl.org/dc/terms/> .
@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix locn:           <http://www.w3.org/ns/locn#> .

# -- Named Graph --------------
<https://example.org/graph-A> {
# -- Data Set -------------------
<https://example.org/dataset-1>
    a           qb:DataSet , geo:Feature ;
    dct:title       "Test Dataset"@en ;
    locn:geometry   <https://example.org/dataset-1-geo> ;
.

<https://example.org/dataset-1-geo> 
    a       geo:Geometry ;
    geo:asWKT   "POINT(10.5341 42.9391)"^^geo:wktLiteral ;
.
}

@galbiston I confirm that the /path/for/database/ is the same in both of the following commands

bin/tdbloader --loc /path/for/database/TestTDB test-geosparql.trig

07:54:51 INFO loader :: -- Start triples data phase
07:54:51 INFO loader :: Load empty triples table
07:54:51 INFO loader :: -- Start quads data phase
07:54:51 INFO loader :: Load into quads table with existing data
07:54:51 INFO loader :: Load: test-geosparql.trig -- 2023/03/13 07:54:51 GMT
07:54:52 INFO loader :: -- Finish triples data phase
07:54:52 INFO loader :: -- Finish quads data phase
07:54:52 INFO loader :: Data: 6 quads loaded in 0.27 seconds [Rate: 22.39 per second]
07:54:52 INFO loader :: -- Start quads index phase
07:54:52 INFO loader :: -- Finish quads index phase
07:54:52 INFO loader :: -- Finish triples load
07:54:52 INFO loader :: -- Finish quads load
07:54:52 INFO loader :: Completed: 6 quads loaded in 0.28 seconds [Rate: 21.43 per second]

java -jar jena-fuseki-geosparql-4.7.0.jar -t "/path/for/database/TestTDB"

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
[2023-03-13 08:05:08] Main INFO Arguments Received: [-t, TestTDB]
[2023-03-13 08:05:08] DatasetOperations INFO Server Configuration: port=3030, datsetName=ds, loopbackOnly=true, updateAllowed=false, inference=false, applyDefaultGeometry=false, validateGeometryLiteral=false, convertGeoPredicates=false, removeGeoPredicates=false, queryRewrite=true, tdbFile=TestTDB, fileGraphFormats=[], fileGraphDelimiters=[], indexEnabled=true, indexSizes=[-1, -1, -1], indexExpiries=[5000, 5000, 5000], spatialIndexFile=null, tdb2=false, transformGeometry=true, help=false
[2023-03-13 08:05:08] DatasetOperations INFO TDB Dataset: TestTDB, TDB2: false
[2023-03-13 08:05:08] system WARN The “SIS_DATA” environment variable is not set.
[2023-03-13 08:05:08] GeoSPARQLOperations INFO Find Mode SRS - Started
[2023-03-13 08:05:08] GeoSPARQLOperations INFO Find Mode SRS - Completed
[2023-03-13 08:05:08] SpatialIndex INFO Saving Spatial Index - Started: /path/for/database/TestTDB/spatial.index
[2023-03-13 08:05:08] SpatialIndex INFO Saving Spatial Index - Completed: /path/for/database/TestTDB/spatial.index
[2023-03-13 08:05:08] GeosparqlServer INFO GeoSPARQL Server: Running - Port: 3030, Dataset: /ds, Loopback Only: true, Allow Update: false
[2023-03-13 08:05:08] Server INFO Start Fuseki (http=3030)

The .trig file seems to be loaded, but the data is not accessible, as the query returns an empty result:

[2023-03-13 08:05:41] Fuseki INFO [1] POST http://localhost:3030/ds
[2023-03-13 08:05:41] Fuseki INFO [1] Query = PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> PREFIX qb: <http://purl.org/linked-data/cube#> SELECT * { ?s ?p ?o } LIMIT 5
[2023-03-13 08:05:41] Fuseki INFO [1] 200 OK (136 ms)

Am I missing something?

@SimonBin thanks for your answer! I will keep it in mind for certain use cases. However, in this particular one, I am looking for an open-source triplestore with GeoSPARQL support (like Jena) to import ~50 million triples and run GeoSPARQL queries offline, as there is some health data involved. Any advice would be welcome.

I would have preferred to use geosparql-fuseki for the full GeoSPARQL support but if it does not support named graphs I might try to use jena-fuseki with GeoSPARQL enabled. Is there a ready to use fuseki server with GeoSPARQL enabled?

SimonBin commented 1 year ago

@navarral if you are working with named graphs then of course select ?s ?p ?o will turn up empty, since a plain pattern is matched against the default graph only! Try ```select * { graph <https://example.org/graph-A> { ?s ?p ?o } }```
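Why the plain pattern comes back empty can be illustrated with a toy Python model of an RDF dataset as a list of quads, where a graph value of `None` stands for the default graph. This is only a sketch of the matching rule, assuming the default graph is not configured as a union of the named graphs; the `Quad` type and both helper functions are illustrative, and the data is a subset of the sample file from this thread.

```python
from typing import List, NamedTuple, Optional, Tuple


class Quad(NamedTuple):
    graph: Optional[str]  # None stands for the default graph
    s: str
    p: str
    o: str


GRAPH_A = "https://example.org/graph-A"

# Every statement in the sample file sits inside the named graph.
DATASET = [
    Quad(GRAPH_A, "https://example.org/dataset-1", "dct:title", '"Test Dataset"@en'),
    Quad(GRAPH_A, "https://example.org/dataset-1-geo", "geo:asWKT",
         '"POINT(10.5341 42.9391)"^^geo:wktLiteral'),
]


def match_default_graph(quads: List[Quad]) -> List[Tuple[str, str, str]]:
    """Rows for: SELECT * { ?s ?p ?o }"""
    return [(q.s, q.p, q.o) for q in quads if q.graph is None]


def match_named_graph(quads: List[Quad], graph_iri: str) -> List[Tuple[str, str, str]]:
    """Rows for: SELECT * { GRAPH <graph_iri> { ?s ?p ?o } }"""
    return [(q.s, q.p, q.o) for q in quads if q.graph == graph_iri]


print(len(match_default_graph(DATASET)))         # default graph holds nothing
print(len(match_named_graph(DATASET, GRAPH_A)))  # the data lives in graph-A
```

In this model the first call finds no rows because nothing was loaded into the default graph, while the second call finds both statements.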

navarral commented 1 year ago

You are right! I totally missed that, I was using an old script to test the query. Thanks for pointing it out, it is working now.