GeoscienceAustralia / strat-ontology-graph-API

1 stars 5 forks source link

Convert spreadsheet to RDF #6

Open dr-shorthair opened 4 years ago

dr-shorthair commented 4 years ago

Document the conversion of the Excel spreadsheet data to RDF

ping @wjfrancis

dr-shorthair commented 4 years ago

See

Excel spreadsheet, with additional sheet with Formulae to convert sheet 1 to 'turtle' https://github.com/GeoscienceAustralia/strat-ontology-graph-API/blob/master/strat_graph_data/STRAT_RELATIONS_extended_nogeom_ages_20200611.xlsx

RDF data, cleaned with subsequent SPARQL operations as explained in Ontology comment https://github.com/GeoscienceAustralia/strat-ontology-graph-API/blob/master/strat_graph_data/STRAT_RELATIONS_extended_nogeom_ages_20200611.ttl

dr-shorthair commented 4 years ago

For review @wjfrancis and @jyucsiro

wjfrancis commented 4 years ago

@dr-shorthair Hi Simon, thanks for sharing the files. However, I'm getting a 404 on the xlsx. Perhaps erased with a pull request? If you could post again that would be much appreciated.

jyucsiro commented 4 years ago

i moved it to https://github.com/GeoscienceAustralia/strat-ontology-graph-API/tree/master/source_data

/strat_graph_data should only contain RDF data (otherwise graphDB complains)...

wjfrancis commented 4 years ago

@jyucsiro Thanks!

jyucsiro commented 4 years ago

@dr-shorthair @wjfrancis been looking at exploring the relationship in the output TTL file. it seems like the geometry URI included in the output TTL file uses the ID for stratnames.

e.g. STRATNAME STRATNO ORIGINATOR LITHOLOGY_GROUP STRAT_AGE_EVENT TOPSTRAT_AGE_EVENT MINAGENAME MAXAGENAME PROVINCE_ENO RELATED STRATNAME RELATED_STRATNO RELATION_TYPE
Lady Loretta Formation 10002 R.J.Cavaney carbonate sedimentary rock deposition deposition Statherian Statherian 556523 Shady Bore Quartzite 16787 overlies

is converted into...

stratnames:u10002
  a strat:Unit ;
  strat:bottomAge isc:Statherian ;
  strat:formationProcess proc:deposition ;
  strat:lithology lith:carbonate_sedimentary_rock ;
  strat:qualifiedRelation stratnames:u10002overliesu16787 ;
  strat:qualifiedRelation stratnames:u10002overliesu24472 ;
  strat:qualifiedRelation stratnames:u10002underliesu6282 ;
  strat:relation stratnames:p467256 ;
  strat:relation stratnames:p556523 ;
  strat:relation stratnames:p559187 ;
  strat:relation stratnames:p559201 ;
  strat:terminationProcess proc:deposition ;
  strat:topAge isc:Statherian ;
  strat:wasDefinedBy <http://pid.geoscience.gov.au/agent/RJCavaney> ;
  dcterms:identifier "10002" ;
  dcterms:publisher <https://ror.org/04ge02x20> ;
  geo:hasGeometry <https://gds.loci.cat/geometry/stratnames/u10002> ;
  rdfs:label "Lady Loretta Formation" ;
.

Should we use the PROVINCE_ENO value as the basis for the Geometry URI ?

wjfrancis commented 4 years ago

@jyucsiro @dr-shorthair Yes, that's right, so in this case https://gds.loci.cat/geometry/stratnames/u10002 could be replaced with https://gds.loci.cat/geometry/province/556523

jyucsiro commented 4 years ago

ok - i'm not sure how the ttl was converted. @dr-shorthair would you be able to run that conversion again with PROVINCE_ENO value as the basis for the Geometry URI ?

wjfrancis commented 4 years ago

@jyucsiro @dr-shorthair the property hasGeometry is a fudge here though. The property would be more precisely to the the province entity and then the province would have the hasGeometry property (maybe something to address at a later stage).

wjfrancis commented 4 years ago

@jyucsiro I could do this I think as a test run through of @dr-shorthair 's xlsx to rdf template

jyucsiro commented 4 years ago

@wjfrancis the sparql queries applied were:

replaced agents/blank-nodes with named resources using

INSERT { ?pid a prov:Agent ; rdfs:label ?n . ?c strat:wasDefinedBy ?pid . }
WHERE {
   ?c a strat:Unit ;
         strat:wasDefinedBy ?b .
    ?b rdfs:label ?n .
    BIND( REPLACE( ?n , \"[., ()&]\", \"\") AS ?id )
    BIND( IRI(CONCAT( \"http://pid.geoscience.gov.au/agent/\", ?id)) AS ?pid )
}

DELETE { ?c strat:wasDefinedBy ?b .  }
WHERE {
   ?c a strat:Unit ;
        strat:wasDefinedBy ?b .
    FILTER isblank(?b)
}

Convert relations into non-blank-nodes

INSERT { ?u1u2 a strat:Relationship ; 
   strat:relation ?u2 ;
   strat:relationshipType ?t .
   ?u1 strat:qualifiedRelation ?u1u2 . }
WHERE {
   ?u1 a strat:Unit ;
      strat:qualifiedRelation ?b .
   ?b strat:relation ?u2 ;
      strat:relationshipType ?t .
   ?u1 dcterms:identifier ?i1 .
   ?u2 dcterms:identifier ?i2 .
   BIND(  STRAFTER( STR(?t) , \"http://pid.geoscience.gov.au/def/stratname/relation/\" ) AS ?t12 )
   BIND( IRI(CONCAT( \"http://pid.geoscience.gov.au/dataset/stratnames/\", \"u\" , ?i1 , ?t12 , \"u\" , ?i2 ) ) AS ?u1u2 )
}

DELETE { ?b  ?p ?o . }
WHERE {
   ?b a strat:Relationship ; 
      ?p ?o .
   FILTER isblank (?b)
}
dr-shorthair commented 4 years ago

I consolidated all the rows relating to each unit, so you see there are multiple relations (to provinces) and qualified-relations (to other units) in the descriptions. This means that each unit is related to multiple provinces.

OTOH each qualified-relation is linked to a single province, so perhaps that is the first-class item in the dataset, rather than the unit?

dr-shorthair commented 4 years ago

And yes, it looks like the Province needs to be added as a separate first-class object, which carries the geometry rather than the Unit. That one would be is easy ...

dr-shorthair commented 4 years ago

I had not understood that the only geometries that we had were associated with the provinces. ;-) Can you confirm what the nature of the relationship between Unit and Province and Relationship is? looks like

dr-shorthair commented 4 years ago

Re-process data as follows:

  1. remove geometries from units
    DELETE { ?u geo:hasGeometry ?g } 
    WHERE {
    ?u a strat:Unit; geo:hasGeometry ?g
    }
  2. make relationships fully standalone
    INSERT { ?r strat:context ?c } 
    WHERE {
    ?c strat:qualifiedRelation ?r
    }
  3. create standalone provinces
    INSERT { ?p a strat:Province ; dcterms:identifier ?pi ; geo:hasGeometry ?g . } 
    WHERE {
    ?u a strat:Unit ;
        strat:relation ?p . 
    BIND( STRAFTER( STR(?p) , "http://pid.geoscience.gov.au/dataset/stratnames/p") AS ?pi )
    BIND( IRI( CONCAT(  "https://gds.loci.cat/geometry/province/p" , ?pi ) ) AS ?g )
    }

But questions above still need answers.

wjfrancis commented 4 years ago

Thanks @dr-shorthair. For this subset of data we have

Unit -> Province 1 to many Unit -> Relation 1 to many and the inverse No direct relationship between Relation and Province

image

In the broader set of province records in the GA database the Province to Unit will be Many to 0. However we can focus on this subset (many to 1) as its closer to the future data model we'll use later.

wjfrancis commented 4 years ago

@dr-shorthair @jyucsiro Will need some further assistance with completing the RDF conversion - currently can follow the xlsx formulas and post processing is clear - however in the initial load into triple store I'm missing something

wjfrancis commented 4 years ago

Attempt at conversion looks a bit like this

@prefix data: http://linked.data.gov.au/def/datatype/ . @prefix dcterms: http://purl.org/dc/terms/ . @prefix geo: http://www.opengis.net/ont/geosparql# . @prefix isc: http://resource.geosciml.org/classifier/ics/ischart/ . @prefix lith: http://pid.geoscience.ga.gov.au/def/stratname/lithology/ . @prefix owl: http://www.w3.org/2002/07/owl# . @prefix proc: http://pid.geoscience.ga.gov.au/def/stratname/process/ . @prefix prov: http://www.w3.org/ns/prov# . @prefix rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# . @prefix rdfs: http://www.w3.org/2000/01/rdf-schema# . @prefix rel: http://pid.geoscience.ga.gov.au/def/stratname/relation/ . @prefix skos: http://www.w3.org/2004/02/skos/core# . @prefix strat: http://pid.geoscience.ga.gov.au/def/stratname# . @prefix stratnames: http://pid.geoscience.ga.gov.au/dataset/stratnames/ . @prefix time: http://www.w3.org/2006/time# . @prefix xsd: http://www.w3.org/2001/XMLSchema# .

stratnames:u27732 a strat:Unit ; rdfs:label "Campaspe Formation" ; strat:wasDefinedBy [ rdfs:label "" ; ] ; strat:lithology lith:medium-coarse_grained_siliciclastic_rock ; strat:formationProcess proc:deposition ; strat:terminationProcess proc:deposition ; strat:topAge isc:Pleistocene ; strat:bottomAge isc:Pliocene ; strat:relation stratnames:p20430 ; strat:qualifiedRelation [ rdf:type strat:Relationship ; strat:relation stratnames:u24044 ; strat:relationshipType rel:underlies ; ] ; geo:hasGeometry https://gds.loci.cat/geometry/province/u20430 ; dcterms:identifier "27732" ; dcterms:publisher https://ror.org/04ge02x20 ; . stratnames:u27732 a strat:Unit ; rdfs:label "Campaspe Formation" ; strat:wasDefinedBy [ rdfs:label "" ; ] ; strat:lithology lith:medium-coarse_grained_siliciclastic_rock ; strat:formationProcess proc:deposition ; strat:terminationProcess proc:deposition ; strat:topAge isc:Pleistocene ; strat:bottomAge isc:Pliocene ; strat:relation stratnames:p20477 ; strat:qualifiedRelation [ rdf:type strat:Relationship ; strat:relation stratnames:u24044 ; strat:relationshipType rel:underlies ; ] ; geo:hasGeometry https://gds.loci.cat/geometry/province/u20477 ; dcterms:identifier "27732" ; dcterms:publisher https://ror.org/04ge02x20 ; . stratnames:u27732 a strat:Unit ; rdfs:label "Campaspe Formation" ; strat:wasDefinedBy [ rdfs:label "" ; ] ; strat:lithology lith:medium-coarse_grained_siliciclastic_rock ; strat:formationProcess proc:deposition ; strat:terminationProcess proc:deposition ; strat:topAge isc:Pleistocene ; strat:bottomAge isc:Pliocene ; strat:relation stratnames:p408346 ; strat:qualifiedRelation [ rdf:type strat:Relationship ; strat:relation stratnames:u24044 ; strat:relationshipType rel:underlies ; ] ; geo:hasGeometry https://gds.loci.cat/geometry/province/u408346 ; dcterms:identifier "27732" ; dcterms:publisher https://ror.org/04ge02x20 ; . stratnames:u13235 a strat:Unit ; rdfs:label "Mount Windsor Volcanics" ; strat:wasDefinedBy [ rdfs:label "" ; ] ; strat:lithology lith:acid_volcanic_rock ; strat:formationProcess proc:deposition ; strat:terminationProcess proc:deposition ; strat:topAge isc:EarlyOrdovician ; strat:bottomAge isc:LateCambrian ; strat:relation stratnames:p559188 ; strat:qualifiedRelation [ rdf:type strat:Relationship ; strat:relation stratnames:u24044 ; strat:relationshipType rel:overlies ; ] ; geo:hasGeometry https://gds.loci.cat/geometry/province/u559188 ; dcterms:identifier "13235" ; dcterms:publisher https://ror.org/04ge02x20 ; . stratnames:u13235 a strat:Unit ; rdfs:label "Mount Windsor Volcanics" ; strat:wasDefinedBy [ rdfs:label "" ; ] ; strat:lithology lith:acid_volcanic_rock ; strat:formationProcess proc:deposition ; strat:terminationProcess proc:deposition ; strat:topAge isc:EarlyOrdovician ; strat:bottomAge isc:LateCambrian ; strat:relation stratnames:p20497 ; strat:qualifiedRelation [ rdf:type strat:Relationship ; strat:relation stratnames:u24044 ; strat:relationshipType rel:overlies ; ] ; geo:hasGeometry https://gds.loci.cat/geometry/province/u20497 ; dcterms:identifier "13235" ; dcterms:publisher https://ror.org/04ge02x20 ; . stratnames:u23944 a strat:Unit ; rdfs:label "Rollston Range Formation" ; strat:wasDefinedBy [ rdfs:label "R.A.Henderson" ; ] ; strat:lithology lith:siliciclastic_sedimentary_rock ; strat:formationProcess proc:deposition ; strat:terminationProcess proc:deposition ; strat:topAge isc:EarlyOrdovician ; strat:bottomAge isc:LateCambrian ; strat:relation stratnames:p559188 ; strat:qualifiedRelation [ rdf:type strat:Relationship ; strat:relation stratnames:u24044 ; strat:relationshipType rel:underlies ; ] ; geo:hasGeometry https://gds.loci.cat/geometry/province/u559188 ; dcterms:identifier "23944" ; dcterms:publisher https://ror.org/04ge02x20 ; .

dr-shorthair commented 4 years ago

(This might be worth a twirl - https://tarql.github.io/ )

dr-shorthair commented 4 years ago

Are you getting a error message?

wjfrancis commented 4 years ago

@dr-shorthair Thanks! TARQL looks interesting.

I had a few syntax errors and needed to get rid of some additional " etc that excel put in place. I think I've got it sorted now though.

@jyucsiro Updated strat relations ttl (exported from GraphDB post processed with SPARQL queries) available here: https://transfer-temp-scidata.s3-ap-southeast-2.amazonaws.com/STRAT_RELATIONS_POST_PROCEESSED_20200626.ttl

Let me know if that looks right. (It passed the ages sample SPARQL sent through by Simon previously)

jyucsiro commented 4 years ago

@wjfrancis it doesn't look like your converting process worked. For example, the strat unit properties aren't listed amoung other things. Will use what's in the repository for now.

dr-shorthair commented 4 years ago

I did the conversion yesterday https://github.com/GeoscienceAustralia/strat-ontology-graph-API/issues/6#issuecomment-649321348 The revised TTL is in the repo - see https://github.com/GeoscienceAustralia/strat-ontology-graph-API/blob/master/strat_graph_data/STRAT_RELATIONS_extended_nogeom_ages_20200611.ttl Also a minor update to the ontology https://github.com/GeoscienceAustralia/strat-ontology-graph-API/blob/master/rdf/stratname.ttl

jyucsiro commented 4 years ago

@dr-shorthair your ttl file works well. we've been using in tests. i had to also refactor some bits of the REST API but all good now.