co-cddo / ukgov-metadata-exchange-model

A metadata model for describing data assets for exchanging between UK government organisations.
https://co-cddo.github.io/ukgov-metadata-exchange-model/

Capture the geospatial coverage of the data resource #1

Open AlasdairGray opened 1 year ago

AlasdairGray commented 1 year ago

Extend the metadata model to enable the specification of the geospatial coverage and resolution of the data asset. The extension must be compliant with GEMINI (GitHub).

DCAT includes the following properties for capturing geospatial coverage and resolution:

AlasdairGray commented 1 year ago

We could consider whether it is sufficient to reuse the spatial geographies defined by ONS, e.g. to say something has full UK coverage then you could use the URI http://statistics.data.gov.uk/id/statistical-geography/K02000001 or for Greater Manchester use http://statistics.data.gov.uk/id/statistical-geography/E11000001.

AlasdairGray commented 1 year ago

We can look at the way that the ONS GSS-Cogs have captured geospatial coverage, see their guidance.

PeterParslow commented 10 months ago

The range of dcterms:spatial is 0 or more dcterms:Location defined as "A spatial region or named place." (https://www.dublincore.org/specifications/dublin-core/dcmi-terms/terms/Location/)

The EU's GeoDCAT Application Profile is in the process of being "standardised"/adopted by the Open Geospatial Consortium. I expect it will continue to suggest that a Location is populated with either:

https://semiceu.github.io/GeoDCAT-AP/drafts/latest/#properties-for-location

At present, most "geospatial" metadata records (e.g. in data.gov.uk) use a bounding box, in spite of its known weaknesses for 'locating' where data is about.
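
One way to see the weakness of a bounding box is to test points against it. The sketch below is illustrative only: it assumes the UK extent quoted later in this thread (-8.45 49.86 to 1.78 60.86, in longitude/latitude order) and uses approximate city coordinates; the helper name `bbox_contains` is made up for this example.

```python
# Illustrative sketch: a bounding box "covers" places the data is not about.
# Box order is (west, south, east, north), longitude/latitude as in CRS84.
UK_BBOX = (-8.45, 49.86, 1.78, 60.86)


def bbox_contains(bbox, lon, lat):
    """Return True if the point lies inside the axis-aligned box."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


# Approximate coordinates, chosen only for illustration.
places = {
    "London": (-0.13, 51.51),
    "Dublin": (-6.26, 53.35),  # not in the UK, yet inside the box
    "Paris": (2.35, 48.86),
}

for name, (lon, lat) in places.items():
    print(name, bbox_contains(UK_BBOX, lon, lat))
```

Dublin falls inside the box even though a UK dataset would not normally be about it, which is the kind of over-coverage a bounding box cannot express.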

You can see a GEMINI - DCAT mapping at https://github.com/agiorguk/gemini/issues/41; it was created in a Geospatial Commission funded project (although largely based on a W3C one, given that GEMINI is based on an ISO standard).

PeterParslow commented 9 months ago

Regarding using ONS GSS to specify "where" data is about, it rather depends on whether that would be taken to assert that it's "about the whole GSS" rather than just "located in the GSS".

Simple example: a list of all the trees in New Milton parish vs a list of the trees in my garden (which happens to lie within New Milton parish). This isn't a purely ONS GSS question, it's an ambiguity when giving the "location" of data, but could be amplified if that location is expressed in terms of a formally defined "place" (whether statistical or administrative geography).

rossbowen commented 8 months ago

Good point @PeterParslow! At ONS we make use of GSS geography codes wherever we can, and I imagine if a local authority were publishing datasets about their administrative area that they would be well served by using the GSS codes too. I guess the spirit of the dcterms:spatial property is to provide the most minimally sufficient geospatial area which provides a description of the coverage of the dataset.

So for a dataset of trees in New Milton - I'd probably use the GSS identifier for New Milton, but for a dataset of trees in my garden, I'd draw a geometry of my garden and provide that.

The GSS codes have essentially been translated into linked data with a similar structure to what DCAT is recommending (but making use of the geosparql vocab).

<https://data.gov.uk/datasets/example> a dcat:Dataset ;
    dcterms:spatial <http://statistics.data.gov.uk/id/statistical-geography/K02000001> ;
    .

<http://statistics.data.gov.uk/id/statistical-geography/K02000001> a dcterms:Location ;
    geosparql:hasGeometry <http://statistics.data.gov.uk/id/statistical-geography/K02000001/geometry> ;
    .

<http://statistics.data.gov.uk/id/statistical-geography/K02000001/geometry> a geosparql:Geometry ;
    geosparql:asWKT """MULTIPOLYGON (((...)))"""^^geosparql:wktLiteral ;
    .

DCAT has a good section on the use of dcterms:spatial. It also recommends some usage for the dcterms:Location class:

Usage note:

  • For an extensive geometry (i.e., a set of coordinates denoting the vertices of the relevant geographic area), the property locn:geometry SHOULD be used.
  • For a geographic bounding box delimiting a spatial area the property dcat:bbox SHOULD be used.
  • For the geographic center of a spatial area, or another characteristic point, the property dcat:centroid SHOULD be used.

So we end up with some examples like this involving geometries, bboxes and centroids.

<https://data.gov.uk/datasets/example> a dcat:Dataset ;
    dcterms:spatial [
        a dcterms:Location ;
        locn:geometry """POLYGON ((
        4.8842353 52.375108 , 4.884276 52.375153 ,
        4.8842567 52.375159 , 4.883981 52.375254 ,
        4.8838502 52.375109 , 4.883819 52.375075 ,
        4.8841037 52.374979 , 4.884143 52.374965 ,
        4.8842069 52.375035 , 4.884263 52.375016 ,
        4.8843200 52.374996 , 4.884255 52.374926 ,
        4.8843289 52.374901 , 4.884451 52.375034 ,
        4.8842353 52.375108
        ))"""^^geosparql:wktLiteral ;
    ] .

<https://data.gov.uk/datasets/example> a dcat:Dataset ;
    dcterms:spatial [
        a dcterms:Location ;
        dcat:bbox """POLYGON((
        3.053 47.975 , 7.24  47.975 ,
        7.24  53.504 , 3.053 53.504 ,
        3.053 47.975
        ))"""^^geosparql:wktLiteral ;
    ] .

<https://data.gov.uk/datasets/example> a dcat:Dataset ;
    dcterms:spatial [
        a dcterms:Location ;
        dcat:centroid "POINT(4.88412 52.37509)"^^geosparql:wktLiteral ;
    ] .
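
As a rough sketch of how the three forms relate, a bounding box and a characteristic point can both be derived from a full geometry. The Python below assumes the polygon from the first example above, reads its coordinates with a simple regular expression rather than a real WKT parser, and takes the centre of the box as the characteristic point (which is not the same thing as a true area centroid). The function names are hypothetical.

```python
import re

# Polygon coordinates copied from the locn:geometry example above.
WKT = """POLYGON ((
4.8842353 52.375108 , 4.884276 52.375153 ,
4.8842567 52.375159 , 4.883981 52.375254 ,
4.8838502 52.375109 , 4.883819 52.375075 ,
4.8841037 52.374979 , 4.884143 52.374965 ,
4.8842069 52.375035 , 4.884263 52.375016 ,
4.8843200 52.374996 , 4.884255 52.374926 ,
4.8843289 52.374901 , 4.884451 52.375034 ,
4.8842353 52.375108
))"""


def bbox_from_wkt_polygon(wkt):
    """Return (west, south, east, north) for a simple 2D WKT POLYGON."""
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", wkt)]
    lons, lats = numbers[0::2], numbers[1::2]
    return min(lons), min(lats), max(lons), max(lats)


def bbox_centre(bbox):
    """Return the (lon, lat) midpoint of the box, not a true centroid."""
    west, south, east, north = bbox
    return (west + east) / 2, (south + north) / 2


bbox = bbox_from_wkt_polygon(WKT)
lon, lat = bbox_centre(bbox)
print("bbox:", bbox)
print(f"centre: POINT({lon:.5f} {lat:.5f})")
```

For this polygon the box centre should land close to the dcat:centroid value in the third example, but the two are different calculations and will diverge for irregular shapes.
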
PeterParslow commented 8 months ago

Thanks for that, @rossbowen; I think it pretty much answers my action to provide examples of the three approaches! Note: in my experience, geo data people don't use "location by centroid" in metadata.

I also like your explanation of when to use a controlled identifier. I think it may need to go a bit further, with GSS identifiers being appropriate for statistical areas with other 'controlled lists' better for administrative areas (e.g. national parks).

GeoDCAT adds a "location as a geographic name" example, given as:

a dct:Location, skos:Concept ;
    dct:identifier "202" ;
    skos:inScheme [
        a skos:ConceptScheme ;
        dct:issued "2018-11-16T00:01:27+01:00"^^xsd:dateTime ;
        dct:title "UNSD - Methodology - Standard country or area codes for statistical use (M49)"@en
    ] ;
    skos:prefLabel "Sub-Saharan Africa"@en .

I'm sure you could provide a more "UK" example (e.g. using a GSS).
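
A possible shape for such a UK example, offered only as a sketch: the Python below builds a Turtle fragment for a named location from a GSS code, using the statistics.data.gov.uk URI pattern quoted earlier in this thread. The function name is hypothetical, the prefixes dct: and skos: are assumed to be declared elsewhere, and the skos:inScheme part of the GeoDCAT example is left out because no scheme description for GSS codes is given here.

```python
def named_location_turtle(gss_code, label):
    """Build a Turtle fragment describing a location by identifier and name.

    Assumption: the URI pattern is the one used for K02000001 earlier in
    the thread; prefix declarations are not included in the fragment.
    """
    uri = f"http://statistics.data.gov.uk/id/statistical-geography/{gss_code}"
    return (
        f"<{uri}> a dct:Location, skos:Concept ;\n"
        f'    dct:identifier "{gss_code}" ;\n'
        f'    skos:prefLabel "{label}"@en .\n'
    )


# K02000001 is the GSS code for the United Kingdom.
fragment = named_location_turtle("K02000001", "United Kingdom")
print(fragment)
```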

PeterParslow commented 8 months ago

In the meeting I took an action to provide examples. The software we use for the OS Data Catalogue (also used at Defra, EA, BGS, Scottish Government) provides its RDF output in RDF/XML. These examples are from there, so may be more useful to some readers and less useful to others. I am also not in a position to verify that it is "good RDF"; I do notice it doesn't include 'location by keyword' in the RDF, and I don't have any example other than 'by bounding box'.

  1. Bounding Box (unfortunately given as a polygon.... and even that isn't in the GeoDCAT recommended locn namespace!)
.... <http://www.opengis.net/def/crs/OGC/1.3/CRS84> Polygon((-8.45 49.86, -8.45 60.86, 1.78 60.86, 1.78 49.86, -8.45 49.86)) ...
AlasdairGray commented 8 months ago

Are you able to provide the link to the full RDF representation? That snippet is not valid RDF/XML (I've had similar problems in the past with OGC generated RDF).

PeterParslow commented 8 months ago

I gave the link to the Data Catalogue from which I downloaded a file & snipped out that dct:spatial bit. I think it was the file you can get from https://osmetadata.astuntechnology.com/geonetwork/srv/eng/catalog.search#/metadata/eaaad50e-0fa9-40be-84b5-d11740297320

It's generated by a widely used piece of open source software (Geonetwork Open Source), so if you can clearly state what makes it invalid we can raise a request to fix it (although we will move to a newer version soon, so it may have been fixed already).

AlasdairGray commented 8 months ago

Thanks for the link. I downloaded the whole RDF representation and ran it through the validator. Unfortunately, it is not valid RDF/XML.

[Screenshot 2023-11-09 at 17 14 50]

I've seen this before with the Geonetwork output and reported it at https://github.com/geonetwork/core-geonetwork/issues/7332. Although the issue was closed, it was not fixed.

PeterParslow commented 8 months ago

My XML validator (Oxygen) reports the same. It took me a few minutes to find out how Oxygen decides to validate the file; it has "built in knowledge" of a schema for the http://www.w3.org/ns/dcat# namespace, a RELAX NG Compact Schema "based on one originally written by James Clark in # http://lists.w3.org/Archives/Public/www-rdf-comments/2001JulSep/0248.html". I've never opened a RELAX NG file before.

What it is complaining about is a dct:license which has both a link "rdf:resource" and content. This is a shame because it is a very common XML pattern that is widely used in geospatial metadata. Other examples include keywords that both link to the authoritative register entry for the keyword & include the keyword (perhaps in a different language) locally for ease of use.

Regarding the issue you raised on core GeoNetwork, the comments suggest it has been moved because it relates to a specific GeoNetwork plug in. I'm not qualified to know if that's true, but you can see the issue still open at https://github.com/AstunTechnology/iso19139.gemini23/issues/146

You can see a related discussion at https://www.w3.org/2011/gld/track/issues/60, highlighting the desire to supplement the "license you link to" with some literal text. But the "solution" at Dublin Core appears to be to use "rights" with a sub item of "license" for the link? That doesn't seem to capture the idea of a "short name + link" as in the example I provided. In HTML that would be an anchor with a title.

Could you suggest (to Jo at the Astun GEMINI plugin issue) how this use case could be handled in RDF? I have only a small surface knowledge of RDF.

AlasdairGray commented 8 months ago

RDF/XML places additional constraints on the XML document. So something can be a valid XML document using only the terms defined in the RDF namespace but not be a valid RDF model. Note that the rest of the RDF/XML document may contain further errors.

RDF does not support labels on edges or a single edge pointing to a literal and a resource at the same time. The way to add a label to the resource would be to add an edge from the object resource. Note that there is also a problem with the URI for the resource https://osmetadata.astuntechnology.com/geonetwork/srv/resources/datasets/OS 1:50 000 Scale Colour Raster. URIs cannot contain spaces so I have replaced the spaces with +.

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

<https://osmetadata.astuntechnology.com/geonetwork/srv/resources/datasets/OS+1:50+000+Scale+Colour+Raster>
    a dcat:Dataset ;
    dct:license "Use limitation dependent upon licence" ;
    dct:license <http://www.ordnancesurvey.co.uk/oswebsite/business/licences/index.html> .

<http://www.ordnancesurvey.co.uk/oswebsite/business/licences/index.html>
    rdfs:comment "Licences and agreements explained" .
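
The space replacement described above can be reproduced with Python's standard library. This is a sketch only: `quote_plus` turns spaces into "+", matching the URI used in the example, while `quote` gives the percent-encoded form ("%20"), which may be the safer choice in a URI path since "+" is only conventionally read as a space in query strings. The base URL and title are copied from the comment above; keeping ":" unescaped via `safe` is a choice made here to match that URI.

```python
from urllib.parse import quote, quote_plus

BASE = "https://osmetadata.astuntechnology.com/geonetwork/srv/resources/datasets/"
title = "OS 1:50 000 Scale Colour Raster"

# Spaces become "+", as in the Turtle example above; ":" is left as-is.
encoded_plus = quote_plus(title, safe=":")

# Alternative: percent-encode the spaces instead.
encoded_pct = quote(title, safe=":")

print(BASE + encoded_plus)
print(BASE + encoded_pct)
```
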
PeterParslow commented 7 months ago

Thanks Alasdair. The problem I see with your proposal is that it appears to say that the Dataset has two licenses, rather than two statements about the same external object (which in this case may actually fail the DCT criteria to be called a licence, but let's put that to one side for now!).

AlasdairGray commented 7 months ago

It was unclear to me from the XML modelling what was meant since there were two statements regarding the license. If the two text sentences are meant to be about the same license then that can also be stated in the RDF.

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

<https://osmetadata.astuntechnology.com/geonetwork/srv/resources/datasets/OS+1:50+000+Scale+Colour+Raster>
    a dcat:Dataset ;
    dct:license <http://www.ordnancesurvey.co.uk/oswebsite/business/licences/index.html> .

<http://www.ordnancesurvey.co.uk/oswebsite/business/licences/index.html>
    rdfs:label "Use limitation dependent upon licence" ;
    rdfs:comment "Licences and agreements explained" .

The predicates rdfs:label and rdfs:comment could be replaced with other predicates, e.g. something from the SKOS vocabulary if that was more appropriate.

PeterParslow commented 7 months ago

Thanks Alasdair. That seems closer to the original intent, where the "two statements" were in the same XML element (which as you pointed out is not allowable in RDF/XML).

All this has rather diverted from trying to show how the three ways to state geographical coverage would look.

PeterParslow commented 7 months ago

I have manually adjusted the RDF/XML file that I linked to above. I hope my adjustments are in line with Alasdair's input. The file now validates at https://www.w3.org/RDF/Validator/rdfval. I have also changed the extension from .rdf to .txt in order to attach it in GitHub.

(Personally, I would put the namespace declarations at the top, but I have tried to minimise my manual edits to the Geonetwork file, given that this should be an example of dct:spatial.)

OS-1-50-colour-metadata.txt

PeterParslow commented 7 months ago

Turns out I was using an old version of Geonetwork. The current version provides DCAT in RDF that looks much cleaner to me, and validates at the W3C validator.

rdf-validator-test-record.rdf.txt