ONSdigital / csvcubed

A CLI to build linked data cubes.
https://gss-cogs.github.io/csvcubed-docs/external/
Apache License 2.0

dcat:distribution, model fix, inspect API updates #911

Open canwaf opened 6 months ago

canwaf commented 6 months ago

With the since-yanked csvcubed 0.5.0 release, we adopted the following change to the object model.

<4g-coverage.csv#dataset> <http://purl.org/dc/terms/description> "4G coverage in the UK by geographic area" ;
    <http://purl.org/dc/terms/title> "4G Coverage in the UK" ;
    <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/linked-data/cube#Attachable>, <http://purl.org/linked-data/cube#DataSet>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.w3.org/ns/dcat#Distribution>, <http://www.w3.org/ns/dcat#Resource> .

This affects csvcubed's inspect command, which runs the query at https://github.com/GSS-Cogs/csvcubed/blob/main/src/csvcubed/inspect/sparql_handler/sparql_queries/select_catalog_metadata.sparql. That query primarily looks for the dcat:Dataset:

        SELECT DISTINCT ?dataset
        WHERE {
            GRAPH ?someGraph {
                ?dataset a dcat:Dataset.
            }
        }

That type is no longer present, but it should be. Consider the application profile in which the CSV-W is the distribution. This leads us to the following:

<4g-coverage.csv#csvqb> a <http://purl.org/linked-data/cube#Attachable>, <http://purl.org/linked-data/cube#DataSet>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.w3.org/ns/dcat#Distribution>, <http://www.w3.org/ns/dcat#Resource> ;
    <http://www.w3.org/ns/dcat#isDistributionOf> <4g-coverage.csv#dataset> .
<4g-coverage.csv#dataset> <http://purl.org/dc/terms/description> "4G coverage in the UK by geographic area" ;
    <http://purl.org/dc/terms/title> "4G Coverage in the UK" .

So the catalogue metadata is attached to the dataset, but the CSV-W's primary subject is now the resource typed as qb:Attachable, qb:DataSet, etc.

This should allow the SPARQL query to remain unchanged.

The metadata attached to the dcat:Distribution should be at most the following. (Note: these are not requirements. We should add whatever we can fill in from information we already have; nothing new, please.)

classDiagram

class Distribution["Distribution a dcat:Distribution"] {
    +dcterms:identifier ∋ rdfs:Literal as xsd:string
    +dcterms:created ∋ rdfs:Literal as xsd:dateTime
    +dcterms:creator ∋ foaf:Agent
    +dcterms:issued ∋ rdfs:Literal as xsd:dateTime
    +prov:wasDerivedFrom ∋ [prov:Entity]
    +prov:wasGeneratedBy ∋ prov:Activity
    +dcat:downloadURL ∋ rdfs:Resource
    +dcat:byteSize ∋ rdfs:Literal as xsd:nonNegativeInteger
    +dcat:mediaType ∋ dcterms:MediaType
    +wdrs:describedBy ∋ rdfs:Resource
    +spdx:checksum ∋ spdx:Checksum
}

tl;dr: the main subject of the CSV-W metadata file should be <dataset.csv#csvqb>, which is dcat:isDistributionOf the dcat:Dataset. The dcat:Dataset is the resource that should have the catalogue metadata attached to it.

SarahJohnsonONS commented 5 months ago

Currently, cubes that have been built using csvcubed v0.4.10 or lower cannot be inspected using csvcubed v0.5.0 or greater, as the primary identifier has changed from some-dataset.csv#dataset to some-dataset.csv#csvqb. In order to facilitate this change, a new distribution_uri property has been added to the CatalogMetadata class, and the select_catalog_metadata SPARQL query has been updated to extract the value of this property, if it is present.

Additional information on the version of csvcubed used to build the cube is also now available in the metadata JSON file, which may also be leveraged to determine how the cube should be inspected.

The distribution_uri value is not present in cubes built using older versions of csvcubed, so the inspect command fails when a newer version of csvcubed is used to inspect them. This is because the MetadataPrinter class now uses the distribution_uri in the get_primary_csv_url() method via DataCubeRepository.get_cube_identifiers_for_dataset(). There will be other places where there is a discrepancy, but this is where I would start.
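One way to picture a fix is a fallback from the distribution identifier to the dataset identifier. The sketch below is only an illustration: `CatalogMetadataSketch` and `primary_identifier` are hypothetical names, not csvcubed's actual classes or functions, and the only field taken from this thread is `distribution_uri`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogMetadataSketch:
    """Hypothetical stand-in for CatalogMetadata; only two fields are modelled."""

    dataset_uri: str
    # Assumed to be None for cubes built before csvcubed v0.5.0.
    distribution_uri: Optional[str] = None


def primary_identifier(meta: CatalogMetadataSketch) -> str:
    """Return the URI to use when looking up the cube's primary CSV."""
    # Newer cubes: the distribution (#csvqb) is the primary subject.
    if meta.distribution_uri:
        return meta.distribution_uri
    # Older cubes: fall back to the dataset (#dataset) identifier.
    return meta.dataset_uri


old_cube = CatalogMetadataSketch(dataset_uri="some-dataset.csv#dataset")
new_cube = CatalogMetadataSketch(
    dataset_uri="some-dataset.csv#dataset",
    distribution_uri="some-dataset.csv#csvqb",
)

print(primary_identifier(old_cube))
print(primary_identifier(new_cube))
```

Whether the fallback belongs in get_primary_csv_url(), in the repository layer, or in the SPARQL query itself is a separate design question that this sketch does not settle.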

Possible solutions:

Build activity information

csvcubed version < 0.5.0

...
{
    "@id": "aged-16-to-64-years-level-3-or-above-qualifications.csv#dataset",
    "http://www.w3.org/ns/prov#wasGeneratedBy": [
        {
            "@id": "aged-16-to-64-years-level-3-or-above-qualifications.csv#csvcubed-build-activity"
        }
    ]
}
...
{
    "@id": "aged-16-to-64-years-level-3-or-above-qualifications.csv#csvcubed-build-activity",
    "@type": [
        "http://www.w3.org/2000/01/rdf-schema#Resource",
        "http://www.w3.org/ns/prov#Activity"
    ],
    "http://www.w3.org/ns/prov#used": [
        {
            "@id": "https://github.com/GSS-Cogs/csvcubed/releases/tag/v0.4.10"
        }
    ]
}
...

csvcubed version >= 0.5.0

...
{
    "@id": "some-title.csv#csvqb",
    "http://www.w3.org/ns/prov#wasDerivedFrom": [
        {
            "@id": "https://github.com/GSS-Cogs/csvcubed/releases/tag/v0.5.0"
        }
    ],
    "http://www.w3.org/ns/prov#wasGeneratedBy": [
        {
            "@id": "some-title.csv#csvcubed-build-activity"
        }
    ]
}
...
{
    "@id": "some-title.csv#csvcubed-build-activity",
    "@type": [
        "http://www.w3.org/ns/prov#Activity",
        "http://www.w3.org/2000/01/rdf-schema#Resource"
    ],
    "http://www.w3.org/ns/prov#used": [
        {
            "@id": "https://github.com/GSS-Cogs/csvcubed/releases/tag/v0.5.0"
        }
    ]
},
{
    "@id": "https://github.com/GSS-Cogs/csvcubed/releases/tag/v0.5.0",
    "@type": [
        "http://www.w3.org/ns/prov#Entity",
        "http://www.w3.org/2000/01/rdf-schema#Resource"
    ],
    "http://purl.org/dc/terms/title": [
        {
            "@language": "en",
            "@value": "csvcubed v0.5.0"
        }
    ],
    "http://www.w3.org/ns/prov#hasPrimarySource": [
        {
            "@id": "https://pypi.org/project/csvcubed/0.5.0"
        }
    ],
    "http://www.w3.org/ns/prov#wasGeneratedBy": [
        {
            "@id": "some-title.csv#csvcubed-build-activity"
        }
    ]
}
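As a rough sketch of how the build activity information above could drive inspection, the following reads the release tag from the prov:used link on the build activity and picks the expected primary fragment. It assumes the metadata has already been loaded as a list of expanded JSON-LD nodes shaped like the snippets above; `csvcubed_build_version` and `primary_fragment` are hypothetical helper names, not part of csvcubed.

```python
import re
from typing import Optional, Tuple

PROV_USED = "http://www.w3.org/ns/prov#used"
PROV_ACTIVITY = "http://www.w3.org/ns/prov#Activity"
# Matches release tag URLs such as .../releases/tag/v0.4.10
RELEASE_TAG = re.compile(r"/releases/tag/v(\d+)\.(\d+)\.(\d+)")


def csvcubed_build_version(nodes: list) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) from the build activity, or None if not found."""
    for node in nodes:
        # Only prov:Activity nodes carry the prov:used release link.
        if PROV_ACTIVITY not in node.get("@type", []):
            continue
        for used in node.get(PROV_USED, []):
            match = RELEASE_TAG.search(used.get("@id", ""))
            if match:
                return tuple(int(part) for part in match.groups())
    return None


def primary_fragment(nodes: list) -> str:
    """Pick the fragment of the primary subject based on the build version."""
    version = csvcubed_build_version(nodes)
    # Assumption: cubes with no recognisable version are treated as pre-0.5.0.
    if version is not None and version >= (0, 5, 0):
        return "#csvqb"
    return "#dataset"


old_nodes = [
    {
        "@id": "aged-16-to-64-years-level-3-or-above-qualifications.csv#csvcubed-build-activity",
        "@type": [
            "http://www.w3.org/2000/01/rdf-schema#Resource",
            "http://www.w3.org/ns/prov#Activity",
        ],
        PROV_USED: [
            {"@id": "https://github.com/GSS-Cogs/csvcubed/releases/tag/v0.4.10"}
        ],
    }
]
new_nodes = [
    {
        "@id": "some-title.csv#csvcubed-build-activity",
        "@type": [
            "http://www.w3.org/ns/prov#Activity",
            "http://www.w3.org/2000/01/rdf-schema#Resource",
        ],
        PROV_USED: [
            {"@id": "https://github.com/GSS-Cogs/csvcubed/releases/tag/v0.5.0"}
        ],
    }
]

print(csvcubed_build_version(old_nodes), primary_fragment(old_nodes))
print(csvcubed_build_version(new_nodes), primary_fragment(new_nodes))
```

Comparing versions as integer tuples rather than strings avoids mis-ordering cases such as 0.4.10 against 0.4.9.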