ONSdigital / csvcubed-pmd

Utilities to convert csvcubed CSV-Ws to RDF acceptable to PMD
Apache License 2.0

Bug Report - Pmdification resulting in empty datasets on PMD #31

Open GregoryJPavier opened 2 months ago

GregoryJPavier commented 2 months ago

On PMD, new drafts pushed through the pipeline were appearing with all their code lists intact, but without any actual observational data.

Turns out the culprit is the pmdification step in the pipeline, which uses the csvcubed-pmd library’s pmdify script.

Essentially, a ‘barrier’ is being created within the graph relationships of the dataset's catalog entry. The structure of these relationships has changed due to updates to csvcubed, but the pmdify script has not been brought up to parity.

Pmdify still assigns the pmdcat:DataCube type to the dcat:Dataset, but it now needs to be assigned to the dcat:Distribution. Note: take care not to break code lists! Code lists do not use dcat:Distribution.

So, because of this change in structure, the pmdify script is no longer picking everything up: it does not appear to reach the qb:DataSet, and hence never reaches the observational data.
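The "barrier" described above can be illustrated with a small diagnostic. This is only a sketch: triples are modelled as plain Python tuples of prefixed names rather than a real RDF graph, and `entries_missing_cube` is a hypothetical helper, not part of csvcubed-pmd. It flags any catalog entry whose pmdcat:datasetContents target is not itself typed as a qb:DataSet (code-list entries would need to be excluded separately).

```python
# Triples are plain (subject, predicate, object) tuples; prefixed names are
# used as strings purely for illustration.
RDF_TYPE = "rdf:type"
DATASET_CONTENTS = "pmdcat:datasetContents"
QB_DATASET = "qb:DataSet"


def entries_missing_cube(triples):
    """Return catalog entries whose pmdcat:datasetContents target is not a qb:DataSet."""
    typed_as_cube = {s for (s, p, o) in triples if p == RDF_TYPE and o == QB_DATASET}
    return sorted(
        s for (s, p, o) in triples
        if p == DATASET_CONTENTS and o not in typed_as_cube
    )


# The structure currently produced, as described in this issue.
existing = {
    ("#data-catalog-entry", DATASET_CONTENTS, "#dataset"),
    ("#dataset", RDF_TYPE, "pmdcat:DataCube"),
    ("#dataset", "dcat:distribution", "#qbDataSet"),
    ("#qbDataSet", "dcat:isDistributionOf", "#dataset"),
    ("#qbDataSet", RDF_TYPE, QB_DATASET),
    ("#qbDataSet", RDF_TYPE, "dcat:Distribution"),
}

# The entry points at #dataset, which is not a qb:DataSet, so it is flagged.
print(entries_missing_cube(existing))
```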

Below are visual representations of the data structure before the updates, afterwards, and the fixed version we're aiming for.

Old structure diagram

flowchart TD

    subgraph Old
        entry[#data-catalog-entry]
        dataset[#dataset]
        qbDataSet_type[qb:DataSet]
        pmdcatDatacube[pmdcat:DataCube]

        entry -->|pmdcat:datasetContents| dataset

        dataset -->|rdf:type| qbDataSet_type
        dataset -->|rdf:type| pmdcatDatacube
    end    

New structure diagram

flowchart TD

    subgraph Existing
        entry[#data-catalog-entry]
        dataset[#dataset]
        qbDataSet_type[qb:DataSet]
        qbDataSet_distribution[#qbDataSet]
        pmdcatDataCube[pmdcat:DataCube]
        dcatDistribution[dcat:Distribution]

        entry -->|pmdcat:datasetContents| dataset

        dataset -->|dcat:distribution| qbDataSet_distribution
        dataset -->|rdf:type| pmdcatDataCube

        qbDataSet_distribution -->|dcat:isDistributionOf| dataset
        qbDataSet_distribution -->|rdf:type| qbDataSet_type
        qbDataSet_distribution -->|rdf:type| dcatDistribution
    end

Fixed structure diagram

flowchart TD
    subgraph Fixed
        entry[#data-catalog-entry]
        dataset[#dataset]
        qbDataSet_distribution[#qbDataSet]
        qbDataSet_type[qb:DataSet]
        pmdcatDataCube[pmdcat:DataCube]
        dcatDistribution[dcat:Distribution]
        dcatDataset[dcat:Dataset]

        entry -->|rdf:type| dcatDataset 
        entry -->|pmdcat:datasetContents| qbDataSet_distribution

        qbDataSet_distribution -->|rdf:type| pmdcatDataCube
        qbDataSet_distribution -->|rdf:type| qbDataSet_type
        qbDataSet_distribution -->|rdf:type| dcatDistribution
        dataset -->|dcat:distribution| qbDataSet_distribution

        qbDataSet_distribution -->|dcat:isDistributionOf| dataset
    end
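The move from the "Existing" to the "Fixed" structure can be sketched as a rewrite over triples. This is not the pmdify implementation: triples are modelled as plain tuples, `repoint_to_distribution` is a hypothetical name, and the sketch assumes each dataset has at most one dcat:distribution. It only covers the two triples that move (the pmdcat:DataCube type and the pmdcat:datasetContents link), and it leaves anything without a dcat:distribution, such as a code list, untouched.

```python
RDF_TYPE = "rdf:type"
DATASET_CONTENTS = "pmdcat:datasetContents"
DATA_CUBE = "pmdcat:DataCube"
DISTRIBUTION = "dcat:distribution"


def repoint_to_distribution(triples):
    """Move pmdcat:DataCube and pmdcat:datasetContents from a dataset to its distribution.

    Resources without a dcat:distribution (e.g. code lists) are left untouched.
    Assumes at most one distribution per dataset.
    """
    dist_of = {s: o for (s, p, o) in triples if p == DISTRIBUTION}
    fixed = set()
    for s, p, o in triples:
        if p == RDF_TYPE and o == DATA_CUBE and s in dist_of:
            # The distribution, not the dataset, carries the DataCube type.
            fixed.add((dist_of[s], RDF_TYPE, DATA_CUBE))
        elif p == DATASET_CONTENTS and o in dist_of:
            # The catalog entry now points straight at the distribution.
            fixed.add((s, DATASET_CONTENTS, dist_of[o]))
        else:
            fixed.add((s, p, o))
    return fixed


existing = {
    ("#data-catalog-entry", DATASET_CONTENTS, "#dataset"),
    ("#dataset", RDF_TYPE, DATA_CUBE),
    ("#dataset", DISTRIBUTION, "#qbDataSet"),
    ("#qbDataSet", "dcat:isDistributionOf", "#dataset"),
    ("#qbDataSet", RDF_TYPE, "qb:DataSet"),
    ("#qbDataSet", RDF_TYPE, "dcat:Distribution"),
    # A code list: no dcat:distribution, so it must pass through unchanged.
    ("#code-list-entry", DATASET_CONTENTS, "#scheme"),
    ("#scheme", RDF_TYPE, "skos:ConceptScheme"),
}

fixed = repoint_to_distribution(existing)
```

Keying the rewrite on the presence of dcat:distribution is one way to honour the "do not touch code lists" constraint without needing a separate code-list check.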

The Code

We believe the issue lies within the _get_catalog_entry_from_dcat_dataset function in the pmdify script.

This section in particular may be of interest as this is where the code is assigning values to the pmdcat_dataset variable, which may need to be updated.

RickMoynihan commented 1 month ago

Hi, Bill Roberts pointed me at your query here.

I think the issue is that you have broken the triple PMD requires to link a catalog entry to the qb:DataSet (i.e. the resource that contains the observation data through the <obs> qb:dataSet <dataCube> relation).

Basically, pmdcat:datasetContents and the corresponding pmdcat:DataCube type are used to tell PMD where it can find the data to render with the datacube viewer, i.e. it's best to consider these vocabulary items as PMD-specific rendering hints rather than terms that carry broader semantics. Put another way, pmdcat:datasetContents doesn't mean "these are the dataset contents"; it means "PMD, when a user is looking at this catalog entry, try to render a UI for this resource".

So I think you most likely want to construct data like this:

flowchart TD
    CE[#data-catalog-entry] -->|another:predicate| DS[#dataset]
    DS --> |dcat:distribution| QB[#qbDataSet]
    CE -->|pmdcat:datasetContents|QB
    QB -->|rdf:type| ClassPQB[pmdcat:DataCube]
    QB -->|dcat:isDistributionOf|DS
    QB -->|rdf:type| ClassQB[qb:DataSet]
    QB -->|rdf:type| ClassDDS[dcat:Distribution]

    Obs(obs 1..N) -->|qb:dataSet| QB

Where another:predicate is an appropriate property of your choosing; if you can't find one, you can always coin something like ons:datasetContents.
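As a rough check of the structure suggested above, the following sketch models the triples as plain tuples and follows the two hops described: catalog entry to cube via pmdcat:datasetContents, then observations to cube via qb:dataSet. `observations_for_entry` is a hypothetical helper and only a model of the linkage, not PMD's actual lookup logic; the observation names are made up, and ons:datasetContents stands in for the placeholder another:predicate.

```python
def observations_for_entry(triples, entry):
    """Return observations reachable from `entry` via pmdcat:datasetContents."""
    # Hop 1: which resources does the catalog entry ask PMD to render?
    cubes = {
        o for (s, p, o) in triples
        if s == entry and p == "pmdcat:datasetContents"
    }
    # Hop 2: which observations declare membership of those cubes?
    return sorted(s for (s, p, o) in triples if p == "qb:dataSet" and o in cubes)


suggested = {
    ("#data-catalog-entry", "ons:datasetContents", "#dataset"),  # placeholder predicate
    ("#dataset", "dcat:distribution", "#qbDataSet"),
    ("#data-catalog-entry", "pmdcat:datasetContents", "#qbDataSet"),
    ("#qbDataSet", "rdf:type", "pmdcat:DataCube"),
    ("#qbDataSet", "dcat:isDistributionOf", "#dataset"),
    ("#qbDataSet", "rdf:type", "qb:DataSet"),
    ("#qbDataSet", "rdf:type", "dcat:Distribution"),
    ("#obs-1", "qb:dataSet", "#qbDataSet"),  # illustrative observations
    ("#obs-2", "qb:dataSet", "#qbDataSet"),
}

print(observations_for_entry(suggested, "#data-catalog-entry"))
```

If pmdcat:datasetContents pointed at #dataset instead, the same function would return an empty list, which matches the empty-dataset symptom reported in this issue.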