ONSdigital / sdg-csv-data-filler

MIT License
Mapping SDG meta data fields to metadatajson fields #9

Open jwestw opened 3 years ago

robons commented 3 years ago

So we need the following rdf added to the .csv-metadata.json file:

 <#catalog-entry> a dcat:Dataset; # Needs to be mapped to a pmdcat:Dataset in gss-cogs augmentation of imported CSV-W.
        rdfs:label "Short Label Describing Dataset"@en ;
        dct:title "Short Label Describing Dataset"@en ;
        pmdcat:datasetContents <#dataset> ; # Unfortunate that we have to use PMDCAT here.
        dct:creator <http://creator.uri>;
        rdfs:comment "Put a short description here."@en ;
        dct:description "Put big beasty description here"@en ;
        dct:issued "2020-12-23"^^xsd:date ;
        dct:modified "2021-01-06T15:59:04.420292+00:00"^^xsd:dateTime ;
        dct:publisher <http://publisher.uri>;
        dcat:keyword "Births, Deaths and Marriages, Cause of Death, Deaths, Death Statistics, Coronavirus (COVID-19) Statistics"@en ;
        dcat:landingPage <http://user.landing.page.to.download.data.csv> .
        # pmdcat:graph ns1:registrations ; ; # TODO: Push this into gss-cogs augmentation of imported CSV-W.
        # void:sparqlEndpoint <http://gss-data.org.uk/sparql> ; # TODO: Push this into gss-cogs augmentation of imported CSV-W.

So here are the fields we'll need metadata for and appropriate mappings to the SDG metadata json file format e.g.:

| Dataset metadata | SDG Metadata JSON Key | Hardcoded Value |
|---|---|---|
| rdfs:label | indicator_available or graph_title? | |
| dct:title | identical to rdfs:label | |
| pmdcat:datasetContents | | <#dataset> |
| dct:creator | | https://www.ons.gov.uk |
| rdfs:comment | indicator_name? | |
| dct:description | A combination of indicator_name, target_name and computation_definitions? | |
| dct:issued | source_release_date_1 (re-format to ISO) | |
| dct:modified | source_release_date_n? (re-format to ISO) | |
| dct:publisher | | https://www.ons.gov.uk |
| dcat:keyword | data_keywords (but alter ;s to ,s) | |
| dcat:landingPage | source_url_1 | |
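The two mechanical conversions in the table (";" to "," in the keywords, and re-formatting dates to ISO) could be sketched roughly as below. The helper names are made up for illustration, and the default input date format is purely an assumption: the format actually used in source_release_date_1 would need checking against the real metadata files.

```python
from datetime import datetime


def keywords_to_dcat(data_keywords: str) -> str:
    """Turn a ';'-separated keyword string into a ', '-separated one."""
    parts = [part.strip() for part in data_keywords.split(";")]
    # Drop empty entries left behind by trailing or doubled separators.
    return ", ".join(part for part in parts if part)


def to_iso_date(raw_date: str, input_format: str = "%d/%m/%Y") -> str:
    """Re-format a date string as an ISO 8601 date (YYYY-MM-DD).

    The default input_format is an assumption, not a documented property
    of the SDG metadata; pass the real format once it is known.
    """
    return datetime.strptime(raw_date.strip(), input_format).date().isoformat()
```

Passing the format in as a parameter keeps the guess about the source date layout in one place, so it is easy to correct later.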

@james-westwood I've added some provisional mappings above. The ones I'm a bit uncertain about have question marks next to them. Would you be able to verify on your side which fields are most appropriate?

Just some more detailed discussion:

I'm not sure what the difference between indicator_available and graph_title are, we just need a title for the dataset, whichever is best.

rdfs:comment is where we can place a sentence or two describing the dataset. Which field is it best to use here?

dct:description is the longest description, where we can add a fair bit of detail. Multiple paragraphs even. We can programmatically combine several of your fields to come up with something which provides a good summary of what the user is looking at. Which fields should we combine?

Let me know if you're uncertain about anything.

ANikolova22 commented 3 years ago

Hello! Thought I'd leave comments directly here. I think if you just need the name of the dataset for label and title, indicator_available would be the best fit. My only concern is that I think this field may be left blank for some indicators, in which case indicator_name should be used. But if it's easier for you, we could rectify that by going through the indicators' metadata and making sure the indicator title is duplicated in the indicator_available field if that field is blank (does that make sense?)

For the comment variable, I think the best option would be mapping it to Indicator_available_description. This field is not always populated, but it would be important for proxy indicators. All other information would fall in the description.

For issued... as I mentioned on Slack, it's not straightforward and can perhaps be a bit misleading if based on source release dates, as they can be quite different from when the indicator itself was published. Also, the indicator may be modified in between source publications, so the modified field should be linked to the date the indicator was last modified, not the source(s).

For the description, it would be good to combine information from fields 'definitions', 'other_information' and perhaps even 'calculations'. Not sure how long you want it to be, but including the calculations would give a comprehensive picture.

robons commented 3 years ago

> Hello! Thought I'd leave comments directly here. I think if you just need the name of the dataset for label and title, indicator_available would be the best fit. My only concern is that I think this field may be left blank for some indicators, in which case indicator_name should be used. But if it's easier for you, we could rectify that by going through the indicators' metadata and making sure the indicator title is duplicated in the indicator_available field if that field is blank (does that make sense?)

So meta["indicator_available"] if "indicator_available" in meta else meta["indicator_name"]? I.e. can we be confident that indicator_name will always be set?
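A minimal sketch of that fallback, assuming the metadata has been loaded into a plain dict called meta (the helper name dataset_title is made up for illustration). It treats a missing key and a blank value the same way, since a bare "in meta" check would still pick up a field that is present but empty. It also assumes indicator_name is always set, which is the open question here.

```python
def dataset_title(meta: dict) -> str:
    """Prefer indicator_available; fall back to indicator_name.

    A missing key, a None value and a blank/whitespace-only string are all
    treated as "not set". Assumes indicator_name is always populated.
    """
    available = (meta.get("indicator_available") or "").strip()
    if available:
        return available
    return meta["indicator_name"].strip()
```

The same value would then be used for both rdfs:label and dct:title.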

> For the comment variable, I think the best option would be mapping it to Indicator_available_description. This field is not always populated, but it would be important for proxy indicators. All other information would fall in the description.

Ok, the comment should be optional, so it should be okay that indicator_available_description isn't always set. We'll use it where available.

> For issued... as I mentioned on Slack, it's not straightforward and can perhaps be a bit misleading if based on source release dates, as they can be quite different from when the indicator itself was published. Also, the indicator may be modified in between source publications, so the modified field should be linked to the date the indicator was last modified, not the source(s).

So the model we have for data has places for both when the information was published to our platform (& subsequently modified) and the date the original data was first published at source (& subsequently modified). The only information we need from your side is the date when the data was originally published and if (& when) it was modified. Do we have anywhere we can get this information from?

> For the description, it would be good to combine information from fields 'definitions', 'other_information' and perhaps even 'calculations'. Not sure how long you want it to be, but including the calculations would give a comprehensive picture.

Sounds good to me, we'll just jam them together, separating with newlines.
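Roughly, the joining could look like the sketch below. The function name build_description is hypothetical, and the key names are passed in rather than hardcoded because the exact SDG metadata keys are not settled here. Blank or missing fields are skipped so they don't leave empty gaps, and the blank-line separator is just one reasonable reading of "separating with newlines".

```python
def build_description(meta: dict, keys: list, separator: str = "\n\n") -> str:
    """Join the non-empty values of the given metadata keys, in order."""
    sections = []
    for key in keys:
        value = meta.get(key)
        # Skip keys that are missing, None, or blank after trimming.
        if value is None:
            continue
        text = str(value).strip()
        if text:
            sections.append(text)
    return separator.join(sections)
```

For example, build_description(meta, ["definitions", "other_information", "calculations"]) would follow the field order suggested above, with those three names standing in for whatever the real keys turn out to be.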

ANikolova22 commented 3 years ago

> The only information we need from your side is the date when the data was originally published and if (& when) it was modified.

Ok, so we don't have the date when the data was originally published (at source), but the modification date would be the source_release_date_N. This may or may not coincide with the originally published date, as it would reflect the latest date of the used source at the time of indicator upload.

EmmaWoodONS commented 3 years ago

rdfs:label

I agree with the suggestion to use indicator_available unless it is blank, in which case use indicator_name. As I understand it, we should not populate indicator_available unless it is different to indicator_name, as this causes the page to look weird (essentially causing a repeated title at the top of the page). Graph_title won't work for label, as we now also have the option of multiple graph titles depending on the units selected.

rdfs:comment

I agree with Atanaska that national_indicator_description is probably best here, as it is always short. We could also revisit how we use this and perhaps populate it more often.

dct:description

This should include everything that isn't already accounted for, particularly computation_calculations, computation_definitions, other_info, and the free text (important caveats are usually in the last two). The free text often isn't used and doesn't have a name in the metadata, it just comes at the end of the file. We should potentially also include the link to the UN metadata and the links to the sources.