ldodds opened 3 years ago
Here's a valid JSON-LD document that conforms to the DCAT and GEODCAT profile:
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@type": "dcat:Catalog",
  "dct:title": "Risk Data Library",
  "dct:description": "A simple catalog to find datasets compliant with the Risk Data Library",
  "foaf:homepage": "http://jkan.riskdatalibrary.org/",
  "dct:publisher": {
    "@type": "foaf:Organization",
    "foaf:name": "GFDRR",
    "foaf:homepage": "https://www.gfdrr.org/"
  },
  "dcat:dataset": [
    {
      "@id": "http://jkan.riskdatalibrary.org/datasets/exp-afg-agriculture/",
      "@type": "dcat:Dataset",
      "dct:identifier": "...",
      "dct:title": "Afghanistan agriculture",
      "dct:description": "Location, area and USD value of rainfed and irrigated agricultural crops in Afghanistan.",
      "dcat:landingPage": "http://jkan.riskdatalibrary.org/datasets/exp-afg-agriculture/",
      "dct:license": "https://creativecommons.org/licenses/by-sa/4.0/",
      "dct:publisher": {
        "@type": "foaf:Organization",
        "foaf:name": "GFDRR",
        "foaf:homepage": "https://www.gfdrr.org/"
      },
      "dcat:contactPoint": {
        "vcard:fn": "GFDRR",
        "vcard:hasEmail": "mailto:contact@riskdatalibrary.org"
      },
      "dct:spatial": [
        {
          "rdfs:label": "Afghanistan"
        }
      ],
      "dct:keyword": [ "Exposure" ],
      "dcat:distribution": [
        {
          "@type": "dcat:Distribution",
          "dct:title": "Afghanistan agriculture",
          "dcat:accessURL": "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/exposure/exp-afg-infrastructures.zip",
          "dct:format": "geotiff"
        }
      ]
    }
  ]
}
Additional datasets would be added to the array value of the dcat:dataset property.
I've not tried to map all of the RDL metadata into that schema yet, but it's not clear if other catalogues could harvest and import this anyway.
Which one will be recommended for harvesting by the WB DHH?
I guess we have several endpoints because those are provided by JKAN by default, right?
If so, I would simply keep the one that is closest to the profile above.
I've not tried to map all of the RDL metadata into that schema yet, but it's not clear if other catalogues could harvest and import this anyway.
Agree that it is not the priority.
Thanks @ldodds
FYI - rdl-datasets.json is the canonical endpoint for RDL-JKAN.
Took me a while to remember where the datasets.json file came from. It is part of the base JKAN implementation and I intended to remove it at the earliest opportunity - I simply did not want to break the JKAN install while setting up the RDL instance.
I will mock up a GEODCAT endpoint shortly.
Hi @ldodds @matamadio
Attached is an example rdl-geodcat.json file that is auto-generated based on the included/available datasets.
The only attribute I've left alone is dct:identifier - I guess this could be the URL as well, but that is not a stable solution for the reasons outlined earlier (change of platform/service provider, etc.).
[EDIT: here I am referring to our conversation with @jeanpommier via email on the possible datapackage spec, circa 8 April 2021]
I can deploy this implementation if this example is adequate, and a solution to the above is decided on.
rdl-geodcat.txt
[EDIT: File is provided as a .txt file as GitHub won't let me upload a JSON file]
Review alignment of metadata with HDX schema (also using DCAT)
Turns out it was faster for me to hack together a Liquid template to generate the entries.
The approach is to extract the metadata from the JKAN datasets, which are structured according to the schema specification, and manually map entries to DCAT fields where JKAN does not already do so. Note that this doesn't pull the data out of the pages, so in the long term inadvertent inconsistencies between the page-embedded metadata and the endpoint may occur.
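For illustration, a minimal sketch of what such a Liquid loop could look like. It assumes the dataset pages are exposed as the site.datasets collection with the base JKAN front matter fields (title, notes, license); those names are assumptions for the sketch, not necessarily what the deployed template uses.

  {% comment %}
    Hypothetical sketch: emit one dcat:Dataset entry per JKAN dataset page.
    Field names (site.datasets, dataset.title, dataset.notes, dataset.license)
    follow the base JKAN front matter and may differ from the real template.
  {% endcomment %}
  "dcat:dataset": [
    {% for dataset in site.datasets %}
    {
      "@id": "{{ dataset.url }}",
      "@type": "dcat:Dataset",
      "dct:identifier": "{{ dataset.url }}",
      "dct:title": {{ dataset.title | jsonify }},
      "dct:description": {{ dataset.notes | jsonify }},
      "dcat:landingPage": "{{ dataset.url }}",
      "dct:license": "{{ dataset.license }}"
    }{% unless forloop.last %},{% endunless %}
    {% endfor %}
  ]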
@ldodds note that downloadURL is used in combination with dct:format and dcat:mediaType. format is extracted from the dataset entry, whereas mediaType is inferred from downloadURL. Really, all it does currently is prepend "application/" to the entry, unless the URL points to a bare CSV file.
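As a rough sketch of that inference (an assumption of how it could be written, not the deployed template): take the extension of the resource URL, prefix it with "application/", and special-case bare CSV files.

  {% comment %}
    Hypothetical sketch, intended to run inside a loop over a dataset's
    resources. resource.url and resource.format follow the JKAN resource
    structure; mapping bare CSV files to text/csv is an assumption.
  {% endcomment %}
  {% assign ext = resource.url | split: "." | last | downcase %}
  "dct:format": "{{ resource.format }}",
  {% if ext == "csv" %}
  "dcat:mediaType": "text/csv"
  {% else %}
  "dcat:mediaType": "application/{{ ext }}"
  {% endif %}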
Example for a multi-entry dataset page below. The latest full example is attached as a txt file.
{
  "@id": "/datasets/lss-mdg-mh/",
  "@type": "dcat:Dataset",
  "dct:identifier": "/datasets/lss-mdg-mh/",
  "dct:title": "Madagascar Multi-Hazard loss scenarios",
  "dct:description": "Direct loss simulated on exposed building asset measured as Average Annual Losses (AAL) and six Return Period scenarios for multiple hazards (earthquake, pluvial flood, storm surge and strong wind).",
  "dcat:landingPage": "/datasets/lss-mdg-mh/",
  "dct:license": "https://creativecommons.org/licenses/by/4.0/",
  "dct:publisher": {
    "@type": "foaf:Organization",
    "foaf:name": "GFDRR",
    "foaf:homepage": "https://www.gfdrr.org/"
  },
  "dcat:contactPoint": {
    "vcard:fn": "GFDRR",
    "vcard:hasEmail": "contact@riskdatalibrary.org"
  },
  "dct:spatial": [
    {
      "rdfs:label": "Madagascar"
    }
  ],
  "dct:keyword": ["Loss"],
  "dcat:distribution": [
    {
      "@type": "dcat:Distribution",
      "dct:title": "Madagascar multi-hazard loss scenarios",
      "dct:format": "gpkg",
      "dcat:downloadURL": "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/loss/lss-mdg-mh.gpkg",
      "dcat:mediaType": "application/gpkg"
    },
    {
      "@type": "dcat:Distribution",
      "dct:title": "Madagascar multi-hazard loss exceedence-probability curves",
      "dct:format": "csv",
      "dcat:downloadURL": "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/loss/lss-mdg-mh-epc.zip",
      "dcat:mediaType": "application/zip"
    }
  ]
}
The metadata in the catalogue will be harvested by the World Bank Data Hub. While they can write a custom importer, the preference is to have a DCAT-compliant endpoint that provides the necessary metadata along with stable identifiers for individual datasets.
At the moment there are several JSON endpoints.
We need to agree: