ldodds opened 3 years ago
Here's a valid JSON-LD document that conforms to the DCAT and GEODCAT profile:
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@type": "dcat:Catalog",
  "dct:title": "Risk Data Library",
  "dct:description": "A simple catalog to find datasets compliant with the Risk Data Library",
  "foaf:homepage": "http://jkan.riskdatalibrary.org/",
  "dct:publisher": {
    "@type": "foaf:Organization",
    "foaf:name": "GFDRR",
    "foaf:homepage": "https://www.gfdrr.org/"
  },
  "dcat:dataset": [
    {
      "@id": "http://jkan.riskdatalibrary.org/datasets/exp-afg-agriculture/",
      "@type": "dcat:Dataset",
      "dct:identifier": "...",
      "dct:title": "Afghanistan agriculture",
      "dct:description": "Location, area and USD value of rainfed and irrigated agricultural crops in Afghanistan.",
      "dcat:landingPage": "http://jkan.riskdatalibrary.org/datasets/exp-afg-agriculture/",
      "dct:license": "https://creativecommons.org/licenses/by-sa/4.0/",
      "dct:publisher": {
        "@type": "foaf:Organization",
        "foaf:name": "GFDRR",
        "foaf:homepage": "https://www.gfdrr.org/"
      },
      "dcat:contactPoint": {
        "vcard:fn": "GFDRR",
        "vcard:hasEmail": "mailto:contact@riskdatalibrary.org"
      },
      "dct:spatial": [
        {
          "rdfs:label": "Afghanistan"
        }
      ],
      "dct:keyword": [ "Exposure" ],
      "dcat:distribution": [
        {
          "@type": "dcat:Distribution",
          "dct:title": "Afghanistan agriculture",
          "dcat:accessURL": "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/exposure/exp-afg-infrastructures.zip",
          "dct:format": "geotiff"
        }
      ]
    }
  ]
}
Additional datasets would be added to the array value of the dcat:dataset property.
I've not tried to map all of the RDL metadata into that schema yet, but it's not clear if other catalogues could harvest and import this anyway.
Which one will be recommended for harvesting by the WB DHH?
I guess we have several endpoints because those are provided by JKAN by default, right?
If so, I would simply keep the one that is closest to the profile above.
I've not tried to map all of the RDL metadata into that schema yet, but it's not clear if other catalogues could harvest and import this anyway.
Agree that it is not the priority.
Thanks @ldodds
FYI - rdl-datasets.json is the canonical endpoint for RDL-JKAN.
Took me a while to remember where the datasets.json file came from. It is part of the base JKAN implementation and I intended to remove it at the earliest opportunity - I simply did not want to break the JKAN install while setting up the RDL instance.
I will mock up a GEODCAT endpoint shortly.
Hi @ldodds @matamadio
Attached is an example rdl-geodcat.json file that is auto-generated based on the included/available datasets.
The only attribute I've left alone is dct:identifier - I guess this could be the URL as well, but that is not a stable solution for the reasons outlined earlier (change of platform/service provider, etc.).
[EDIT: here I am referring to our conversation with @jeanpommier via email on the possible datapackage spec, circa 8 April 2021]
I can deploy this implementation if this example is adequate, and a solution to the above is decided on.
rdl-geodcat.txt
[EDIT: File is provided as a .txt file as GitHub won't let me upload a JSON file]
Review alignment of metadata with HDX schema (also using DCAT)
Turns out it was faster for me to hack together a Liquid template to generate the entries.
The approach is to extract the metadata from the JKAN datasets, which are structured according to the schema specification, and manually map entries to DCAT fields where JKAN does not already do so. Note that this doesn't pull the data out of the pages, so in the long term inadvertent inconsistencies between the page-embedded metadata and the endpoint may occur.
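For illustration, a minimal sketch of what such a Liquid loop could look like. It assumes the dataset pages are exposed as the site.datasets collection with the base JKAN front matter fields (title, notes, license); those names are assumptions for the sketch, not necessarily what the deployed template uses.

  {% comment %}
    Hypothetical sketch: emit one dcat:Dataset entry per JKAN dataset page.
    Field names (site.datasets, dataset.title, dataset.notes, dataset.license)
    follow the base JKAN front matter and may differ from the real template.
  {% endcomment %}
  "dcat:dataset": [
    {% for dataset in site.datasets %}
    {
      "@id": "{{ dataset.url }}",
      "@type": "dcat:Dataset",
      "dct:identifier": "{{ dataset.url }}",
      "dct:title": {{ dataset.title | jsonify }},
      "dct:description": {{ dataset.notes | jsonify }},
      "dcat:landingPage": "{{ dataset.url }}",
      "dct:license": "{{ dataset.license }}"
    }{% unless forloop.last %},{% endunless %}
    {% endfor %}
  ]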
@ldodds note that downloadURL is used in combination with dct:format and dcat:mediaType. format is extracted from the dataset entry, whereas mediaType is inferred from downloadURL. Really, all it does currently is prepend "application/" to the entry, unless the URL points to a bare CSV file.
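As a rough sketch of that inference (an assumption of how it could be written, not the deployed template): take the extension of the resource URL, prefix it with "application/", and special-case bare CSV files.

  {% comment %}
    Hypothetical sketch, intended to run inside a loop over a dataset's
    resources. resource.url and resource.format follow the JKAN resource
    structure; mapping bare CSV files to text/csv is an assumption.
  {% endcomment %}
  {% assign ext = resource.url | split: "." | last | downcase %}
  "dct:format": "{{ resource.format }}",
  {% if ext == "csv" %}
  "dcat:mediaType": "text/csv"
  {% else %}
  "dcat:mediaType": "application/{{ ext }}"
  {% endif %}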
Example for a multi-entry dataset page below. The latest full example is attached as a txt file.
{
  "@id": "/datasets/lss-mdg-mh/",
  "@type": "dcat:Dataset",
  "dct:identifier": "/datasets/lss-mdg-mh/",
  "dct:title": "Madagascar Multi-Hazard loss scenarios",
  "dct:description": "Direct loss simulated on exposed building asset measured as Average Annual Losses (AAL) and six Return Period scenarios for multiple hazards (earthquake, pluvial flood, storm surge and strong wind).",
  "dcat:landingPage": "/datasets/lss-mdg-mh/",
  "dct:license": "https://creativecommons.org/licenses/by/4.0/",
  "dct:publisher": {
    "@type": "foaf:Organization",
    "foaf:name": "GFDRR",
    "foaf:homepage": "https://www.gfdrr.org/"
  },
  "dcat:contactPoint": {
    "vcard:fn": "GFDRR",
    "vcard:hasEmail": "contact@riskdatalibrary.org"
  },
  "dct:spatial": [
    {
      "rdfs:label": "Madagascar"
    }
  ],
  "dct:keyword": ["Loss"],
  "dcat:distribution": [
    {
      "@type": "dcat:Distribution",
      "dct:title": "Madagascar multi-hazard loss scenarios",
      "dct:format": "gpkg",
      "dcat:downloadURL": "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/loss/lss-mdg-mh.gpkg",
      "dcat:mediaType": "application/gpkg"
    },
    {
      "@type": "dcat:Distribution",
      "dct:title": "Madagascar multi-hazard loss exceedence-probability curves",
      "dct:format": "csv",
      "dcat:downloadURL": "https://rdl-jkan-datasets.s3-ap-southeast-2.amazonaws.com/loss/lss-mdg-mh-epc.zip",
      "dcat:mediaType": "application/zip"
    }
  ]
}
The metadata in the catalogue will be harvested by the World Bank Data Hub. While they can write a custom importer, the preference is to have a DCAT-compliant endpoint that provides the necessary metadata along with stable identifiers for individual datasets.
At the moment there are several JSON endpoints.
We need to agree: