How to store and serve metadata

matamadio commented 3 years ago

Also see previous discussion: https://github.com/GFDRR/rdl-website/issues/80

This is a critical point for RDL and has been discussed on several occasions. I thought it would be useful to summarise the status and the options, with pros and cons, to discuss further with Leigh.

We need two types of solutions:

A) One quick solution for JKAN MVP (already prototyped)

Because there is no DB and the catalogue just links to zip files stored in S3, the metadata is stored in a text file within the zip. The file links to the JKAN schema, which has been modified to mirror the RDL attributes; the user can read the same file in a text editor, but has to consult the documentation for an explanation of the fields.

[//]: # (Model fields)
model_name: Name of source model
model_description: "Descriptor"
model_hazard_type: "EQ"
model_process_type: "PRO"
model_hazard_link: ""
model_exposure_link: ""
model_vulnerability_link: ""
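
As a rough illustration of how such a file could be read back out of a dataset zip, here is a minimal Python sketch. It assumes the metadata is a flat list of `key: value` lines like the ones above; the member name `metadata.md` and both helper functions are hypothetical, not part of JKAN or RDL.

```python
import io
import zipfile


def parse_metadata(text):
    """Parse flat `key: value` lines into a dict, skipping blanks and comments."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and Markdown comment lines such as "[//]: # (Model fields)".
        if not line or line.startswith("[//]:"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key: value line
        fields[key.strip()] = value.strip().strip('"')
    return fields


def read_metadata_from_zip(zip_bytes, member="metadata.md"):
    """Read and parse the metadata file inside a dataset zip (member name is assumed)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return parse_metadata(archive.read(member).decode("utf-8"))


# Build a small in-memory zip that stands in for a file downloaded from S3.
sample = (
    "[//]: # (Model fields)\n"
    "model_name: Name of source model\n"
    'model_hazard_type: "EQ"\n'
    'model_hazard_link: ""\n'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("metadata.md", sample)

metadata = read_metadata_from_zip(buffer.getvalue())
print(metadata)
```

Splitting on the first colon only keeps values that themselves contain colons (for example URLs) intact.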

B) One optimal solution for next redesign

Points of discussions:

1. What is data, what is metadata

Not a trivial question. In the PostgreSQL, CSV-based implementation of RDL, schema attributes and data tables were stored and indexed together, so there was no clear separation between data and metadata. Moving towards a file-based approach but with a dynamic DB, we separate what is data (stored in S3) from what is metadata (stored in the DB, indexed and searchable). The metadata would strictly mirror the attributes of the RDL schema.
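
To make the split concrete, here is a small sketch of a searchable metadata index whose rows only point at the data files. It uses SQLite in memory purely for illustration; the table name, column names, bucket URLs and the `FL` hazard code are all assumptions, not the actual RDL schema.

```python
import sqlite3

# In-memory database standing in for the metadata DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset_metadata (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        hazard_type TEXT,
        data_url TEXT NOT NULL  -- pointer to the data file; the data itself stays in S3
    )
    """
)
# Index the attribute users are expected to search on.
conn.execute("CREATE INDEX idx_hazard_type ON dataset_metadata (hazard_type)")

# Hypothetical records: only metadata is stored here.
rows = [
    ("Example earthquake model", "EQ", "s3://example-bucket/eq-model.zip"),
    ("Example flood model", "FL", "s3://example-bucket/fl-model.zip"),
]
conn.executemany(
    "INSERT INTO dataset_metadata (title, hazard_type, data_url) VALUES (?, ?, ?)",
    rows,
)


def find_by_hazard(connection, hazard_type):
    """Return (title, data_url) pairs for datasets of the given hazard type."""
    cursor = connection.execute(
        "SELECT title, data_url FROM dataset_metadata "
        "WHERE hazard_type = ? ORDER BY title",
        (hazard_type,),
    )
    return cursor.fetchall()


results = find_by_hazard(conn, "EQ")
print(results)
```

The search never touches the zip files; a client only follows `data_url` once it has picked a dataset.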

2. Adoption of standards for creating and exchanging metadata

There are several examples of standard metadata profiles, and there's actually a standard on how to create them (ISO 19106). These standards can cover various levels and types of information. Here are also some examples for geo and non-geo metadata profiles/extensions in different domains: https://rd-alliance.github.io/metadata-directory/extensions/ It has been proposed to create our own ISO-based metadata profile for RDL as it was done for INSPIRE. See section 4.1 in this paper for some examples including NAP and INSPIRE https://www.mdpi.com/2220-9964/8/6/280

Stu has produced a review of existing standards: https://drive.google.com/file/d/1ksCfm4OVgwKUn50e2eQ56lgqQ1yx7vjq/view?usp=sharing

and tried to match existing ISO metadata schema to key attributes of RDL: https://drive.google.com/file/d/1Mtn2SRl8hSfemm1_-mjja89KSbnSE13p/view

Most GIS software can read the "core" metadata, which has machine-oriented but also human-friendly elements (abstract, POC, license, etc.). GFDRR invested quite a bit in improving how metadata is handled in GeoNode and QGIS, to make reading and editing more human-friendly, yet we are still far from an optimal implementation.

The preferred exchange format is XML, which comes in the same zip as the shp or tif. Example of ISO metadata file. Example of DCAT metadata file.

More often than not, risk layers from WB projects come without any -interesting- XML, meaning that only basic GIS info is stored. Only a fraction of this information is human-understandable when the file is opened in a browser.
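
A quick way to spot such "empty" files could look like the sketch below. It assumes the XML follows the ISO 19139 encoding (`gmd`/`gco` namespaces) and checks only two human-friendly elements; the field list, the helper name and the sample document are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the ISO 19139 XML encoding (assumed here).
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Human-friendly elements to look for; extend as needed.
HUMAN_FIELDS = {
    "title": ".//gmd:CI_Citation/gmd:title/gco:CharacterString",
    "abstract": ".//gmd:abstract/gco:CharacterString",
}


def missing_human_fields(xml_text):
    """Return the names of human-friendly fields that are absent or empty."""
    root = ET.fromstring(xml_text.strip())
    missing = []
    for name, path in HUMAN_FIELDS.items():
        node = root.find(path, NS)
        if node is None or not (node.text or "").strip():
            missing.append(name)
    return missing


# Hypothetical metadata file with a title but an empty abstract.
SAMPLE_XML = """
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:citation>
        <gmd:CI_Citation>
          <gmd:title><gco:CharacterString>Flood depth</gco:CharacterString></gmd:title>
        </gmd:CI_Citation>
      </gmd:citation>
      <gmd:abstract><gco:CharacterString></gco:CharacterString></gmd:abstract>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
</gmd:MD_Metadata>
"""

print(missing_human_fields(SAMPLE_XML))
```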

--WIP--

matamadio commented 3 years ago

A) MVP

Each metadata file (related to one or more downloads) is saved as name-of-data.md in the _dataset folder. These files can be edited directly from the JKAN admin to fix errors, and can also be created directly via JKAN ("add dataset" is currently under development).

Therefore, adding a copy of the .md file to the corresponding dataset zip at upload time is not optimal, as the file could change later; and collating the .md into the zip at download time is not easily achievable on GH-JKAN. The proposed solution is a "Download metadata" button on each dataset page. The button downloads a copy of the latest version of the .md file.

matamadio commented 3 years ago

See how it is possible to match our general attributes with Dublin Core: https://en.wikipedia.org/wiki/Dublin_Core @cgiovando

matamadio commented 3 years ago

The contribution table now matches 13 of the 15 basic Dublin Core (DC) attributes, although the RDL field names differ from the DC ones.

| RDS ATTRIBUTES | DUBLINCORE EQUIVALENT |
| --- | --- |
| title | 14. Title – “A name given to the resource.” |
| abstract | 5. Description – “An account of the resource.” |
| component | 15. Type – “The nature or genre of the resource.” |
| organization | 9. Publisher – “An entity responsible for making the resource available.” |
| model_source | 3. Creator – “An entity primarily responsible for making the resource.” |
| model_date | 4. Date – “A point or period of time associated with an event in the lifecycle of the resource.” |
| version | |
| purpose | 13. Subject – “The topic of the resource.” |
| project | 12. Source – “A related resource from which the described resource is derived.” |
| notes | |
| biblio_title | |
| biblio_url | |
| geo_coverage | 2. Coverage – “The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.” |
| publish | |
| license_code | 11. Rights – “Information about rights held in and over the resource.” |
| maintainer | 1. Contributor – “An entity responsible for making contributions to the resource.” |
| maintainer_email | |
| resources_fields | Multiple resources can be added to the same contribution. |
| name | 10. Relation – “A related resource.” |
| url | |
| format | 6. Format – “The file format, physical medium, or dimensions of the resource.” |

Two DC attributes are not yet used, which is not a problem since DC defines all attributes as optional.

Identifier – “An unambiguous reference to the resource within a given context.” That would correspond to ID in database, or permalink to the page.
Language – “A language of the resource.” Not always relevant; the standard for table data is English.
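
The mapping discussed in this comment can be written down as a simple lookup. The sketch below is only an illustration: the dictionary mirrors the 13 matched attributes, while the function name and the sample record values are hypothetical.

```python
# RDL field name -> Dublin Core element, following the 13 matched attributes.
RDL_TO_DC = {
    "title": "title",
    "abstract": "description",
    "component": "type",
    "organization": "publisher",
    "model_source": "creator",
    "model_date": "date",
    "purpose": "subject",
    "project": "source",
    "geo_coverage": "coverage",
    "license_code": "rights",
    "maintainer": "contributor",
    "name": "relation",
    "format": "format",
}


def to_dublin_core(record):
    """Translate an RDL record to DC elements, dropping unmapped or empty fields."""
    dc = {}
    for field, value in record.items():
        dc_name = RDL_TO_DC.get(field)
        if dc_name is not None and value not in (None, ""):
            dc[dc_name] = value
    return dc


# Hypothetical contribution record; "version" and "notes" have no DC equivalent.
record = {
    "title": "Example earthquake model",
    "abstract": "Descriptor",
    "version": "1.0",
    "license_code": "CC-BY-4.0",
    "notes": "",
}
dc_record = to_dublin_core(record)
print(dc_record)
```

Since every DC element is optional, fields without an equivalent are simply left out of the output.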

matamadio commented 3 years ago

This whole discussion is superseded by the work of @ldodds and Jean on aligning metadata to DCAT. See https://github.com/GFDRR/rdl-standard/issues/7
