GFDRR / rdl-data

Challenge Fund Database combining Hazard, Exposure, Loss and Vulnerability schema into a single database
GNU Affero General Public License v3.0

How to store and serve metadata #34

Closed matamadio closed 3 years ago

matamadio commented 3 years ago

Also see previous discussion: https://github.com/GFDRR/rdl-website/issues/80

This is a critical point for RDL and has been discussed on several occasions. I thought it would be useful to summarize the current status and the options, with pros and cons, for further discussion with Leigh.

We need two types of solutions:

A) One quick solution for JKAN MVP (already prototyped)

Because there is no DB and the catalogue just links to zip files stored on S3, the metadata is stored in a text file within the zip. The file links to the JKAN schema, which has been modified to mirror the RDL attributes; users can read the same file in a text editor, but have to consult the documentation for an explanation of the fields.

[//]: # (Model fields)
model_name: Name of source model
model_description: "Descriptor"
model_hazard_type: "EQ"
model_process_type: "PRO"
model_hazard_link: ""
model_exposure_link: ""
model_vulnerability_link: ""
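
As a rough illustration of how a client could read such a file, the sketch below builds a small zip in memory, then parses the flat `key: value` lines into a dictionary. The member name `metadata.txt` and the helper names are assumptions made for this example, not part of the JKAN prototype; the parser is deliberately naive (no YAML library) and only handles flat fields like the ones shown above.

```python
import io
import zipfile


def parse_flat_metadata(text):
    """Parse flat 'key: value' lines; skip blanks and markdown comments."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("[//]:") or ":" not in line:
            continue
        # Split on the first colon only, so values containing URLs survive.
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip().strip('"')
    return fields


def read_metadata_from_zip(zip_bytes, member="metadata.txt"):
    """Return parsed metadata from `member` (hypothetical file name)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return parse_flat_metadata(archive.read(member).decode("utf-8"))


def build_example_zip():
    """Create an in-memory zip so the example is self-contained."""
    sample = (
        "[//]: # (Model fields)\n"
        "model_name: Name of source model\n"
        'model_hazard_type: "EQ"\n'
        'model_process_type: "PRO"\n'
        'model_hazard_link: ""\n'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("metadata.txt", sample)
    return buffer.getvalue()


metadata = read_metadata_from_zip(build_example_zip())
print(metadata["model_hazard_type"])
```

Empty fields such as `model_hazard_link` come back as empty strings, which makes it easy to spot attributes that were left unfilled.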

B) One optimal solution for next redesign

Points of discussions:

1. What is data and what is metadata

Not a trivial question. In the PostgreSQL, csv-based implementation of RDL, both schema attributes and data tables were stored and indexed together, with no clear separation between data and metadata. Moving towards a file-based approach with a dynamic DB, we are separating what is data (stored in S3) from what is metadata (stored in the DB, indexed and searchable). The metadata would strictly mirror the attributes of the RDL schema.
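
A minimal sketch of this separation, under stated assumptions: SQLite stands in for whatever DB the redesign would use, the table and column names are hypothetical, and the `s3://example-bucket/...` URLs and the `FL` hazard code are placeholders. The metadata row is indexed and searchable, while the data itself is only referenced by its storage URL.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset_metadata (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        hazard_type TEXT,
        data_url TEXT NOT NULL  -- pointer to the file in object storage
    )
    """
)
# Index the attribute users are expected to search on.
conn.execute("CREATE INDEX idx_hazard_type ON dataset_metadata (hazard_type)")

records = [
    ("Example earthquake hazard map", "EQ", "s3://example-bucket/eq_hazard.zip"),
    ("Example flood hazard map", "FL", "s3://example-bucket/fl_hazard.zip"),
]
conn.executemany(
    "INSERT INTO dataset_metadata (title, hazard_type, data_url) VALUES (?, ?, ?)",
    records,
)


def find_by_hazard(connection, hazard_type):
    """Search the metadata only; the data files are never opened."""
    rows = connection.execute(
        "SELECT title, data_url FROM dataset_metadata WHERE hazard_type = ?",
        (hazard_type,),
    )
    return rows.fetchall()


matches = find_by_hazard(conn, "EQ")
print(matches)
```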

2. Adoption of standards for creating and exchanging metadata

There are several examples of standard metadata profiles, and there is actually a standard on how to create them (ISO 19106). These standards can cover various levels and types of information. Some examples of geo and non-geo metadata profiles/extensions in different domains are listed here: https://rd-alliance.github.io/metadata-directory/extensions/ It has been proposed to create our own ISO-based metadata profile for RDL, as was done for INSPIRE. See section 4.1 in this paper for some examples, including NAP and INSPIRE: https://www.mdpi.com/2220-9964/8/6/280

Stu has produced a review of existing standards: https://drive.google.com/file/d/1ksCfm4OVgwKUn50e2eQ56lgqQ1yx7vjq/view?usp=sharing

and tried to match the existing ISO metadata schema to the key attributes of RDL: https://drive.google.com/file/d/1Mtn2SRl8hSfemm1_-mjja89KSbnSE13p/view

Most GIS software will be able to read the "core" metadata, which has non-human, but also human-friendly elements (abstract, POC, license, etc). GFDRR invested quite a bit in improving how metadata is handled in GeoNode and QGIS, to make reading and editing more human-friendly, yet we are far from an optimal implementation.

The preferred format of exchange is xml, which comes in the same zip as the shp or tif. Example of ISO metadata file; example of DCAT metadata file.

More often than not, risk layers from WB projects come without any -interesting- xml, meaning that only basic GIS info is stored. Only a fraction of this information is human-understandable when opening the file in a browser.
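
To make the "empty xml" problem concrete, here is a hedged sketch that checks whether an ISO 19139-style metadata file carries any human-readable title and abstract. It assumes the file uses the usual `gmd`/`gco` namespaces; the sample document and the function name are invented for illustration, and real files from WB projects may be structured differently.

```python
import xml.etree.ElementTree as ET

# Namespaces used by ISO 19139 XML encodings of ISO 19115 metadata.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Invented sample: a title is present, but the abstract is left empty.
SAMPLE_XML = """
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:citation>
        <gmd:CI_Citation>
          <gmd:title><gco:CharacterString>Example layer</gco:CharacterString></gmd:title>
        </gmd:CI_Citation>
      </gmd:citation>
      <gmd:abstract><gco:CharacterString></gco:CharacterString></gmd:abstract>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
</gmd:MD_Metadata>
"""


def human_readable_fields(xml_text):
    """Return the text of selected human-friendly elements ('' if absent)."""
    root = ET.fromstring(xml_text)
    paths = {
        "title": ".//gmd:title/gco:CharacterString",
        "abstract": ".//gmd:abstract/gco:CharacterString",
    }
    found = {}
    for name, path in paths.items():
        element = root.find(path, NS)
        found[name] = (element.text or "").strip() if element is not None else ""
    return found


fields = human_readable_fields(SAMPLE_XML)
missing = [name for name, text in fields.items() if not text]
print("Missing human-readable fields:", missing)
```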

--WIP--

matamadio commented 3 years ago

A) MVP

Each metadata file (related to one or more downloads) is saved as name-of-data.md in the _dataset folder. These files can be edited directly from JKAN admin to fix errors, and can also be created directly via JKAN ("add dataset" is now under development).

Therefore, adding a copy of the .md file to the corresponding dataset zip at upload time is not optimal, as the metadata could change later; and merging the .md into the zip at download time is not easily achievable on GH-JKAN. The proposed solution is to have a "Download metadata" button on each dataset page, which downloads a copy of the latest version of the .md file.

See also https://github.com/GFDRR/rdl-jkan/issues/7

matamadio commented 3 years ago

See how our general attributes could be matched with Dublin Core: https://en.wikipedia.org/wiki/Dublin_Core @cgiovando

matamadio commented 3 years ago

The contribution table now matches 13 of the 15 basic Dublin Core attributes, although the RDL field names differ from the DC ones.

| RDS ATTRIBUTES | DUBLINCORE EQUIVALENT |
|---|---|
| title | 14. Title – “A name given to the resource.” |
| abstract | 5. Description – “An account of the resource.” |
| component | 15. Type – “The nature or genre of the resource.” |
| organization | 9. Publisher – “An entity responsible for making the resource available.” |
| model_source | 3. Creator – “An entity primarily responsible for making the resource.” |
| model_date | 4. Date – “A point or period of time associated with an event in the lifecycle of the resource.” |
| version | |
| purpose | 13. Subject – “The topic of the resource.” |
| project | 12. Source – “A related resource from which the described resource is derived.” |
| notes | |
| biblio_title | |
| biblio_url | |
| geo_coverage | 2. Coverage – “The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.” |
| publish | |
| license_code | 11. Rights – “Information about rights held in and over the resource.” |
| maintainer | 1. Contributor – “An entity responsible for making contributions to the resource.” |
| maintainer_email | |
| resources_fields | (Multiple resources can be added to the same contribution.) |
| name | 10. Relation – “A related resource.” |
| url | |
| format | 6. Format – “The file format, physical medium, or dimensions of the resource.” |

Two DC attributes (7. Identifier and 8. Language) are not yet used, which is not a problem since DC defines all attributes as optional.
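
The mapping can also be written down as a simple lookup, which makes it easy to check which DC elements are still uncovered. This is only a sketch: the dictionary mirrors the matches listed in this comment, while the function name and the sample record are assumptions for illustration.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = [
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
]

# RDL field name -> Dublin Core element, following the matches listed above.
RDL_TO_DC = {
    "title": "title",
    "abstract": "description",
    "component": "type",
    "organization": "publisher",
    "model_source": "creator",
    "model_date": "date",
    "purpose": "subject",
    "project": "source",
    "geo_coverage": "coverage",
    "license_code": "rights",
    "maintainer": "contributor",
    "name": "relation",
    "format": "format",
}


def to_dublin_core(record):
    """Rename mapped RDL fields to DC elements; unmapped fields are dropped."""
    return {RDL_TO_DC[key]: value for key, value in record.items() if key in RDL_TO_DC}


# DC elements with no RDL counterpart yet.
unused = sorted(set(DC_ELEMENTS) - set(RDL_TO_DC.values()))
print(unused)

# Hypothetical record: 'version' has no DC equivalent and is left out.
example = to_dublin_core(
    {"title": "Example dataset", "abstract": "Example description", "version": "1"}
)
print(example)
```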

matamadio commented 3 years ago

This whole discussion is superseded by the work of @ldodds and Jean on aligning the metadata to DCAT. See https://github.com/GFDRR/rdl-standard/issues/7