GFDRR / rdl-standard

The Risk Data Library Standard (RDLS) is an open data standard to make it easier to work with disaster and climate risk data. It provides a common description of the data used and produced in risk assessments, including hazard, exposure, vulnerability, and modelled loss, or impact, data.
https://docs.riskdatalibrary.org/
Creative Commons Attribution Share Alike 4.0 International

[Schema] update dataset identifier description #184

Closed odscjen closed 11 months ago

odscjen commented 1 year ago

From a suggestion in https://github.com/GFDRR/rdls-spreadsheet-template/issues/3#issuecomment-1671200463, update the description of identifier from recommending use of a URL to recommending use of a project ID.

duncandewhurst commented 1 year ago

@matamadio can one project generate more than one dataset?

matamadio commented 1 year ago

> @matamadio can one project generate more than one dataset?

Yes, indeed it can. The project number would be used as a general ID to group related datasets, so it's not unique. Would the same happen using an HTTP URI?

duncandewhurst commented 1 year ago

Related issue: https://github.com/GFDRR/rdl-standard/issues/53

The current description of id specifies that the identifier should be unique:

> A unique identifier for the dataset. Use of an HTTP URI is recommended.

To conform to that description, publishers using an HTTP URI would need to ensure that it uniquely identifies an individual dataset, e.g. http://www.example.com/projects/1/datasets/1, rather than relating to many datasets, such as the URI of the web page for a project (e.g. http://www.example.com/projects/1) or of a list of datasets (e.g. http://www.example.com/projects/1/datasets).
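The uniqueness requirement lends itself to a quick check. As a minimal sketch in Python, assuming the RDLS metadata for each dataset has been loaded into a dictionary with an id key (the find_duplicate_ids helper and the sample records are hypothetical, not part of the standard):

```python
from collections import Counter


def find_duplicate_ids(datasets):
    """Return the sorted list of id values shared by more than one dataset."""
    counts = Counter(d["id"] for d in datasets)
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical records: two datasets wrongly reuse the project page URI.
datasets = [
    {"id": "http://www.example.com/projects/1", "title": "Hazard"},
    {"id": "http://www.example.com/projects/1", "title": "Exposure"},
    {"id": "http://www.example.com/projects/1/datasets/3", "title": "Loss"},
]

print(find_duplicate_ids(datasets))
```

Any id that the check reports relates to more than one dataset and so does not conform to the current description.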

I think this discussion points to a need to author some guidance on how to populate id.

I propose adding the following content to https://rdl-standard.readthedocs.io/en/dev/guides/metadata/#how-to-publish-rdls-metadata and to update the description of id to link to it.

@odscjen @matamadio @stufraser1 please let me know what you think. The final paragraph speaks to the case of creating RDLS metadata using the spreadsheet template before a dataset is added to the World Bank's Data Catalog.

How to assign a dataset identifier

You need to assign a unique identifier (id) to each dataset for which you are publishing RDLS metadata. The preferred approach is to use a persistent HTTP URI in accordance with Data on the Web Best Practices §8.7 Data Identifiers.

If you are authoring RDLS metadata for a dataset that is already uniquely identified by a persistent HTTP URI, you ought to set id to the existing HTTP URI for the dataset.

For example, the GHS-BUILT-S R2023A - GHS built-up surface grid, derived from Sentinel2 composite and Landsat, multitemporal (1975-2030) dataset is identified by the following URI in the publisher's data catalog: http://data.europa.eu/89h/9f06f36f-4b11-47ec-abb0-4f8b7b1d72ea. Therefore, in the RDLS metadata describing the dataset, id is set to the existing URI:

{
  "id": "http://data.europa.eu/89h/9f06f36f-4b11-47ec-abb0-4f8b7b1d72ea",
  "title": "GHS-BUILT-S R2023A - GHS built-up surface grid, derived from Sentinel2 composite and Landsat, multitemporal (1975-2030)"
}

If you are authoring RDLS metadata for a dataset that is not already uniquely identified by a persistent HTTP URI, you ought to generate a persistent HTTP URI for the dataset, for example by adding the dataset to a data catalog that assigns persistent HTTP URIs.

Otherwise, if you cannot generate a persistent HTTP URI for a dataset, for example, because you are authoring RDLS metadata before adding the dataset to a data catalog, you ought to set id to a globally unique identifier of your choice, such as a version 4 UUID. For more information, see [how to generate a universally unique identifier]().
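The order of preference above could be sketched in Python roughly as follows. This is only an illustration: assign_dataset_id is a hypothetical helper, and the code can only check that a value looks like an HTTP(S) URI. Whether the URI is persistent and identifies exactly one dataset remains the publisher's responsibility.

```python
import uuid
from typing import Optional
from urllib.parse import urlparse


def assign_dataset_id(existing_uri: Optional[str] = None) -> str:
    """Prefer an existing HTTP(S) URI; otherwise fall back to a version 4 UUID."""
    if existing_uri:
        parsed = urlparse(existing_uri)
        # Syntactic check only: persistence and uniqueness cannot be verified here.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return existing_uri
    # No usable URI yet (e.g. the dataset is not in a catalog): generate a UUID.
    return str(uuid.uuid4())


# Dataset already identified in a catalog: reuse its URI as the id.
print(assign_dataset_id("http://data.europa.eu/89h/9f06f36f-4b11-47ec-abb0-4f8b7b1d72ea"))

# Dataset not yet in a catalog: a random UUID is used instead.
print(assign_dataset_id())
```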

How to generate a universally unique identifier

If you are writing your own software or if you prefer to use the command line, several libraries and tools are available to generate universally unique identifiers (UUIDs), for example:

If you prefer to use a graphical user interface, several web-based tools are available, for example Online UUID Generator.
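As one illustration of the software route, Python's standard-library uuid module can generate a version 4 UUID. The sketch below is not an endorsement of a particular tool, and the is_uuid4 helper is a hypothetical name added here to show how a value might be checked before it is used as an id:

```python
import uuid


def is_uuid4(value: str) -> bool:
    """Return True if value parses as a UUID and reports version 4."""
    try:
        return uuid.UUID(value).version == 4
    except ValueError:
        return False


# Generate a random (version 4) UUID and render it in its canonical text form.
new_id = str(uuid.uuid4())
print(new_id)
print(is_uuid4(new_id))
```

The resulting string can be pasted directly into the id field of the RDLS metadata or the spreadsheet template.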

matamadio commented 1 year ago

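If I understand correctly, in the case of DDH-RDL collection that would mean either:

* first create the dataset entry (draft) and then assign it the (persistent?) id that is created by catalog.
* generate a random id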

duncandewhurst commented 1 year ago

> If I understand correctly, in the case of DDH-RDL collection that would mean either:
>
> * first create the dataset entry (draft) and then assign it the (persistent?) id that is created by catalog.
> * generate a random id

Yep, that's correct.

duncandewhurst commented 11 months ago

Closed by https://github.com/GFDRR/rdl-standard/pull/239

Edit: Update link to point to PR rather than issue.