GFDRR / rdl-standard

The Risk Data Library Standard (RDLS) is an open data standard to make it easier to work with disaster and climate risk data. It provides a common description of the data used and produced in risk assessments, including hazard, exposure, vulnerability, and modelled loss, or impact, data.
https://docs.riskdatalibrary.org/
Creative Commons Attribution Share Alike 4.0 International

[Documentation] DRAFT input #42

Closed matamadio closed 1 year ago

matamadio commented 1 year ago

Issue to draft the content of next documentation while it's being restructured.

SEE ONLINE PREVIEW


Data preparation

This section is needed to give guidance on dataset formats and packaging.

Data structure

File formats

Check Geonode guidance instructions on how to optimise geodata (compression, pyramids)

Packaging as resources

stufraser1 commented 1 year ago

Suggestions on the packaging of resources / hierarchy within datasets https://docs.google.com/document/d/1PgL4AYvAJVQ74TMceCeGbdEkzphCnh1nMqCszFk9T_0/edit?usp=share_link

matamadio commented 1 year ago

Started drafting a condensed version of the shared doc. @stufraser1 please edit this comment for small changes, otherwise if major changes please do in new comment.

Best practice for risk data packaging

The data structure and packaging of the output as obtained from the data analysts may not always align with the way we want users of the RiskDataLibrary to search and download data.

Datasets shared in risk catalogues (e.g. the Risk Data Library Collection) are provided as individual RESOURCES, which should be packed (grouped) according to two main criteria:

We also need to consider:

Where there are many resources for a dataset, there is a temptation to reproduce a folder structure in the Data Catalog. This does not make resources easy to access. Datasets and Resources should be set up so that a specific component of the analysis is easy to find, grouping resources together sensibly without creating problematically large downloads.

Decisions on how to structure risk data should be taken on a project-by-project basis, because there is a wide variety of how data are structured depending on the components of a project. However, here are a few examples:


Hazard data

Format / data types

Hazard data includes:

Generally, hazard data (footprints) takes the form of raster (geo grid) data (GeoTIFF / COG). Supporting data (hazard curves, historical catalogue) often come as tables (csv, xlsx) or vector data (gpkg, shp). They can also be packaged in a similar fashion.

Thematic grouping

The main thematic groupings in hazard data are:

Geographic grouping

In general, splitting raster datasets into smaller parts is not advised, according to the self-dependency and completeness criteria. If splitting is required for data efficiency, always use a larger extent than needed, so as to avoid cross-border artefacts.

[FIGURE EXAMPLE: BORDER CLIP vs EXTENT CLIP OF GLOBAL LAYER ON A COUNTRY]
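A minimal sketch of the "larger extent than needed" advice: instead of clipping exactly to the country's bounding box, pad the box on every side before clipping. The function name, the 10% default and the sample coordinates are assumptions made for illustration, not part of the guidance.

```python
def buffered_extent(bounds, buffer_fraction=0.1):
    """Expand a (min_x, min_y, max_x, max_y) bounding box on every side.

    buffer_fraction is the share of the box width/height added to each side.
    """
    min_x, min_y, max_x, max_y = bounds
    if min_x >= max_x or min_y >= max_y:
        raise ValueError("bounds must be (min_x, min_y, max_x, max_y) with min < max")
    pad_x = (max_x - min_x) * buffer_fraction
    pad_y = (max_y - min_y) * buffer_fraction
    return (min_x - pad_x, min_y - pad_y, max_x + pad_x, max_y + pad_y)


# Hypothetical country bounding box in degrees (lon/lat).
country_bounds = (10.0, 40.0, 20.0, 45.0)
clip_window = buffered_extent(country_bounds, buffer_fraction=0.1)
```

The resulting window can then be passed to whichever clipping tool is in use, so that cells along the border are kept whole.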

Packaging recommendation

We recommend grouping hazard data in the following hierarchy:

NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY

For example:

* Dataset: <project name> hazard data
   * Dataset: <project name> <hazard1> RP maps
      * Dataset: <project name> <hazard1> data <country1>
         * Zipped Resource: <hazard1> <country1>_2020
         * Zipped Resource: <hazard1> <country1>_2050
         * Zipped Resource: <hazard1> <country1>_2080
      * Dataset: <project name> <hazard1> data <country2>
         * Zipped Resource: <hazard1> <country2>_2020
         * Zipped Resource: <hazard1> <country2>_2050
         * Zipped Resource: <hazard1> <country2>_2080
   * Dataset: <project name> <hazard2> RP maps
      * …
         * …
   * Dataset: <project name> <hazard1> historical catalog
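
The hierarchy above could be derived from a flat list of layers with a sketch like the following. The tuple layout, the `.zip` naming pattern and the sample file names are hypothetical; each project would adapt them to its own conventions.

```python
from collections import defaultdict


def group_into_resources(layers):
    """Group raster layers into datasets (per hazard and country)
    and zipped resources (per period).

    layers: iterable of (hazard, country, period, filename) tuples.
    Returns {dataset_title: {zip_name: [filenames]}}.
    """
    datasets = defaultdict(lambda: defaultdict(list))
    for hazard, country, period, filename in layers:
        dataset_title = f"{hazard} data {country}"
        zip_name = f"{hazard}_{country}_{period}.zip"
        datasets[dataset_title][zip_name].append(filename)
    # Convert nested defaultdicts to plain dicts for predictable output.
    return {title: dict(resources) for title, resources in datasets.items()}


# Hypothetical return-period (RP) maps for one hazard and one country.
layers = [
    ("flood", "countryA", 2020, "flood_countryA_2020_RP10.tif"),
    ("flood", "countryA", 2020, "flood_countryA_2020_RP100.tif"),
    ("flood", "countryA", 2050, "flood_countryA_2050_RP10.tif"),
]
packaging = group_into_resources(layers)
```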

Exposure data

Format

Exposure geospatial data can take the form of vector (gpkg, shp) or raster (GeoTIFF / COG) data. In some cases, exposure comes as a table (csv, xls).

[EXAMPLE PIC FOR EACH FORMAT]

GeoPackage (`.gpkg`) is preferred over shapefile (`.shp`) for vector data. Conversion from `.shp` to `.gpkg` is lossless and usually size-efficient. Where the shapefile format is maintained, the dataset should be provided as a zip archive containing all the components of the shapefile (.shp, .shx, .dbf, .prj, etc.).
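
Where a shapefile has to be kept, the zip archive can be built with the Python standard library. This is only a sketch: the function name is hypothetical, and it assumes all components sit in the same folder and share the shapefile's base name.

```python
import zipfile
from pathlib import Path


def zip_shapefile(shp_path, zip_path=None):
    """Zip a .shp file together with every sidecar file sharing its base name."""
    shp_path = Path(shp_path)
    zip_path = Path(zip_path) if zip_path else shp_path.with_suffix(".zip")
    # Sidecars share the stem, e.g. roads.shp, roads.shx, roads.dbf, roads.prj.
    components = sorted(
        p for p in shp_path.parent.glob(shp_path.stem + ".*")
        if p.is_file() and p != zip_path
    )
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for component in components:
            # Store files flat, without the local folder structure.
            archive.write(component, arcname=component.name)
    return zip_path
```

For example, `zip_shapefile("data/roads.shp")` would write `data/roads.zip` containing every `roads.*` file found next to the shapefile.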
Read more: (link to format page - next post)

Thematic grouping

The main thematic groupings in exposure data are:

Geographic grouping

In general, splitting raster datasets into smaller parts is not advised, according to the self-dependency and completeness criteria. If splitting is required for data efficiency, always use a larger extent than needed, so as to avoid cross-border artefacts.

Packaging recommendation

We recommend grouping exposure data in the following hierarchy:

NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY

For example:

* Dataset: <project name> exposure data
   * Dataset: <project name> <country1> exposure data
      * Dataset: <project name> <country1> exposure data - 2020
         * Resource: <country1>_2020_exposure_RES
         * Resource: <country1>_2020_exposure_COM
         * Resource: <country1>_2020_exposure_EDU
         * Resource: <country1>_2020_exposure_ROAD
      * Dataset: <project name> <country1> exposure data - 2050
         * Resource: <country1>_2050_exposure_RES
         * Resource: <country1>_2050_exposure_COM
         * Resource: <country1>_2050_exposure_EDU
         * Resource: <country1>_2050_exposure_ROAD
   * Dataset: <project name> <country2> exposure data
      * …
         * …

Vulnerability data

Format

Vulnerability data are usually provided as table data (csv, xls) containing the impact model function and parameters. Often, vulnerability models are proprietary data and only shared as pictures; this has low reusability and should be avoided. Always try to obtain a mathematical description for this component.
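
To illustrate why a tabular description is more reusable than a picture, here is a sketch that reads a curve from CSV text and evaluates it by linear interpolation. The column names (`intensity`, `damage_ratio`) and the sample values are invented for the example; they are not a prescribed schema.

```python
import csv
import io


def load_curve(csv_text):
    """Return (intensity, damage_ratio) points sorted by intensity."""
    reader = csv.DictReader(io.StringIO(csv_text))
    points = sorted((float(r["intensity"]), float(r["damage_ratio"])) for r in reader)
    if len(points) < 2:
        raise ValueError("a curve needs at least two points")
    return points


def damage_ratio(points, intensity):
    """Linearly interpolate the curve; clamp to the end values outside its range."""
    if intensity <= points[0][0]:
        return points[0][1]
    if intensity >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= intensity <= x1:
            return y0 + (y1 - y0) * (intensity - x0) / (x1 - x0)


# Hypothetical curve: hazard intensity vs. damage ratio.
SAMPLE = "intensity,damage_ratio\n0.0,0.0\n0.5,0.1\n1.0,0.4\n2.0,1.0\n"
curve = load_curve(SAMPLE)
ratio = damage_ratio(curve, 0.75)  # midway between 0.1 and 0.4
```

A curve shared only as an image would have to be digitised by hand before any such calculation could be repeated.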

Thematic grouping

The main thematic groupings in vulnerability data are:

Geographic grouping

Vulnerability curves may be developed for individual countries or environments within a project. Where this is the case, this grouping should be retained.

Packaging recommendation

We recommend grouping vulnerability data in the following hierarchy:

Note that this hierarchy should be maintained even when packing all the data in one file, e.g. as multiple sheets of an Excel file.

[EXAMPLE OF MULTIPLE IMPACT MODELS IN ONE FILE]

NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY

For example:

* Dataset: <project name> vulnerability data
   * Dataset: <project name> <hazard1> vulnerability data
      * Resource: <hazard1>_RES_timber
      * Resource: <hazard1>_RES_RC
      * Resource: <hazard1>_COM_steel
      * Resource: <hazard1>_COM_RM
      * Or Resource: <hazard1>_vulnerability_curves_all_types (if data all in one file)
   * Dataset: <project name> <hazard2> vulnerability data
      * …
      * …

Loss data

Format

Loss data often comes in the form of:

Thematic grouping

The main thematic groupings in loss data are:

Geographic grouping

Losses are usually aggregated at national or subnational administrative level (ADM0, ADM1, or ADM2). Losses can also be provided per asset (e.g. individual buildings or raster footprints), but this is unusual, even though analysts often generate these files.
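
As a sketch of the aggregation step, per-asset losses can be summed to an administrative level with a few lines of Python. The record keys (`ADM1`, `loss`) and the sample values are assumptions for illustration only.

```python
from collections import defaultdict


def aggregate_losses(asset_losses, level="ADM1"):
    """Sum per-asset losses by the administrative unit stored under `level`."""
    totals = defaultdict(float)
    for record in asset_losses:
        totals[record[level]] += record["loss"]
    return dict(totals)


# Hypothetical per-asset results as produced by an analyst.
assets = [
    {"ADM1": "North", "loss": 1200.0},
    {"ADM1": "North", "loss": 300.0},
    {"ADM1": "South", "loss": 450.0},
]
adm1_losses = aggregate_losses(assets)
```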

Packaging recommendation

We recommend grouping loss data in the following hierarchy:

NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY

For example:

* Dataset: <project name> loss data
   * Dataset: <project name> <hazard1> loss data 2020
      * Resource: <hazard1>_RES
      * Resource: <hazard1>_COM
      * Resource: <hazard1>_AllSectors
   * Dataset: <project name> <hazard2> loss data 2020
      * …
      * …
   * Dataset: <project name> <hazard1> loss data 2050
   * Dataset: <project name> <hazard1> loss data 2080
matamadio commented 1 year ago

Data formats

Risk data can be made of spatial or non-spatial data.

Recommended geodata formats

Vector data: GeoPackage

GeoPackage (.gpkg) is an open, non-proprietary, platform-independent container format built on an SQLite 3 database. It is an OGC standard and is supported by QGIS and GDAL. It is similar to an ESRI geodatabase, but more responsive. A single GeoPackage file can store vector data and attributes, symbology, pyramids, and table data as individual layers. It can also store rasters, but its support for raster data is still limited and we don't recommend storing rasters as GeoPackage. It supports SQL and database APIs, which makes it a good fit for web applications, and it can be exported to PostGIS. Unlike shapefile, it imposes no practical limit on the number of attributes, attribute name length, or file size. Internal metadata specifications are under development.
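
Because a GeoPackage is an SQLite database, its layers can be listed with Python's built-in `sqlite3` module by querying the `gpkg_contents` table defined by the GeoPackage standard. The function name and the file name in the usage note are hypothetical.

```python
import sqlite3


def list_gpkg_layers(connection):
    """Return (table_name, data_type) for every layer registered in a GeoPackage.

    data_type is e.g. 'features' for vector layers or 'attributes' for plain tables.
    """
    rows = connection.execute(
        "SELECT table_name, data_type FROM gpkg_contents ORDER BY table_name"
    )
    return [(table_name, data_type) for table_name, data_type in rows]
```

For example, `list_gpkg_layers(sqlite3.connect("exposure.gpkg"))` would return one entry per layer stored in that file.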

Raster data: GeoTIFF / COG (.tif)

GeoTIFF (.tif) is the standard image file format for GIS and satellite remote sensing applications. It can store multiple realisations as “bands”. GeoTIFFs can be accompanied by auxiliary files (.tfw for raster geolocation, .xml for metadata, .aux for projections and other information, .ovr for pyramids to improve visualisation). These should be packed together with the .tif files in a zip for sharing. A Cloud Optimized GeoTIFF (COG) is a regular GeoTIFF file, intended to be hosted on an HTTP file server, with an internal organisation that enables more efficient workflows in the cloud. It does this by letting clients issue HTTP GET range requests to ask for just the parts of a file they need. This is the best option for data that needs to be hosted on a geocatalogue such as GeoNode.
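
The range-request mechanism can be sketched with the standard library alone. The URL below is a placeholder and the 16 KiB header size is an arbitrary assumption; in practice a COG-aware reader such as GDAL issues these requests on the user's behalf.

```python
import urllib.request


def range_request(url, start, length):
    """Build an HTTP GET request for `length` bytes starting at byte `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical COG location; ask only for the first 16 KiB of the file.
request = range_request("https://example.com/hazard/flood_rp100.tif", 0, 16384)

# Sending it requires a server that supports range requests
# (it would answer with status 206 Partial Content):
# with urllib.request.urlopen(request) as response:
#     header_bytes = response.read()
```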

Supported geodata formats

Vector data

Raster data

Recommended non-spatial formats

matamadio commented 1 year ago

WB data catalogue (DDH): update workflow

General

The Risk Data Library Collection sits within the World Bank Data Catalog and is meant to store standard risk data. The collection can be accessed from the collections page, or used as a filter in the left bar to search for data within the collection.

Adding datasets

Datasets can be submitted for review and publication on the Data Catalog by any World Bank Staff, ETC or STC. These people have the role of ‘Data depositor’.

Two approaches to upload data:

Datasets can be added to the RDL Collection by the RDL team, after approval.

Individual datasets

When all required (and optional) information has been entered, click on ‘Save as draft’. The dataset will appear in your datasets list.

Bulk upload

In cases where large volumes of project data should be uploaded, DDH team can assist with bulk upload. The workflow steps are:

  1. Store project data in folders on OneDrive.
  2. Create an Excel spreadsheet that lists, for each dataset, the data type, the dataset name, the URL to the data, and the URL to the prepared JSON metadata.
  3. Describe the data structure to be achieved on DDH.
  4. DDH team will copy the data and metadata to DDH Sharepoint.
  5. DDH team will use scripts to upload datasets; these will appear in your ‘My Datasets’ for review and any further editing.
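
The listing described in step 2 could be drafted as a CSV file, which opens directly in Excel. This is a sketch only: the column names and the sample row are assumptions, and the DDH team may expect a different layout.

```python
import csv
import io

# Hypothetical column names for the bulk-upload listing.
MANIFEST_COLUMNS = ["data_type", "dataset_name", "data_url", "metadata_url"]


def build_manifest(rows):
    """Return CSV text with one line per dataset; reject rows with empty fields."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=MANIFEST_COLUMNS)
    writer.writeheader()
    for row in rows:
        missing = [column for column in MANIFEST_COLUMNS if not row.get(column)]
        if missing:
            raise ValueError(f"missing values for: {', '.join(missing)}")
        writer.writerow(row)
    return buffer.getvalue()


# Placeholder entry; real URLs would point to the OneDrive folders from step 1.
manifest = build_manifest([
    {
        "data_type": "hazard",
        "dataset_name": "Example project flood RP maps",
        "data_url": "https://example.com/data/flood.zip",
        "metadata_url": "https://example.com/metadata/flood.json",
    }
])
```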

Adding RDL metadata

Contacts

RDL Team

DDH Team

The DDH team is responsible for reviewing and publishing submitted datasets and, in the short term, for assigning datasets to the RDL Collection during review.

matamadio commented 1 year ago

The content posted has been included in current draft. Please refer to repo version for any comment or edit.