Issue to draft the content of next documentation while it's being restructured.

Data preparation

This section is needed to give guidance on dataset formats and packaging.

Data structure

Folder structure and naming conventions for files and attributes. Check whether any of the old content is still useful: https://docs.riskdatalibrary.org/local.html

File formats

Adopted standard formats
- Geodata: tiff (cog), gpkg
- Table data: xlsx, csv

Check the GeoNode guidance on how to optimise geodata (compression, pyramids)

Packaging as resources

Grouping data together: specify level of aggregation depending on the data

Suggestions on the packaging of resources / hierarchy within datasets https://docs.google.com/document/d/1PgL4AYvAJVQ74TMceCeGbdEkzphCnh1nMqCszFk9T_0/edit?usp=share_link

Started drafting a condensed version of the shared doc. @stufraser1 please edit this comment for small changes, otherwise if major changes please do in new comment.

Best practice for risk data packaging

The data structure and packaging of the output as obtained from the data analysts may not always align with the way we want users of the RiskDataLibrary to search and download data.

Datasets shared in risk catalogues (e.g. the Risk Data Library Collection) are provided as individual RESOURCES, which should be packaged (grouped) according to two main criteria:

GEOGRAPHY: For example, in a regional analysis, users may want to access data for one/each country - so data should be packaged to download the dataset with coverage for each country.
THEME: Data resources may be grouped by hazard type, sector type, etc.

We also need to consider:

SELF-DEPENDENCY & COMPLETENESS: the data resource can be interpreted and used by itself.
EFFICIENCY: try to avoid creating huge datasets (>1 GB) that would be hard to download on poor connections.

Where there are many resources for a dataset, there is a temptation to include a folder structure in Data Catalog. This does not enable easy access to resources. Datasets and Resources should be set up to facilitate easy finding of the specific component of analysis, and grouping resources together in a sensible fashion, without creating problematically large download sizes.
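
As a rough aid for the EFFICIENCY criterion, the sketch below lists files in a local folder that exceed a size limit before they are uploaded as resources. The function name, the folder layout and the use of 1 GB as 10^9 bytes are assumptions made for illustration, not part of any RDL or DDH tooling.

```python
from pathlib import Path

# Assumed threshold, based on the ">1 GB" guidance above (1 GB taken as 10^9 bytes).
MAX_RESOURCE_BYTES = 1000**3


def oversized_resources(folder, limit=MAX_RESOURCE_BYTES):
    """Return (relative path, size in bytes) for each file under `folder` larger than `limit`."""
    root = Path(folder)
    flagged = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                flagged.append((path.relative_to(root).as_posix(), size))
    return flagged
```

Files flagged this way are candidates for regrouping by geography or theme rather than for arbitrary splitting.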

Decisions on how to structure risk data should be taken on a project-by-project basis, because data structures vary widely depending on the components of a project. However, here are a few examples:

Hazard data

Format / data types

Hazard data includes:

Return period hazard maps
Scenario/historical event footprints
Hazard curves
(Stochastic) event set tables
Historical event catalogue
River network / cyclone track / seismic fault databases
Other input files including flood protection data, intensity-duration-frequency curves, ground motion relationships, etc.

Generally, hazard data (footprints) takes the form of raster (geo grid) data (GeoTIFF / COG). Supporting data (hazard curves, historical catalogue) often come as tables (csv, xlsx) or vector data (gpkg, shp). They can also be packaged in a similar fashion.

Thematic grouping

The main thematic groupings in hazard data are:

Hazard type: data produced for seismic hazard, wildfire, fluvial flood, pluvial flood, etc.
Year: e.g., current, projected 2050, 2080, etc. using climate projections

Geographic grouping

Scale, location and resolution: Hazard data may be generated at global, regional, national, subnational, or urban level. High-resolution hazard data (e.g. urban level analysis) might be grouped for individual locations (city) whenever the dataset becomes too large.

In general, splitting raster datasets into smaller parts is not advised, as it conflicts with the self-dependency and completeness criteria. If splitting is required for efficiency, always clip to a larger extent than needed so as to avoid cross-border artefacts.

[FIGURE EXAMPLE: BORDER CLIP vs EXTENT CLIP OF GLOBAL LAYER ON A COUNTRY]
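
One way to apply the "larger extent than needed" advice is to pad the bounding box of the area of interest before clipping. The sketch below only computes the padded extent; the function name and the margin value are illustrative assumptions, and the margin is expressed in the same units as the coordinates.

```python
def padded_extent(bounds, margin):
    """Expand (min_x, min_y, max_x, max_y) by `margin` on every side."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    min_x, min_y, max_x, max_y = bounds
    return (min_x - margin, min_y - margin, max_x + margin, max_y + margin)


# Hypothetical country bounding box in degrees, padded by 1 degree.
print(padded_extent((10, 20, 12, 23), 1))
```

The padded extent can then be passed to whichever clipping tool the analyst already uses.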

Packaging recommendation

We recommend grouping hazard data in the following hierarchy:

Hazard type
- Geographic scale and location (country; sub-national; city)
- Year of data (current or projected)

NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY

For example:

* Dataset: <project name> hazard data
   * Dataset: <project name> <hazard1> RP maps
      * Dataset: <project name> <hazard1> data <country1>
         * Zipped Resource: <hazard1> <country1>_2020
         * Zipped Resource: <hazard1> <country1>_2050
         * Zipped Resource: <hazard1> <country1>_2080
      * Dataset: <project name> <hazard1> data <country2>
         * Zipped Resource: <hazard1> <country2>_2020
         * Zipped Resource: <hazard1> <country2>_2050
         * Zipped Resource: <hazard1> <country2>_2080
   * Dataset: <project name> <hazard2> RP maps
      * …
         * …
   * Dataset: <project name> <hazard1> historical catalog
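
The hierarchy above can be sketched as a small helper that builds one dataset per country and one zipped resource per year. The function name and the underscore-separated naming pattern are assumptions for illustration; each project may settle on a different convention.

```python
def hazard_packaging(project, hazard, countries, years):
    """Map each per-country dataset title to its list of per-year resource names."""
    return {
        f"{project} {hazard} data {country}": [
            f"{hazard}_{country}_{year}" for year in years
        ]
        for country in countries
    }


# Hypothetical project, hazard and country codes.
plan = hazard_packaging("MyProject", "flood", ["KEN", "UGA"], [2020, 2050, 2080])
for dataset, resources in plan.items():
    print(dataset, resources)
```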

Exposure data

Format

Exposure geospatial data can take the form of vector (gpkg, shp) or raster (GeoTIFF / COG) data. In some cases, exposure comes as a table (csv, xls).

[EXAMPLE PIC FOR EACH FORMAT]

GeoPackage (`.gpkg`) is preferred over shapefile (`.shp`) for vector data. Conversion from `.shp` to `.gpkg` is lossless and usually size-efficient. Where the shp format is maintained, the data should be provided as a zip file containing all the components of the shapefile dataset (.shp, .shx, .dbf, .prj, etc.).
Read more: (link to format page - next post)
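
Where a shapefile is kept, its components can be bundled with the Python standard library. This is a minimal sketch: the function name is made up, and it assumes that every file whose name starts with the shapefile's base name followed by a dot belongs to the same dataset.

```python
import zipfile
from pathlib import Path


def zip_shapefile(shp_path):
    """Bundle a .shp file and its sidecar files into <name>.zip next to it."""
    shp = Path(shp_path)
    target = shp.with_suffix(".zip")
    # Sidecar files share the shapefile's base name (roads.shp, roads.dbf, ...).
    prefix = shp.stem + "."
    parts = sorted(
        p for p in shp.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix) and p != target
    )
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for part in parts:
            # Store files flat, without the local folder path.
            archive.write(part, arcname=part.name)
    return target
```

Calling `zip_shapefile("data/roads.shp")` (a hypothetical path) would write `data/roads.zip` containing all matching parts.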

Thematic grouping

The main thematic groupings in exposure data are:

Asset type / sector / construction type: e.g. Structure, Content, Product / Residential, Commercial / Masonry, Wood
Year: reference period or year, e.g., current, projected (2040-2060), etc.

Geographic grouping

Scale, location and resolution: Exposure data may be generated at global, regional, national, subnational, or urban level. High-resolution exposure data (e.g. urban level) might be grouped for individual locations (city) whenever the dataset becomes too large.

In general, splitting raster datasets into smaller parts is not advised, as it conflicts with the self-dependency and completeness criteria. If splitting is required for efficiency, always clip to a larger extent than needed so as to avoid cross-border artefacts.

Packaging recommendation

We recommend grouping exposure data in the following hierarchy:

Geographic scale and location (country; sub-national; city)
- Year of data (current or projected)
- (optional) Sector or asset type (Residential; Commercial / Population, Buildings).

NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY

For example:

* Dataset: <project name> exposure data
   * Dataset: <project name> <country1> exposure data
      * Dataset: <project name> <country1> exposure data - 2020
         * Resource: <country1>_2020_exposure_RES
         * Resource: <country1>_2020_exposure_COM
         * Resource: <country1>_2020_exposure_EDU
         * Resource: <country1>_2020_exposure_ROAD
      * Dataset: <project name> <country1> exposure data - 2050
         * Resource: <country1>_2050_exposure_RES
         * Resource: <country1>_2050_exposure_COM
         * Resource: <country1>_2050_exposure_EDU
         * Resource: <country1>_2050_exposure_ROAD
   * Dataset: <project name> <country2> exposure data
      * …
         * …

Vulnerability data

Format

Vulnerability data are usually provided as table data (csv, xls) containing the impact model function and parameters. Often, vulnerability models are proprietary data and only shared as pictures; this has low reusability and should be avoided. Always try to obtain a mathematical description for this component.

Thematic grouping

The main thematic groupings in vulnerability data are:

Hazard type: e.g. Flood damage function; Earthquake fragility curves.
Asset type / sector / construction type: e.g. Structure, Content, Product / Residential, Commercial / Masonry, Wood

Geographic grouping

Vulnerability curves may be developed for individual countries or environments within a project. Where this is the case, this grouping should be retained.

Packaging recommendation

We recommend grouping vulnerability data in the following hierarchy:

Hazard type
- Geographic (unless global function, one dataset per country)
- Asset type / sector / construction type: e.g. Structure, Content, Product / Residential, Commercial / Masonry, Wood

Note that this hierarchy should be maintained even when packing all the data in one file, e.g. as multiple sheets of an Excel file.

[EXAMPLE OF MULTIPLE IMPACT MODELS IN ONE FILE]

NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY

For example:

* Dataset: <project name> vulnerability data
   * Dataset: <project name> <hazard1> vulnerability data
      * Resource: <hazard1>_RES_timber
      * Resource: <hazard1>_RES_RC
      * Resource: <hazard1>_COM_steel
      * Resource: <hazard1>_COM_RM
      * Or Resource: <hazard1>_vulnerability_curves_all_types (if data all in one file)
   * Dataset: <project name> <hazard2> vulnerability data
      * …
      * …

Loss data

Format

Loss data often comes in the form of:

tabulated event losses, and loss per exceedance probability
Mapped return period loss / annual average loss - in vector files/choropleth maps
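
To illustrate how the two tabulated forms relate, the sketch below approximates annual average loss from a table of return-period losses by trapezoidal integration over annual exceedance probability. This is a common approximation, not a method prescribed by the RDL; the function name is hypothetical, and the result ignores any loss outside the range of return periods supplied.

```python
def average_annual_loss(return_periods, losses):
    """Approximate AAL by trapezoidal integration of loss over exceedance probability."""
    if len(return_periods) != len(losses) or len(losses) < 2:
        raise ValueError("need at least two matching (return period, loss) pairs")
    # Annual exceedance probability is taken as 1 / return period.
    points = sorted((1.0 / rp, loss) for rp, loss in zip(return_periods, losses))
    total = 0.0
    for (p0, l0), (p1, l1) in zip(points, points[1:]):
        total += (p1 - p0) * (l0 + l1) / 2.0
    return total


# Hypothetical losses for the 10-, 50- and 100-year return periods.
print(average_annual_loss([10, 50, 100], [100.0, 600.0, 1000.0]))
```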

Thematic grouping

The main thematic groupings in loss data are:

Hazard type: there may also be a multi-hazard loss metric included.
Asset type / sector: e.g. Structure, Content, Product / Residential, Commercial
Year: e.g., current, projected 2050, 2080, etc.

Geographic grouping

Losses are usually aggregated at national or subnational administrative level (ADM2, ADM1, or ADM0). Losses can also be provided per asset (e.g. individual buildings or raster footprints), but this is unusual, even though analysts often generate these files.

Packaging recommendation

We recommend grouping loss data in the following hierarchy:

Hazard type
- Sector/asset type
- Year

NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY

For example:

* Dataset: <project name> loss data
   * Dataset: <project name> <hazard1> loss data 2020
      * Resource: <hazard1>_RES
      * Resource: <hazard1>_COM
      * Resource: <hazard1>_AllSectors
   * Dataset: <project name> <hazard2> loss data 2020
      * …
      * …
   * Dataset: <project name> <hazard1> loss data 2050
   * Dataset: <project name> <hazard1> loss data 2080

Data formats

Risk data can be made of spatial or non-spatial data.

Spatial data (geodata) can be shared in a variety of formats depending on the software used by the analyst. Over the years, OSGEO (Open Source Geospatial Foundation) tried to converge towards a limited number of "best" standard formats for each geospatial type. Below is a list of recommended and supported geodata formats.
Non-spatial data most often consist of table data stored as excel or csv files for greater compatibility.

Recommended geodata formats

Vector data: GeoPackage

GeoPackage (.gpkg) is an open, non-proprietary container built on an SQLite3 database. It is platform-independent and standards-based (OGC), and is supported by QGIS and GDAL. It is similar to an ESRI geodatabase, but more responsive. It is a single-file format that can store vector data and attributes, symbology, pyramids and table data as individual layers within one GeoPackage. It can also store rasters, but its support for raster data is still limited and we do not recommend storing rasters as GeoPackage. It supports SQL and API access to the database, which makes it fit for web applications, and it can be exported to PostGIS. There is no limit on the number of attributes, attribute name length, or file size (unlike shapefile). Internal metadata specifications are under development.

Raster data: GeoTIFF / COG (.tif)

GeoTIFF (.tif) is the standard image file format for GIS and satellite remote sensing applications. It can store multiple realisations as “bands”. GeoTIFFs can be accompanied by other auxiliary files (.tfw for raster geolocation, .xml for metadata, .aux for projections and others, .ovr for pyramids to improve visualisation). These should be packed together with the .tif files in a zip for sharing. A Cloud Optimized GeoTIFF (COG) is a regular GeoTIFF file, aimed at being hosted on an HTTP file server, with an internal organization that enables more efficient workflows on the cloud. It does this by leveraging the ability of clients issuing HTTP GET range requests to ask for just the parts of a file they need. This is the best option for data that needs to be hosted on a geocatalogue such as GeoNode.
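
The range-request mechanism that COG relies on can be sketched with the Python standard library. The example only builds the request and does not send it; the function name and the URL are placeholders, and real COG clients (e.g. GDAL) handle this internally.

```python
import urllib.request


def range_request(url, start, length):
    """Build (but do not send) an HTTP GET request for `length` bytes from offset `start`."""
    end = start + length - 1  # Range header bounds are inclusive
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Placeholder URL: ask for the first 16 KiB of a hosted raster.
request = range_request("https://example.org/hazard.tif", 0, 16384)
print(request.get_header("Range"))
```

Passing the request to `urllib.request.urlopen` would send it; a server that supports range requests replies with status 206 (Partial Content) and only the requested bytes.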

Supported geodata formats

Vector data

ESRI ShapeFile (SHP): Well established, de facto standard in the GIS community. Accepted by all GIS software. Format specifications are open, however it is a proprietary format (controlled by Esri). It can only contain one geometry type (point, line, polygon) per file. It is a multi-part file format (.shp for geometry, .dbf for table, .shx for indexing, .prj for CRS, other optional files for encoding, indexes, etc.), so all the parts need to be packed together in a zip file. Note some limitations: attribute names are limited to 10 characters, the number of attributes (i.e. table fields) is limited to 255, and the file size is restricted to 2 GB.
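
The attribute limits quoted above can be checked before exporting to shapefile. This is a minimal sketch with a hypothetical function name; it only checks the name-length and field-count limits stated in this section and does not read or write any GIS file.

```python
MAX_NAME_LENGTH = 10   # attribute-name limit quoted above
MAX_FIELD_COUNT = 255  # attribute-count limit quoted above


def shapefile_field_problems(field_names):
    """Return human-readable messages for fields that break the limits above."""
    problems = []
    if len(field_names) > MAX_FIELD_COUNT:
        problems.append(
            f"{len(field_names)} fields exceeds the limit of {MAX_FIELD_COUNT}"
        )
    for name in field_names:
        if len(name) > MAX_NAME_LENGTH:
            problems.append(f"'{name}' is longer than {MAX_NAME_LENGTH} characters")
    return problems


# Hypothetical attribute table: the second name would be truncated in a shapefile.
print(shapefile_field_problems(["id", "construction_type"]))
```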

Raster data

Network Common Data Form (NetCDF): The NetCDF format is an interface for array-oriented data that stores multi-dimensional variables. It is commonly used in the scientific community for multidimensional geodata storage (e.g. climate data). It is supported by ArcGIS and QGIS via toolbox conversion or extensions; most spatial processing tools require conversion into raster first.
GRIdded Binary or General Regularly-distributed Information in Binary (GRIB): GRIB was standardized by the WMO and has been in operation since 1985. Similar to NetCDF, GRIB files are commonly used in meteorology to store historical and forecast weather data. It is a multidimensional file format with the advantages of self-description, flexibility and expandability. There are tools to convert GRIB into rasters, such as grb2grid and QGIS.

Recommended non-spatial formats

CSV: Used for table data such as results summaries, aggregations, etc. Not recommended for gridded spatial data. Small files can be added uncompressed, so the resource filetype will show as ‘CSV’. Where large or multiple files are compressed, the filetype will show as ‘ZIP’, so please include a reference to the .csv filetype in the resource description.
Excel: Used for table data such as results summaries, aggregations, etc. Not recommended for gridded spatial data. Small files can be added uncompressed; multiple files should come in one zip file. Please include a reference to the .xls filetype in the resource description.
PDF: Preferred format for reports and documentation. Add reports uncompressed whenever possible: users will commonly want to see the description for each report or document as one resource per file. The resource filetype will show as ‘PDF’.

WB data catalogue (DDH): update workflow

General

The Risk Data Library Collection sits within the World Bank Data Catalog and is meant to store standard risk data. The collection can be accessed from the collections page or used as a filter on the left bar to search for data within the collection.

Adding datasets

Datasets can be submitted for review and publication on the Data Catalog by any World Bank Staff, ETC or STC. These people have the role of ‘Data depositor’.

Two approaches to upload data:

Individually: using the existing upload wizard
Bulk: for a large number of datasets; requires support from the DDH team

Datasets can be added to the RDL Collection by the RDL team, after approval.

Individual datasets

Log in to the Data Catalog (top right bar): https://datacatalog.worldbank.org/int/home
View ‘My datasets’ (top right bar): https://datacatalog.worldbank.org/int/data/mydata
- This page shows datasets you have uploaded or for which you are listed as a contributor
- Shows dataset number, name, modified date, status (Published, Draft, Under review, Publishing in progress), and action (Edit or Submit for review)
Click ‘Add data’ (top right bar): https://datacatalog.worldbank.org/int/data/add Select the option on the right: continue.
1. ‘Essential Information’
2. ‘Data Resources’
  - Upload dataset from local storage according to the data preparation guidelines
  - Add a resource title and description
  - When one resource has been submitted, another one can be added
3. Additional information
  - Tags: These are important for being able to search the data in the catalog. Suggestions for RDLS data:
    - Climate Risk or Disaster Risk
    - Hazard, Exposure, Vulnerability, Loss (depending on the component type)
    - Flood, Earthquake, Landslide, Tsunami (hazard type)
  - Topics: There is currently no topic for risk analytics or climate and disaster risk - leave blank
  - Collection: this can only be entered by staff with those rights. Provide a list of dataset IDs to Kamwoo Lee with a request to assign the data to the RDL Collection.

When all required (and optional) information has been entered, click on ‘Save as draft’. The dataset will appear in your datasets list.

Under ‘Action’ you can edit or submit it for review by the DDH team.
When status is ‘Published’, the dataset will be visible on the World Bank Data Catalog.

Bulk upload

In cases where large volumes of project data should be uploaded, DDH team can assist with bulk upload. The workflow steps are:

Store project data in folders on OneDrive.
Create an Excel spreadsheet listing, for each dataset, the data type, the dataset name, the URL to the data and the URL to the prepared JSON metadata.
Describe the data structure to be achieved on DDH.
DDH team will copy the data and metadata to DDH Sharepoint.
DDH team will use scripts to upload datasets; these will appear in your ‘My Datasets’ for review and any further editing.
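
The spreadsheet in the second step could be drafted programmatically. The sketch below writes a CSV file, which Excel can open, with one row per dataset. The column names and the function name are assumptions for illustration; the DDH team may require a different layout.

```python
import csv

# Assumed column names; confirm the expected layout with the DDH team.
COLUMNS = ["dataset_name", "data_type", "data_url", "metadata_url"]


def write_manifest(rows, path):
    """Write one row per dataset to a CSV manifest with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

Each row is a dictionary keyed by the column names, for example a hazard dataset with placeholder URLs pointing to the zipped data and its JSON metadata.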

Adding RDL metadata

Create metadata following the Risk Data Library schema in JSON format. Metadata should be created for each dataset and include the name and description of the resources under that dataset. Either:
1. Write directly into a JSON file
2. Use the JSON metadata creation tool. This tool is standalone (not part of DDH). It exports a JSON file to be saved with the dataset.
Upload metadata with the dataset. Metadata will become available to download from the dataset page. This will contain the standard DDH metadata plus the RDL metadata.
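
When writing the JSON by hand, a quick check that the file parses and that every resource carries a name and a description can catch mistakes before upload. In the sketch below the keys `resources`, `name` and `description` are placeholders, not taken from the RDL schema; substitute the field names the schema actually defines.

```python
import json


def resources_missing_fields(metadata_text, required=("name", "description")):
    """Return indexes of resources that lack any required key in a JSON metadata string."""
    # json.loads raises ValueError if the text is not well-formed JSON.
    metadata = json.loads(metadata_text)
    missing = []
    # "resources" is a placeholder key, not confirmed against the RDL schema.
    for index, resource in enumerate(metadata.get("resources", [])):
        if any(not resource.get(key) for key in required):
            missing.append(index)
    return missing
```

An empty list means every resource has a non-empty value for each required key.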

Contacts

RDL Team

Mattia Amadio [mamadio@worldbank.org](mailto:mamadio@worldbank.org)
Stuart Fraser [sfraser@worldbank.org](mailto:sfraser@worldbank.org)
Pierre Chrzanowski [pchrzanowski@worldbank.org](mailto:pchrzanowski@worldbank.org)

DDH Team

Kamwoo Lee [klee16@worldbank.org](mailto:klee16@worldbank.org)
Gaurav Bhardwaj [gbhardwaj1@worldbank.org](mailto:gbhardwaj1@worldbank.org)
Rochelle O’Hagan [rohagan@worldbank.org](mailto:rohagan@worldbank.org) (DDH lead)

The DDH team is responsible for reviewing and publishing submitted datasets and, in the short term, for assigning datasets to the RDL Collection during review.

The content posted has been included in current draft. Please refer to repo version for any comment or edit.

GFDRR / rdl-standard

[Documentation] DRAFT input #42
