Closed matamadio closed 1 year ago
Suggestions on the packaging of resources / hierarchy within datasets https://docs.google.com/document/d/1PgL4AYvAJVQ74TMceCeGbdEkzphCnh1nMqCszFk9T_0/edit?usp=share_link
Started drafting a condensed version of the shared doc. @stufraser1 please edit this comment for small changes, otherwise if major changes please do in new comment.
The data structure and packaging of the output as obtained from the data analysts may not always align with the way we want users of the RiskDataLibrary to search and download data.
Datasets shared in risk catalogues (e.g. the Risk Data Library Collection) are provided as individual RESOURCES, which should be packed (grouped) according to two main criteria:
We also need to consider:
Where there are many resources for a dataset, there is a temptation to include a folder structure in Data Catalog. This does not enable easy access to resources. Datasets and Resources should be set up to facilitate easy finding of the specific component of analysis, and grouping resources together in a sensible fashion, without creating problematically large download sizes.
Decisions on how to structure risk data should be taken on a project-by-project basis, because data structures vary widely depending on the components of a project. However, here are a few examples:
Hazard data includes:
Generally, hazard data (footprints) takes the form of raster (geo grid) data (`GeoTIFF` / `COG`).
Supporting data (hazard curves, historical catalogue) often come as tables (`csv`, `xlsx`) or vector data (`gpkg`, `shp`).
They can also be packaged in a similar fashion.
The main thematic groupings in hazard data are:
In general, splitting raster datasets into smaller parts is not advised, according to the self-dependency and completeness criteria. If splitting is required for data efficiency, always clip to a larger extent than needed, so as to avoid cross-border artefacts.
[FIGURE EXAMPLE: BORDER CLIP vs EXTENT CLIP OF GLOBAL LAYER ON A COUNTRY]
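The "larger extent than needed" advice can be sketched as a small helper that pads a country bounding box before it is used as a clip window. This is only an illustration: the function name, the buffer size and the bounding box values are assumptions, not part of any RDL tooling, and the actual clip would be done in a GIS tool.

```python
from typing import Tuple

# (min_x, min_y, max_x, max_y), here assumed to be in degrees
BBox = Tuple[float, float, float, float]


def buffered_extent(
    bounds: BBox,
    buffer: float,
    limits: BBox = (-180.0, -90.0, 180.0, 90.0),
) -> BBox:
    """Expand a bounding box by `buffer` on every side, clamped to `limits`."""
    if buffer < 0:
        raise ValueError("buffer must be non-negative")
    min_x, min_y, max_x, max_y = bounds
    lim_min_x, lim_min_y, lim_max_x, lim_max_y = limits
    return (
        max(min_x - buffer, lim_min_x),
        max(min_y - buffer, lim_min_y),
        min(max_x + buffer, lim_max_x),
        min(max_y + buffer, lim_max_y),
    )


# Hypothetical country bounding box (placeholder values, not real data).
country_bbox = (10.0, 40.0, 15.0, 45.0)
clip_window = buffered_extent(country_bbox, buffer=0.5)
print(clip_window)
```

Clipping the global layer to `clip_window` rather than to the exact border keeps the cells that straddle the boundary.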
We recommend grouping hazard data in the following hierarchy:
NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY
For example:
* Dataset: <project name> hazard data
  * Dataset: <project name> <hazard1> RP maps
    * Dataset: <project name> <hazard1> data <country1>
      * Zipped Resource: <hazard1> <country1>_2020
      * Zipped Resource: <hazard1> <country1>_2050
      * Zipped Resource: <hazard1> <country1>_2080
    * Dataset: <project name> <hazard1> data <country2>
      * Zipped Resource: <hazard1> <country2>_2020
      * Zipped Resource: <hazard1> <country2>_2050
      * Zipped Resource: <hazard1> <country2>_2080
  * Dataset: <project name> <hazard2> RP maps
    * …
      * …
  * Dataset: <project name> <hazard1> historical catalog
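The naming hierarchy above can be generated programmatically when a project covers many hazards, countries and time horizons. The following is a minimal sketch, assuming the name templates shown in the example; the function name and the project, hazard and country values are hypothetical.

```python
from typing import Dict, List


def hazard_packaging(
    project: str, hazards: List[str], countries: List[str], years: List[int]
) -> Dict[str, Dict[str, List[str]]]:
    """Map each hazard-level dataset to its country datasets and zipped resources."""
    tree: Dict[str, Dict[str, List[str]]] = {}
    for hazard in hazards:
        country_datasets: Dict[str, List[str]] = {}
        for country in countries:
            dataset = f"{project} {hazard} data {country}"
            # One zipped resource per time horizon
            country_datasets[dataset] = [f"{hazard} {country}_{year}" for year in years]
        tree[f"{project} {hazard} RP maps"] = country_datasets
    return tree


# Hypothetical project values, for illustration only.
tree = hazard_packaging("DemoProject", ["flood"], ["CountryA", "CountryB"], [2020, 2050, 2080])
for rp_dataset, country_datasets in tree.items():
    print(f"* Dataset: {rp_dataset}")
    for dataset, resources in country_datasets.items():
        print(f"  * Dataset: {dataset}")
        for resource in resources:
            print(f"    * Zipped Resource: {resource}")
```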
Exposure geospatial data can take the form of vector (`gpkg`, `shp`) or raster (`GeoTIFF` / `COG`).
In some cases, exposure comes as a table (`csv`, `xls`).
[EXAMPLE PIC FOR EACH FORMAT]
GeoPackage (`.gpkg`) is preferred over shapefile (`.shp`) for vector data. Conversion from `.shp` to `.gpkg` is lossless and usually size-efficient. Where the shapefile format is retained, the dataset should be provided as a zip archive containing all components of the shapefile (.shp, .shx, .dbf, .prj, .xml, etc.).
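Where shapefiles are kept, the zip packaging can be scripted. This is a minimal sketch using only the Python standard library; it assumes that every file sharing the `.shp` file's base name belongs to the same dataset, and the function name and the `roads` files are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_shapefile(shp_path) -> Path:
    """Zip a .shp file together with every sidecar file sharing its base name."""
    shp = Path(shp_path)
    prefix = shp.stem + "."  # matches roads.shp, roads.dbf, roads.shp.xml, ...
    parts = sorted(
        p
        for p in shp.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix) and p.suffix.lower() != ".zip"
    )
    archive = shp.with_suffix(".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for part in parts:
            zf.write(part, arcname=part.name)  # store flat, without folder paths
    return archive


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for ext in (".shp", ".shx", ".dbf", ".prj"):
        (folder / f"roads{ext}").write_bytes(b"")
    (folder / "rivers.shp").write_bytes(b"")  # unrelated layer, must be left out
    archive = zip_shapefile(folder / "roads.shp")
    with zipfile.ZipFile(archive) as zf:
        packed = sorted(zf.namelist())
print(packed)
```

Note that the prefix match would also pick up other files with the same base name (for example a `roads.gpkg` in the same folder), so the folder should contain only the shapefile components.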
Read more: (link to format page - next post)
The main thematic groupings in exposure data are:
In general, splitting raster datasets into smaller parts is not advised, according to the self-dependency and completeness criteria. If splitting is required for data efficiency, always clip to a larger extent than needed, so as to avoid cross-border artefacts.
We recommend grouping exposure data in the following hierarchy:
NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY
For example:
* Dataset: <project name> exposure data
  * Dataset: <project name> <country1> exposure data
    * Dataset: <project name> <country1> exposure data - 2020
      * Resource: <country1>_2020_exposure_RES
      * Resource: <country1>_2020_exposure_COM
      * Resource: <country1>_2020_exposure_EDU
      * Resource: <country1>_2020_exposure_ROAD
    * Dataset: <project name> <country1> exposure data - 2050
      * Resource: <country1>_2050_exposure_RES
      * Resource: <country1>_2050_exposure_COM
      * Resource: <country1>_2050_exposure_EDU
      * Resource: <country1>_2050_exposure_ROAD
  * Dataset: <project name> <country2> exposure data
    * …
      * …
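Because the resource names in this example encode country, year and category, they can be checked and sorted into datasets automatically. A minimal sketch, assuming the `<country>_<year>_exposure_<category>` pattern from the example above; the regular expression, function names and sample names are illustrative only.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed pattern: <country>_<year>_exposure_<category>, e.g. CountryA_2020_exposure_RES
RESOURCE_PATTERN = re.compile(
    r"(?P<country>.+)_(?P<year>\d{4})_exposure_(?P<category>[A-Za-z]+)"
)


def parse_exposure_resource(name: str) -> Tuple[str, int, str]:
    """Split a resource name into (country, year, category)."""
    match = RESOURCE_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected resource name: {name!r}")
    return match["country"], int(match["year"]), match["category"]


def group_into_datasets(project: str, resources: List[str]) -> Dict[str, List[str]]:
    """Group resource names under their '<project> <country> exposure data - <year>' dataset."""
    datasets: Dict[str, List[str]] = defaultdict(list)
    for name in resources:
        country, year, _category = parse_exposure_resource(name)
        datasets[f"{project} {country} exposure data - {year}"].append(name)
    return dict(datasets)


# Hypothetical resource names, for illustration only.
names = [
    "CountryA_2020_exposure_RES",
    "CountryA_2020_exposure_COM",
    "CountryA_2050_exposure_RES",
]
grouped = group_into_datasets("DemoProject", names)
for dataset, members in grouped.items():
    print(dataset, members)
```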
Vulnerability data are usually provided as table data (`csv`, `xls`) containing the impact model function and parameters.
Often, vulnerability models are proprietary data and only shared as pictures; this has low reusability and should be avoided. Always try to obtain a mathematical description for this component.
The main thematic groupings in vulnerability data are:
Vulnerability curves may be developed for individual countries or environments within a project. Where this is the case, this grouping should be retained.
We recommend grouping vulnerability data in the following hierarchy:
Note that this hierarchy should be maintained even when packing all the data in one file, e.g. as multiple sheets of an Excel file.
[EXAMPLE OF MULTIPLE IMPACT MODELS IN ONE FILE]
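One way to keep the hierarchy inside a single file is a long-format table in which each row carries its hazard, occupancy and construction type. The sketch below uses a CSV held in memory rather than Excel sheets, which is an assumption made to keep it dependency-free; the column names are hypothetical and the curve values are placeholders, not real vulnerability data.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

FIELDS = ["hazard", "occupancy", "construction", "intensity", "damage_ratio"]

# Placeholder curve points only; NOT real vulnerability functions.
ROWS = [
    ("flood", "RES", "timber", 0.0, 0.0),
    ("flood", "RES", "timber", 1.0, 0.4),
    ("flood", "COM", "steel", 0.0, 0.0),
    ("flood", "COM", "steel", 1.0, 0.2),
]


def write_curves(rows) -> str:
    """Serialise all curves into one CSV, one row per curve point."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buffer.getvalue()


def read_curves(text: str) -> Dict[str, List[Tuple[float, float]]]:
    """Rebuild one curve per <hazard>_<occupancy>_<construction> key."""
    curves: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        key = f"{row['hazard']}_{row['occupancy']}_{row['construction']}"
        curves[key].append((float(row["intensity"]), float(row["damage_ratio"])))
    return dict(curves)


curves = read_curves(write_curves(ROWS))
for name, points in curves.items():
    print(name, points)
```

The keys rebuilt on reading match the per-resource names used when curves are shared as separate files, so either packaging can be converted into the other.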
NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY
For example:
* Dataset: <project name> vulnerability data
  * Dataset: <project name> <hazard1> vulnerability data
    * Resource: <hazard1>_RES_timber
    * Resource: <hazard1>_RES_RC
    * Resource: <hazard1>_COM_steel
    * Resource: <hazard1>_COM_RM
    * Or Resource: <hazard1>_vulnerability_curves_all_types (if data all in one file)
  * Dataset: <project name> <hazard2> vulnerability data
    * …
    * …
Loss data often comes in the form of:
The main thematic groupings in loss data are:
Losses are usually aggregated at national or subnational administrative level (ADM2, ADM1, or ADM0). Losses can also be provided per asset (e.g. individual buildings or raster footprints), but this is unusual, although these files are often generated by the analysts.
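Aggregating losses up the administrative levels amounts to summing child units into their parent unit. A minimal sketch, assuming that losses are additive across units and that a lookup from each unit to its parent is available; the unit codes and loss values are placeholders.

```python
from collections import defaultdict
from typing import Dict


def aggregate(losses: Dict[str, float], parent_of: Dict[str, str]) -> Dict[str, float]:
    """Sum losses of child units into their parent administrative unit."""
    totals: Dict[str, float] = defaultdict(float)
    for unit, loss in losses.items():
        totals[parent_of[unit]] += loss
    return dict(totals)


# Placeholder ADM2 losses and parent lookups (not real data).
adm2_losses = {"A-1-1": 10.0, "A-1-2": 5.0, "A-2-1": 2.5}
adm2_to_adm1 = {"A-1-1": "A-1", "A-1-2": "A-1", "A-2-1": "A-2"}
adm1_to_adm0 = {"A-1": "A", "A-2": "A"}

adm1_losses = aggregate(adm2_losses, adm2_to_adm1)
adm0_losses = aggregate(adm1_losses, adm1_to_adm0)
print(adm1_losses)
print(adm0_losses)
```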
We recommend grouping loss data in the following hierarchy:
NOTE THAT EACH PROJECT WILL HAVE ITS OWN NUANCES WHICH MAY REQUIRE AN ALTERNATIVE PACKAGING HIERARCHY
For example:
* Dataset: <project name> loss data
  * Dataset: <project name> <hazard1> loss data 2020
    * Resource: <hazard1>_RES
    * Resource: <hazard1>_COM
    * Resource: <hazard1>_AllSectors
  * Dataset: <project name> <hazard2> loss data 2020
    * …
    * …
  * Dataset: <project name> <hazard1> loss data 2050
  * Dataset: <project name> <hazard1> loss data 2080
Risk data can be made of spatial or non-spatial data.
Spatial data (geodata) can be shared in a variety of formats depending on the software used by the analyst. Over the years, OSGEO (Open Source Geospatial Foundation) tried to converge towards a limited number of "best" standard formats for each geospatial type. Below is a list of recommended and supported geodata formats.
Non-spatial data most often consist of tables stored as Excel or CSV files for greater compatibility.
GeoPackage (`.gpkg`) is an open, non-proprietary, extended SQLite3 database container. It is platform-independent and standards-based (OGC, QGIS, GDAL). It is similar to an ESRI geodatabase, but more responsive. It is a single-file format that can store vector data and attributes, symbology, pyramids and table data as individual layers within one GeoPackage. It is possible to store rasters, but support for raster data is still limited and we do not recommend storing rasters as GeoPackage. It supports SQL and API access to the database, which makes it fit for web applications, and it can be exported to PostGIS. There is no limit on the number of attributes, attribute name length, or file size (unlike shapefile). Internal metadata specifications are under development.
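Because a GeoPackage is an SQLite database, its layers can be listed with Python's built-in `sqlite3` module by querying the `gpkg_contents` table defined by the GeoPackage standard. The sketch below runs against a stand-in file that contains only a simplified `gpkg_contents` table; that stub is not a valid GeoPackage, and the file and layer names are hypothetical.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def list_gpkg_layers(path):
    """Return (table_name, data_type) for every layer registered in gpkg_contents."""
    with closing(sqlite3.connect(str(path))) as conn:
        return conn.execute(
            "SELECT table_name, data_type FROM gpkg_contents ORDER BY table_name"
        ).fetchall()


def make_stub_gpkg(path):
    """Create a stand-in SQLite file with a simplified gpkg_contents table.

    This is NOT a valid GeoPackage; it only lets the query above run
    without a real dataset.
    """
    with closing(sqlite3.connect(str(path))) as conn:
        conn.execute(
            "CREATE TABLE gpkg_contents ("
            "table_name TEXT PRIMARY KEY, data_type TEXT NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO gpkg_contents VALUES (?, ?)",
            [("buildings", "features"), ("occupancy_lookup", "attributes")],
        )
        conn.commit()


with tempfile.TemporaryDirectory() as tmp:
    stub = Path(tmp) / "exposure_demo.gpkg"  # hypothetical file name
    make_stub_gpkg(stub)
    layers = list_gpkg_layers(stub)
print(layers)
```

With a real GeoPackage, `list_gpkg_layers("path/to/file.gpkg")` should return one row per vector, tile or attribute layer.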
GeoTIFF (`.tif`) is the image standard file for GIS and satellite remote sensing applications. It can store multiple realisations as “bands”. GeoTIFFs can be accompanied by other auxiliary files (.tfw for raster geolocation, .xml for metadata, .aux for projections and others, .ovr for pyramids to improve visualisation). These should be packed together with the .tif files in a zip for sharing.
A Cloud Optimized GeoTIFF (COG) is a regular GeoTIFF file, aimed at being hosted on an HTTP file server, with an internal organization that enables more efficient workflows on the cloud. It does this by leveraging the ability of clients issuing HTTP GET range requests to ask for just the parts of a file they need. This is the best option for data that needs to be hosted on a geocatalogue such as GeoNode.
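The range-request mechanism can be illustrated with the standard library alone. The sketch below builds (but does not send) a request for the first bytes of a file and parses a TIFF header from a byte string; the URL is a placeholder, the 16 KiB read size is an arbitrary assumption, and real COG readers such as GDAL handle all of this internally.

```python
import struct
import urllib.request
from typing import Tuple


def range_request(url: str, start: int, length: int) -> urllib.request.Request:
    """Build a GET request asking only for `length` bytes from offset `start`."""
    end = start + length - 1  # HTTP byte ranges are inclusive
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def parse_tiff_header(header: bytes) -> Tuple[str, int]:
    """Return (variant, offset of the first image file directory)."""
    byte_order = {b"II": "<", b"MM": ">"}.get(header[:2])
    if byte_order is None or len(header) < 8:
        raise ValueError("not a TIFF header")
    (magic,) = struct.unpack(byte_order + "H", header[2:4])
    if magic == 42:  # classic TIFF: 4-byte offset follows the magic number
        (offset,) = struct.unpack(byte_order + "I", header[4:8])
        return "TIFF", offset
    if magic == 43 and len(header) >= 16:  # BigTIFF: 8-byte offset at byte 8
        (offset,) = struct.unpack(byte_order + "Q", header[8:16])
        return "BigTIFF", offset
    raise ValueError("unsupported TIFF header")


# Placeholder URL; urllib.request.urlopen(req) would perform the partial download.
req = range_request("https://example.com/hazard_demo_cog.tif", start=0, length=16384)
print(req.get_header("Range"))

# Synthetic little-endian classic TIFF header, for illustration only.
sample_header = b"II*\x00\x08\x00\x00\x00"
print(parse_tiff_header(sample_header))
```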
Network Common Data Form (NetCDF) is an interface for array-oriented data, used for storing multi-dimensional variables. It is commonly used in the scientific community for multidimensional geodata storage (e.g. climate data). It is supported by ArcGIS and QGIS via toolbox conversion or extensions; most spatial processing tools require conversion into raster first.
GRIdded Binary or General Regularly-distributed Information in Binary (GRIB)
GRIB was standardized by the WMO and has been in operation since 1985. Similar to NetCDF, GRIB files are commonly used in meteorology to store historical and forecast weather data. It is a multidimensional file format with the advantages of self-description, flexibility and expandability. Tools such as grb2grid and QGIS can convert GRIB into rasters.
The Risk Data Library Collection sits within the World Bank Data Catalog and is meant to store standard risk data. The collection can be accessed from the collections page or used as a filter on the left bar to search for data within the collection.
Datasets can be submitted for review and publication on the Data Catalog by any World Bank Staff, ETC or STC. These people have the role of ‘Data depositor’.
Two approaches to upload data:
Datasets can be added to the RDL Collection by the RDL team, after approval.
Log in to the Data Catalog (top right bar): https://datacatalog.worldbank.org/int/home
View ‘My datasets’ (top right bar): https://datacatalog.worldbank.org/int/data/mydata
Click ‘Add data’ (top right bar): https://datacatalog.worldbank.org/int/data/add Select the option on the right: continue.
‘Essential Information’
‘Data Resources’
Additional information
When all required (and optional) information has been entered, click on ‘Save as draft’. The dataset will appear in your datasets list.
In cases where large volumes of project data should be uploaded, DDH team can assist with bulk upload. The workflow steps are:
DDH team is responsible for reviewing and publishing submitted datasets and, in the short term, for assigning datasets to the RDL collection during review.
The content posted has been included in current draft. Please refer to repo version for any comment or edit.
Issue to draft the content of next documentation while it's being restructured.
SEE ONLINE PREVIEW
Data preparation
This section is needed to give guidance on dataset formats and packaging.
Data structure
File formats
Check GeoNode guidance instructions on how to optimise geodata (compression, pyramids)
Packaging as resources