matamadio commented 3 years ago

Review of SEA DRIF datasets and assessment of procedure for inclusion in RDL

Web tool at https://tool.oi-analytics.com

Metrics presented

Infrastructure failure probability
- By infrastructure sector
Probabilistic estimates of infrastructure damage
- Expected annual damages (EAD) – At asset level
- Loss-probability distributions at province level by hazard and sector
Probabilistic estimates of wider economic losses
- Expected annual economic losses (EAEL) – At asset level
- Loss-probability distributions at province level by hazard and sector
CBA of some resilience interventions: retrofit (Vietnam only)

Overview

The website serves pre-processed hazard, exposure and risk geospatial layers. Most data, both hazard and exposure, are served as tiles of vector points. Cyclone hazard and flood levels in Vietnam seem to be stored as raster instead. The point grids do not exactly match the ones provided in the demo data, but are comparable in resolution.
Exposure data are from OSM (all vector data aggregated as grid)
Hazard data (river and coastal flood; cyclone) are derived from well-known open sources, WRI Aqueduct and IBTrACS respectively. All these data are originally raster, converted into point vector data for modelling, and tiled for the web service. Because each point still holds only one intensity value, the conversion gains nothing over the raster and isn't really optimal.
Data layers are stored as external files, not in a PostGIS DB.
Oxford mentioned that the model output is a MapInfo TAB file, which according to them is the only format that can store the full range of names and values. I'm a bit skeptical about this and will test conversions from the demo data.
The data transfer to client uses json and PBF format (ProtocolBuffer Binary Format, a standard format from OSM), which seems relatively efficient for tiling and subsetting (cyclone example: https://tool.oi-analytics.com/data/cyclone.json).
On the other hand, they said that the website data amount to roughly 500 GB, despite a relatively restricted spatial extent. This could likely be improved with some basic data optimisation.
- If hazard point data take up most of the space, they could be stored and served more efficiently as COGs, resulting in lower costs for us and less friction for the user.
- Storing data as tiled vectors may be cheaper initially, but the cost may explode as more data is added, so in the long term moving the underlying infrastructure towards something raster-based may be more effective.

**Suggested procedure for RDL ingestion (draft)**

Evaluation of SE Asia risk data and platform
Support to SE Asia risk data conversion and integration on the Risk Data Library
- Optimise original output (keep only key fields in .dbf table; detach data) and check if size is acceptable;
- Align metadata to RDL schema
- Include original hazard datasets as rasters
- ? Include original OSM data (non aggregated) ?
Support to Oxford team on improving data storage and processing efficiency
- suggest different formats/workflows if size not acceptable for storage and processing.
Support to SE Asia prototype platform hosting

Outputs:

SE Asia data openly available on the Risk Data Library and compliant with the Risk Data Library Standard
SE Asia data storage and processing improved
GFDRR Blog post on SE Asia data to promote the work

Team:

Mat
Takuya
Pierre
Maybe Jean on data storage and processing improvement

Estimated number of days and budget:

20 staff and STC days

matamadio commented 3 years ago

Example Data package from Gordon

Sample details

Country

Myanmar

Global datasets

- Fluvial and coastal flooding: WRI Aqueduct
- Cyclones: STORM IBTrACS model
- Road links: OpenStreetMap + OSRM
- Railway tracks: OpenStreetMap
- Electricity transmission lines: Gridfinder
- Fragility: Koks et al. 2019; Miyamoto et al. 2019; Habermann and Hedel 2018
- Socio-economic impacts: Gridded Population Density/Count (WorldPop Population Data); Gridded GDP per capita (DRYAD Gridded GDP)
- Costs: Koks et al. (2019), World Bank ROCKS database; World Bank PPI database.

Data types

Data in 4 folders:

Inputs
- Adm units
- OSM rails and roads
- Tree cover
- Electric Grid
- Cyclone hazard
- River Flood hazard (RP100 2050-RCP45)
- Coastal Flood hazard (RP100 Historical)
- Wind speed
- Pollution (NOX emissions)
Exposure (combination of inputs: hazard and asset at feature level)
- Electricity
- Rail
- Road
Risks (same as exposure, plus cost output - all attributes are maintained)
- Electricity (cyclone only)
- Rail (3 hazards)
- Road (3 hazards)
Summary (aggregation of Risks at ADM1 level)
- Electricity (cyclone only)
- Rail (3 hazards)
- Road (3 hazards)
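
The Summary folder is described as an aggregation of Risks at ADM1 level. A minimal sketch of that step, assuming the aggregation is a plain sum of asset-level values per ADM1 unit (the aggregation rule is not stated in the package) and using made-up records:

```python
from collections import defaultdict

# Hypothetical asset-level records: (ADM1 name, minEAD, maxEAD) in $M USD / year.
# The values are invented for illustration only.
assets = [
    ("Yangon", 0.25, 0.5),
    ("Yangon", 0.75, 1.5),
    ("Mandalay", 0.5, 1.0),
]


def aggregate_by_adm1(records):
    """Sum asset-level minEAD / maxEAD per ADM1 unit (assumed aggregation rule)."""
    totals = defaultdict(lambda: {"minEAD": 0.0, "maxEAD": 0.0})
    for adm1, min_ead, max_ead in records:
        totals[adm1]["minEAD"] += min_ead
        totals[adm1]["maxEAD"] += max_ead
    return dict(totals)


summary = aggregate_by_adm1(assets)
```

The result has one row per ADM1 unit, which is why the summary csv stays tiny compared to the asset-level tables.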

The columns listed below (and described above, in Risks) are also aggregated to the Admin1 level for 4. Summary. Depending on the type of data, the outputs may be per RP, epoch and scenario OR Annual (please see Risks, above, for details):

"assetDamage" (km)
"minEventCost" ($M USD direct cost)
"maxEventCost" ($M USD direct cost)
"gdp" ($M USD indirect cost)
"primary-gdp" ($M USD indirect cost, road / rail only)
"secondary-gdp" ($M USD indirect cost, road / rail only)
"tertiary-gdp" ($M USD indirect cost, road / rail only)
"expectedLengthDamaged" (km / year)
"minEAD" ($M USD / year direct cost)
"maxEAD" ($M USD / year direct cost)
"EAEL-gdp" ($M USD / year indirect cost all activities)
"EAEL-primary-gdp" ($M USD / year indirect cost primary industries only, road / rail only)
"EAEL-secondary-gdp" ($M USD / year indirect cost secondary industries only, road / rail only)
"EAEL-tertiary-gdp" ($M USD / year indirect cost tertiary industries only, road / rail only)

Geodata format review

All data use global EPSG: 4326 - WGS-84
Uncompressed size of data for Myanmar: 10 GB
- Raster data are used mostly for hazards and provided as .asc, an uncompressed text format. They could easily be converted into COGs, which offer better compression, pyramids and overall usability.
- Vector data are used for exposure and risk and are provided as MapInfo (.mid + .mif), an uncompressed, text-readable format for both geometries (.mif) and attribute table (.mid). Storage size is comparable to shp and gpkg.
- Each step from input to risk adds to the same table, so Risk includes all extracted Hazard and Exposure values, plus cost estimates. This causes huge tables.
Summary datasets consist of csv files with rows (= n ADM units) x n_attributes (max > 1,000). The csv files are very small in storage size compared to the equivalent vector-joined dataset.
Most of the data size is taken up by the MapInfo datasets: large attribute tables and associated metadata, with up to 2,000 attributes for some datasets, especially exposure and risk sets where there are many combinations [table = (model x climate scenario x hazard RP x cost type)]. The geometry itself is rather small in comparison.
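
To see how the attribute count grows, here is a rough sketch that enumerates column names for river flooding only, using the dimensions given in the hazard table further down (9 return periods, 1 historical plus 5 future models, 2 RCPs, 3 epochs). The naming scheme, the model placeholders and the choice of five per-event metrics are assumptions for illustration, not the actual layout of the OIA tables:

```python
from itertools import product

return_periods = [2, 5, 10, 25, 50, 100, 250, 500, 1000]
models = [f"model{i}" for i in range(1, 6)]  # placeholder names for the 5 future models
rcps = ["4p5", "8p5"]
epochs = [2030, 2050, 2080]
# Per-event metrics taken from the attribute list in this thread.
metrics = ["pFail", "assetDamage", "minEventCost", "maxEventCost", "gdp"]

# One column per metric and return period for the historical run.
historical = [f"{m}_hist_RP{rp:05d}" for m, rp in product(metrics, return_periods)]

# One column per metric, model, RCP, epoch and return period for future runs.
future = [
    f"{m}_{model}_{rcp}_{epoch}_RP{rp:05d}"
    for m, model, rcp, epoch, rp in product(metrics, models, rcps, epochs, return_periods)
]

n_columns = len(historical) + len(future)
```

Under these assumptions a single hazard already yields 1,395 per-event columns, so reaching 2,000 attributes once other hazards and annual metrics are added is not surprising.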

MapInfo PROs:

Suited for the current implementation of OIA webtool via OGR (although it gets first converted into GeoJson for web view) >> More? @ConnectedSystems
No limitations on field names or number of fields: exposure layers go up to 2,000 attributes (scenario x RP x cost type).

MapInfo CONs:

It is a proprietary format; we want open standards.
Huge friction for user GIS tools: the huge tables heavily affect load/process performance for the average user:
- Bad performance on QGIS3.6-3.14
- Compatibility issue on ArcGIS10, MapInfo10 (unable to load)

Geodata content

Hazard

The full set of hazards covers:

Hazard	Probabilities	Intensities and spatial extents	Climate scenario information
WRI Aqueduct flood hazard: fluvial (river); coastal flooding with subsidence (median value)	1/2, 1/5, 1/10, 1/25, 1/50, 1/100, 1/250, 1/500, and 1/1000	Flood depths in meters over 30 arc-second grid squares.	1 historical and 5 future climate models; RCP 4.5 and 8.5 emission scenarios; current and future maps in 2030, 2050, 2080
Cyclones from STORM IBTrACS model	28 different probabilities from 1/10 to 1/10000	3-hour time step wind gust speeds in m/s at 0.1-degree grid squares.	None

Exposure

Generalised asset cost information is applied to estimate exposed value, as described in Table 3-3 of the report. In some instances, the authors assumed a ±25% uncertainty in their cost estimates, in line with the assumptions of Koks et al. (2019).

Vulnerability/Fragility

Figure 3-1 shows direct damage (fragility) curves for assets from different studies. Since having one fragility curve is not ideal for such a generalised context, Koks et al. (2019) suggested adding uncertainty to the fragility information and used five curves (derived from the original) to test the sensitivity of damage estimates to different fragility values.

Figure 3-1: Generalised direct damage (fragility) curves vs flood depths for different types of infrastructure assets. (a) paved roads (from Koks et al. 2019); (b) unpaved roads (from Koks et al. 2019); (c) railway lines (from Koks et al. 2019); (d) power plants (from Miyamoto et al. 2019); (e) airports (from Habermann and Hedel 2018); (f) ports (based on expert judgment). The fragility curve for airports mainly represents flood damage to runways. The fragility curve for ports is based on expert input from a large port authority, details of which we cannot disclose. The boldest lines (State 1) are used in the original studies, while the other curves (State 2 – State 5) are derived from the original curve by multiplying by 2, 3, 4, 5.
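
The State 2 to State 5 curves are described as the original curve multiplied by 2, 3, 4, 5. A small sketch of how such derived curves could be evaluated, assuming piecewise-linear interpolation between curve points and a cap at 100% damage (both are my assumptions; the curve points below are invented, not read from the report):

```python
from bisect import bisect_right

# Hypothetical State 1 curve: (flood depth in m, damage fraction).
STATE_1 = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.2), (4.0, 0.25)]


def damage_fraction(depth, curve=STATE_1, multiplier=1):
    """Interpolate the State 1 curve, scale it, and cap the result at 1.0."""
    depths = [d for d, _ in curve]
    if depth <= depths[0]:
        base = curve[0][1]
    elif depth >= depths[-1]:
        base = curve[-1][1]
    else:
        i = bisect_right(depths, depth)
        (d0, f0), (d1, f1) = curve[i - 1], curve[i]
        base = f0 + (f1 - f0) * (depth - d0) / (d1 - d0)
    # Cap is an assumption: a damage fraction above 1 has no physical meaning.
    return min(1.0, base * multiplier)


# Damage fraction at 2 m depth for State 1 .. State 5.
states_at_2m = [damage_fraction(2.0, multiplier=k) for k in range(1, 6)]
```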

Figure 3-2: Generalised damage probability (fragility) curves vs wind speeds for different types of assets.

Impacts and risk

The main outputs are presented as three metrics:

Probability of failure – The total annual probability of failure of an asset when exposed to hazards of different exceedance probabilities (1/return periods). This is estimated by multiplying the asset fragility at each hazard level by the corresponding hazard exceedance probability, and then summing these products for a given climate scenario and time epoch.
Expected annual damages (EAD) – Estimated for an asset, this is the integral over the hazard exceedance probabilities and the corresponding direct damage value in US$ calculated with the asset fragility function and reconstruction cost associated with the asset.
Expected annual economic losses (EAEL) – Estimated for an asset, this is the integral over the hazard exceedance probabilities and the corresponding GDP loss in US$/day associated with the failure of the asset.
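
All three metrics boil down to integrating a per-event value over the hazard exceedance probabilities. A minimal sketch of that integration, assuming a trapezoidal rule over the available return periods with no extrapolation beyond the rarest or most frequent event (the report's exact numerical scheme is not given here, and the input values are invented):

```python
def expected_annual_value(return_periods, values):
    """Trapezoidal integral of per-event values over exceedance probability (1 / RP)."""
    pairs = sorted((1.0 / rp, v) for rp, v in zip(return_periods, values))
    total = 0.0
    for (p0, v0), (p1, v1) in zip(pairs, pairs[1:]):
        total += 0.5 * (v0 + v1) * (p1 - p0)
    return total


# Hypothetical asset: direct damage ($M USD) and fragility per return period.
rps = [2, 10, 100]
damages = [0.0, 10.0, 100.0]
fragilities = [0.0, 0.1, 0.5]

ead = expected_annual_value(rps, damages)                 # $M USD / year
annual_p_fail = expected_annual_value(rps, fragilities)   # dimensionless
```

The same function applied to per-event GDP losses would give the EAEL.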

Important attributes:

The probability of asset failure associated with the given exposure has been calculated, and added as a new attribute for each attribute representing hazard exposure. These attributes are the same as the exposure data, but have the word "pFail_" prepended (probabilities of failure are dimensionless):
- pFail__RP for present-day, for example "pFail_cyclone_RP00090"
- pFail_RP, for example "pFail_NS_2030_RP0002"
In addition, the expected length of asset damaged by each event has been calculated, and is stored in a new attribute with the word "assetDamage_" prepended (expected lengths of asset damaged are stored in km):
- assetDamage__RP for present-day, for example "assetDamage_cyclone_RP00090"
- assetDamage_RP, for example "assetDamage_NS_2030_RP0002"
The direct damage associated with the exposure (the cost of reinstating the predicted length of damaged infrastructure) has also been saved on the asset (stored in $M USD) and can be identified by "minEventCost" or "maxEventCost" prepended to the exposure hazard attribute and represent the minimum and maximum anticipated cost of works:
- minEventCost__RP for present-day, for example "minEventCost_cyclone_RP00090"
- maxEventCost__RP for present-day, for example "maxEventCost_cyclone_RP00090"
- minEventCost_RP, for example "minEventCost_NS_2030_RP0002"
- maxEventCost_RP, for example "maxEventCost_NS_2030_RP0002"
The indirect cost (value of GDP flowing through the asset) associated with the event damage has also been saved on the asset which can be identified by the prepended term "gdp_" (and which are stored in $M USD / year):
- gdp__RP for present-day, for example "gdp_cyclone_RP00090"
- gdp_RP, for example "gdp_NS_2030_RP0002"
Integrating over all RPs in an epoch and scenario gives the annual probability of asset failure, which is stored in a new attribute with "annualProbability_" prepended (a dimensionless probability of failure); the RP value is dropped from the attribute name:
- annualProbability_ for present-day, for example "annualProbability_cyclone"
- annualProbability, for example "annualProbability_NS_2030"
Multiplying the annual probability by the length of the asset yields the expected annual length of asset damaged under a nominated hazard (for a given epoch and scenario), in km; it can be identified by the word "expectedLengthDamaged_":
- expectedLengthDamaged_ for present-day exposures, for example "expectedLengthDamaged_cyclone"
- expectedLengthDamaged_, for example "expectedLengthDamaged_NS_2030"
The expected annual direct damages (EAD) associated with a combination of hazard, epoch and scenario have been calculated for every asset. These values are identified by the prepended word "minEAD" or "maxEAD", representing the range of expected annual damage associated with the hazard for the nominated epoch and scenario, and are stored in $M USD / year:
- minEAD_ for present-day exposures, for example "minEAD_cyclone"
- maxEAD_ for present-day exposures, for example "maxEAD_cyclone"
- minEAD, for example "minEAD_NS_2030"
- maxEAD, for example "maxEAD_NS_2030"
The indirect losses associated with the predicted annual damage to every asset have been calculated. These values are identified by the prepended word "EAEL-gdp_" and represent the anticipated loss to GDP ($M USD / year) associated with the hazard, for the nominated epoch and scenario:
- EAEL-gdp_ for present-day exposures, for example "EAEL-gdp_cyclone"
- EAEL-gdp, for example "EAEL-gdp_NS_2030"
Additionally, for road and rail assets it has been possible to estimate the split between primary, secondary and tertiary contribution to GDP, in addition to "gdp" and "EAEL-gdp" as described above. Where these data are available, additional attributes have been stored in the risk datasets, identified by the prepended words "primary-gdp", "secondary-gdp", "tertiary-gdp" for individual events and by "EAEL-primary-gdp", "EAEL-secondary-gdp", and "EAEL-tertiary-gdp" for annual indirect losses. All data are stored in $M USD / year:
- tertiary-gdp__RP for present-day, for example "tertiary-gdp_cyclone_RP00090"
- secondary-gdp_RP, for example "secondary-gdp_NS_2030_RP0002"
- EAEL-primary-gdp_ for present-day exposures, for example "EAEL-primary-gdp_cyclone"
- EAEL-tertiary-gdp, for example "EAEL-tertiary-gdp_NS_2030"
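
Since every attribute name follows the same "metric prefix, qualifier, optional RP" layout, the names can be split mechanically when selecting fields. A small sketch based only on the example names above; `parse_attribute` is a hypothetical helper, and it treats everything between the metric and the RP token as an opaque qualifier rather than guessing what "NS" or "2030" stand for:

```python
import re

RP_TOKEN = re.compile(r"^RP(\d+)$")


def parse_attribute(name):
    """Split an attribute name into metric, qualifier and return period (if any)."""
    # Metric prefixes use hyphens, never underscores, so the first "_" ends the metric.
    metric, _, rest = name.partition("_")
    tokens = rest.split("_") if rest else []
    return_period = None
    if tokens:
        match = RP_TOKEN.match(tokens[-1])
        if match:
            return_period = int(match.group(1))
            tokens = tokens[:-1]
    return {
        "metric": metric,
        "qualifier": "_".join(tokens),
        "return_period": return_period,
    }


per_event = parse_attribute("pFail_cyclone_RP00090")
annual = parse_attribute("EAEL-gdp_NS_2030")
```

Annual metrics come back with `return_period` set to `None`, which makes it easy to separate them from the per-event columns.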

Actions to host datasets into RDL

Align metadata to schema

Hazard: suggest regional clip of global scenarios (IBTrACS and WRI flood) over SEA region; else link to source download page. Metadata alignment should be straightforward.
Exposure datasets could be aligned to GED4ALL or other taxonomies > To be investigated more in detail
- Reduce fields to those strictly related to Exp schema > remove hazard intensity attributes
- Pollution hazard is not used per se but as proxy of traffic and thus as proxy of indirect losses. May be presented directly in exposure layers as value density (cost) indicator.
Vulnerability: to be extracted from the report and translated into functional sets > Subject to the general V ingestion approach (TBD)
Risk/summary: we have both "impacts" calculated for each individual asset feature, and risk aggregation at ADM level. Both have tables that are too large and should be reduced to key attributes only. Asset risk and summary are to be put as different layers in the same data package. Summary CSVs are joined to ADM geometries for visualisation and distribution.

Choose storage/distribution format

There are a number of alternative options that reflect different a) curation effort, b) friction for the user, c) compatibility with online tool and d) storage size criteria. A lot depends on the requirements of the analytics webtool, and if it will be discontinued. Please also note that the webtool does not use/show all the attributes in the datasets, only a few.

Option	Curation effort	Friction for user	Compatibility with Webtool	Storage size
Option 1	Low	Worst	Best	Bad
Option 2	High	Best	Best	Worst
Option 3	High + Dev	Best	Best	Best

Option 1

Host all datasets as their original format (ASC; MapInfo)

Option 2

Host all datasets as their original format (ASC; MapInfo) for the webtool
Make a selection of data and convert for public download:
- Hazards as COGs
- Exposure as GPKG; only key Exposure attributes in the table, or split large attribute tables into more layers
- Risk sets (hazard x exposure) excluded
- Summary to represent Loss as GPKG (csv joined to ADM1 geom)
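
The conversions listed for Option 2 can be done with the GDAL command-line tools. A sketch that only builds the commands, assuming GDAL 3.1 or later (needed for the COG driver) is installed; file names and the kept field names are hypothetical:

```python
import shutil
import subprocess


def cog_command(src_asc, dst_tif):
    """gdal_translate call: ASCII grid -> Cloud Optimized GeoTIFF with DEFLATE."""
    return [
        "gdal_translate", "-of", "COG",
        "-a_srs", "EPSG:4326",  # all data are stated to be in WGS-84
        "-co", "COMPRESS=DEFLATE",
        src_asc, dst_tif,
    ]


def gpkg_command(src_mif, dst_gpkg, keep_fields):
    """ogr2ogr call: MapInfo MIF/MID -> GeoPackage, keeping only selected fields."""
    return [
        "ogr2ogr", "-f", "GPKG",
        "-select", ",".join(keep_fields),
        dst_gpkg, src_mif,
    ]


def run_if_available(cmd):
    """Run the command only when the GDAL tool is on PATH; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


# Hypothetical file and field names.
hazard_cmd = cog_command("cyclone_rp100.asc", "cyclone_rp100.tif")
exposure_cmd = gpkg_command("roads.mif", "roads.gpkg", ["asset_id", "asset_type"])
```

The `-select` list is where the trimming to key Exposure attributes happens; everything else in the .mid table is left behind.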

Option 3

Convert all data into RDL-friendly format (COGs; GPKG; others) for public download
Adapt the analytics code to work on those formats instead of MapInfo

Value for RDL

This case study is a good starting point for settling how we are going to ingest infrastructure risk data and align them to existing standards. More datasets of the same type are expected from future projects.
It sets an example and roadmap for the challenges of format conversion, which are also expected in the future.
It proves the capacity of the L schema to hold both impact and risk information.
The online analytics tool and related feedback can help the development of similar upcoming risk analytics tools.

Time and feasibility

Depends on the chosen curation/storage option. The metadata alignment step requires the same time for all three, estimated at 4-5 full days. The data curation effort varies a lot.

Option 1 is the quickest, as we simply dump all their data without any processing. It could be done in a relatively short time.
Option 2 requires a dump of the original data AND manual processing of all datasets; a first selection of data can be added by the end of May, while other sets will take more time.
Option 3 requires manual processing of datasets AND revision of the tool code, so it will take more person-days (need @ConnectedSystems on code revision).

matamadio commented 3 years ago

If we consider only the input data to the analysis, I tested that the data size for one country can be made quite small by a simple format change.
- Original input (hazard scenarios and exposure asset as mif/asc): 914 MB
- After conversion to gpkg/tif: 170 MB (160 are just the road network!)
Exposure data would drop all hazard attributes (becoming the same as "input", with negligible size).
Optimisation of risk data can reduce size to 1/4 - 1/6 by just keeping the analysis output, and dropping all byproduct attributes.
Example for electric grid x cyclone hazard keeping only 4 risk attributes (lenght_damaged, minEAD, maxEAD, EAEL_gdp):
- Risk: 90 MB as .mid >> 17.5 MB as .gpkg
- Summary: 16 KB as .csv >> 5 MB as .gpkg

So the final size of optimised data could be around 1/5 of original data, relevant for Option 2 and 3.
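
For reference, dropping the byproduct attributes is just a column filter. A sketch on an in-memory csv, assuming the four kept attributes map to the "expectedLengthDamaged_", "minEAD_", "maxEAD_" and "EAEL-gdp_" prefixes described earlier; the id column and the values are made up:

```python
import csv
import io

# Prefixes of the analysis outputs to keep (assumed mapping to the 4 risk attributes).
KEEP_PREFIXES = ("expectedLengthDamaged_", "minEAD_", "maxEAD_", "EAEL-gdp_")


def keep_key_columns(rows, id_field="id", prefixes=KEEP_PREFIXES):
    """Keep the id column plus any column starting with one of the prefixes."""
    return [
        {k: v for k, v in row.items() if k == id_field or k.startswith(prefixes)}
        for row in rows
    ]


# Hypothetical one-row risk table for electricity x cyclone.
sample = io.StringIO(
    "id,pFail_cyclone_RP00090,assetDamage_cyclone_RP00090,"
    "minEAD_cyclone,maxEAD_cyclone,EAEL-gdp_cyclone,expectedLengthDamaged_cyclone\n"
    "line-1,0.2,1.5,0.01,0.02,0.3,0.4\n"
)
rows = list(csv.DictReader(sample))
slim = keep_key_columns(rows)
```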

matamadio commented 3 years ago

As Pierre confirmed that the online tool will be discontinued, we just focus our effort on using the data for RDL services, without need to link them to the current web application. Then the option is simply:

Convert all data into RDL-friendly format (COGs; GPKG; others) for public download

This should be done by mid June, but depends on how fast they provide the rest of the data, and how fast we solve any arising conversion issues.

matamadio commented 3 years ago

Following up conversation with @stufraser1 and @jeanpommier:

Inclusion of global hazard data from some third parties (WRI; IBTrACS) may follow the same approach, still to be defined in GFDRR/rdl-standard#31. Until that is set, we can include global sets as reformatted (tif) regional clips of data (e.g. SEA region extent).
Inclusion of Exposure (OSM, OSRM, Gridfinder) as original source + socioeconomic + cost attributes + taxonomy into gpkg format
Inclusion of Impact/Risk database as csv that can be joined to Exposure data based on ID, for both asset level and ADM1 level
- (+) Better for table data analysis
- (+) Better storage
- (+) Users choose only the data of interest to load in GIS for spatial representation (quicker, lighter)
- (-) Join done manually by the user; on 1000+ rows it can take a while
Inclusion of readme explaining each attribute
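
The user-side join between the risk csv and the exposure layer is simple enough to document in the readme. A minimal sketch with plain dictionaries, assuming a shared `asset_id` field (hypothetical name) and made-up values; in practice the exposure records would come from the gpkg:

```python
import csv
import io

# Hypothetical exposure features keyed by ID.
exposure = {
    "road-001": {"asset_type": "road", "length_km": 12.0},
    "road-002": {"asset_type": "road", "length_km": 3.5},
}

# Hypothetical risk table distributed as csv.
risk_csv = io.StringIO(
    "asset_id,minEAD_cyclone,maxEAD_cyclone\n"
    "road-001,0.01,0.02\n"
    "road-999,0.5,0.9\n"
)


def join_risk(exposure_by_id, risk_rows, id_field="asset_id"):
    """Inner join: attach risk values to exposure features with a matching ID."""
    joined = {}
    for row in risk_rows:
        asset = exposure_by_id.get(row[id_field])
        if asset is None:
            continue  # risk row without a matching exposure feature is skipped
        values = {k: float(v) for k, v in row.items() if k != id_field}
        joined[row[id_field]] = {**asset, **values}
    return joined


joined = join_risk(exposure, csv.DictReader(risk_csv))
```

A dictionary lookup per row keeps the join linear in the number of rows, so the "it can take a while" concern mostly applies to doing the join by hand in a desktop GIS.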

GFDRR / rdl-standard

[DATA] Inclusion of OIA-SEA datasets #33
