Exposure taxonomies format

GFDRR / rdl-standard

The Risk Data Library Standard (RDLS) is an open data standard to make it easier to work with disaster and climate risk data. It provides a common description of the data used and produced in risk assessments, including hazard, exposure, vulnerability, and modelled loss, or impact, data.

https://docs.riskdatalibrary.org/

Creative Commons Attribution Share Alike 4.0 International

16 stars 1 forks source link

Exposure taxonomies format #18

Open stufraser1 opened 3 years ago

stufraser1 commented 3 years ago

Exposure import currently relies on python script to import from .nrml Adapt hazard import .py to pull data into exposure DB from csv/geojson?

matamadio commented 3 years ago

The current approach by GEM is openquake-centered: is based on their EQ buildigs taxonomy and makes use of online taxonomy tool and input preparation tool - login is required. This creates significant friction in the data preparation workflow. We need to address this asap if we want demo data for exposure online soon. I see three alternatives:

keep the current openquake taxonomy workflow, somehow trying to integrate it in RISKi
- still would require uploader to go through GEM taxonomy form page for each dataset -friction-
- applies only to buildings (no infrastructures, land uses, or else), only for shapefiles
- overkill when source data do not provide so much details, like on the average case of OSM-derived data.
change it to include the whole GED4all standard schema and not just openquake
provide our own simplified taxonomy by adding non-mandatory fields based on enum in the exposure schema, e.g.
- main building material
- number of storeys above ground

pzwsk commented 3 years ago

Thank you Matt,

This is something I was wondering while reading the documentation and discussing with Open Cities colleagues. The GED4ALL taxonomy is a very comprehensive but one that is very specific to earthquakes, right?

Why not using OSM tags directly and building extensions for non-physical assets such as socio-economic indicators?

https://wiki.openstreetmap.org/wiki/GED4ALL

PS: we miss a decision process for any enhancement of the schema. I would like that type of issue to be discussed in the open with people from GEM and others.

Best,

stufraser1 commented 3 years ago

GED4ALL is the taxonomy created under the Challenge Fund, it accounts for attributes relevant for all the hazards considered in RDL; it was based on GEM taxonomy which was earthquake specific. It is very comprehensive, and includes all the data in one string which is unusual compared to other schema, which separate them in other columns, but is also already simplified from the GEM taxonomy, which was very engineering-focussed. It should definitely be discussed with GEM and others because other links in the schema are now predicated on the use of this taxonomy. Same for the V schema issue - suggestion it should be cut down is a significant departure from the challenge fund work and not one we should take lightly without consultation with the development partners - a lot of work went into designing a comprehensive schema that could be applied to different V/F functions and accommodate all the details about them while being somewhat future-proofed, not just for for current availability of open V data.

On Thu, 28 Jan 2021 at 10:56, Pierre Chrzanowski notifications@github.com wrote:

Thank you Matt,

This is something I was wondering while reading the documentation and discussing with Open Cities colleagues. The GED4ALL taxonomy is a very comprehensive but one that is very specific to earthquakes, right?

Why not using OSM tags directly and building extensions for non-physical assets such as socio-economic indicators?

https://wiki.openstreetmap.org/wiki/GED4ALL

PS: we miss a decision process for any enhancement of the schema. I would like that type of issue to be discussed in the open with people from GEM and others.

Best,

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/GFDRR/rdl-data/issues/8#issuecomment-768971616, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC7PNYUAVFFL4Q2GIGORD6DS4E7E5ANCNFSM4UJDX6SA .

pzwsk commented 3 years ago

Thank you Stu,

but then I don't understand your comments @MamadioCMCC.

The current approach by GEM is openquake-centered: is based on their EQ buildigs taxonomy and makes use of online taxonomy tool and input preparation tool - login is required. This creates significant friction in the data preparation workflow.

Why can't we just use the RDL documentation?

stufraser1 commented 3 years ago

Mat is correct, creation of the import file relies on the OpenQUake taxonomy generator (which doesn't account for the types of asset Mat included) - and may not account for all of the additional attributes required in GED4ALL (not fully tested). The issue is around how we generate the taxonomy string - and we could specify in guidelines that a full taxonomy string is possible, but we recommend a short string that works for most data. It may require replicated in RDL docs, some of the GEM taxonomy guidance, but unavoidable I think. One of the support tools we need is to map from separate occupancy/construction/yearbuilt fields into GEM taxonomy (I am working on an open framework to do this with IDF - GED4ALL to OED, which we could leverage)

On Thu, 28 Jan 2021 at 11:15, Pierre Chrzanowski notifications@github.com wrote:

Thank you Stu,

but then I don't understand your comments @MamadioCMCC https://github.com/MamadioCMCC.

The current approach by GEM is openquake-centered: is based on their EQ buildigs taxonomy and makes use of online taxonomy tool and input preparation tool - login is required. This creates significant friction in the data preparation workflow.

Why can't we just use the RDL documentation?

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/GFDRR/rdl-data/issues/8#issuecomment-768982676, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC7PNYRT7I6RJRYR33VPGTDS4FBOJANCNFSM4UJDX6SA .

matamadio commented 3 years ago

Thanks Stu, I was about to rephrase it to explain better. The point is, the implementation of EXP import process was done by GEM and doesn't match the GED4all schema, but rather the openquake platform: buildings only, and very detailed engineering info about each component of the bulding. So if we follow that, we can only describe exp that is made of vector buildings footprints. The GED4all exposure and taxonomy schema was developed for a broader purpose and it is consisent with OSM data (very convenient since OSM is more and more often the source of exposure datasets), as from the wikipage.

I conveyed most of it in our docs under exp>taxonomy: GED4all since we specifically called the exp schema with that taxonomy name. It is a redux version which covers the cases of interest (buildings, infra, indicators, etc) while leaving out some very specific ones (e.g bridges and some lifelines attributes).

I understand that taking a standard and trimming it to our purpose kinda defies the aim of the standard itself, and that applies to the V schema discussion as well. Also, changing a shared standard requires shared discussions.

I guess at the end schema implementation doesn't care of how many attributes are included; then it is not an issue of how big and comprehensive the schema is, but rather how it is shown and managed by the user. If the import workflow is manual and the user needs to look through 100 attributes codenames in the documentation to import data, that is not really sustainable. In a guided hypotesis where there is an import GUI (similar to taxtool, or geonode import page) that split more general metadata (contribution etc) from tables of specific parameters, that probably would not impact the user as much. But then again, one thing is setting a taxonomy string for one consistent dataset (e.g: roads.shp, all have same general tax code). Another thing is having 34k features (e.g. individual buildings), each with their own specific and detailed taxonomy string. I did spent time for manual translation of SWIO-RAFI tax strings and understood that is something nobody will (want to) do just to ingest data on a libray.

See prev discussion on the import workflow: https://docs.google.com/document/d/1Dn36mcllloqlilk4_AfUft1z7ni6eR0v8ROhJKuR-kU/edit#heading=h.d8n0zgylaje4

pzwsk commented 3 years ago

Thanks to both, this clarifies things but also reinforces my opinion that we should focus and document the schema and related data exchange format first, not the database.

We will never achieve an automated import workflow, but we may achieve more and more users adopting similar data exchange formats and so reducing friction between data platforms.

Most of our discussions are around that issue, import files, output files, etc.

Best,

matamadio commented 3 years ago

Note that this whole discussion applies only to vector footprints data where taxonomy is known, whereas many vector datasets come empty of any tax details and other datasets come as aggregated raster, so also void of taxonomy (that is the case of many under review). Instead they include broader definitions such as categories, (eg "residential", "population") which are already covered by other schema fields. So the first suggested change would be to make the ged4all.asset.taxonomy attribute optional instead of required.

Then, I'm thinking of a "taxonomy-agnostic" option, also considering how data and metadata would be stored and served and the previous "what is data, what is meta" discussion.

Example: we have one vector of buildings footprints. Each item (geom) has some type of tax code, or other fields related to taxonomy, e.g. "building material". Now, following an import approach based on one standard (e.g GED4all), it would mean to prepare the dataset by translating all taxonomy-related info into one string them into that standard - we already see this is not sustainable; plus, in a scenario where we store vector data out-of-DB, the in-DB metadata could only show aggregate list of all unique tax codes used in the file (ie no DB selection of individual records based on taxonomy string criteria).

So we could:

keep any taxonomy-related information as originally provided, only in the vector dataset (storage);
remove the ged4all.asset.taxonomy field from schema; keep only ged4all.exposure_model.taxonomy_source (link to appropriate online docs)
use instead ged4all.tags as flexible way to identify all those fields that are taxonomy-related (one or more) when present: tag.tax_material, tag.tax_storeys, etc; the metadata extraction would list unique values for each tag.
we keep the documentation on GED4all most relevant taxonomy fields for data creators who are willing to fit their data with one suggested standard

So that means there is no interpretation of strings' meaning (or building of appropriate string) during the data curation or the import process; only quick identification of those fields that are related to taxonomy. The user that downloads the dataset will lookup the taxonomy_source information and use them to interprate the list of tax fields if required.

stufraser1 commented 3 years ago

On the point that the discussion applies only to vector footprints data where taxonomy is known, whereas many datasets are void of taxonomy and defined more broadly as e.g. residential: Its true that this is how many datasets are defined now, but I believe we should keep the 'tighter'/more comprehensive definition in the taxonomy field and keep it mandatory because: 1) we are trying to improve the definition of data, including encourage more detailed and granular definition of exposure characteristics in the data [while the category fields provides general information for grouping datasets via metadata, and needed especially when data is stored out-of-DB], 2) the current string taxonomy is used to relate V curve to asset types, which facilitates a search of compatible V curves for given type of asset data, 3) the more comprehensive taxonomy allows to record data as residential and all other fields as unknown, thus reflecting the original data definition but which much more clarity of what characteristics are known and unknown.

I agree it can be a significant overhead/friction to convert data into the string taxonomy. We need to make this component as easy as possible through tooling and worked/most common examples for development data, while making clear that the string can support much more detail. Connected to this, something we've considered before is how human readable the GEM taxonomy string is (not very); So having a way to provide that as easy to read attributes (and allow search via those terms) is key - GEM tried to address this with OQ taxonomy generator. A middle ground could be to apply a less cryptic / human-readable string of characteristics using words instead of codes, but this requires to define our own taxonomy and would be less compatible with OSM.

I'm not a fan of putting attributes in 'tags' - I think these are interpreted as much more optional / generic / additional fields and not core to the comprehensive definition of exposure data. Having taxonomy in one string in one field is also used for the varied infrastructure taxonomies we look to implement too, without the number of occupancy/construction fields (or tags) proliferating to large numbers.

matamadio commented 3 years ago

I got your point about pushing data creators to add the taxonomy string whenever possible; still, we have (and will have) many pieces of exposure data that don't have any tax string, or where string is not applicable. If the tax string is mandatory, how those datasets could be added? This is a bit of urgent question for adding showcase datasets.

This somehow relates to GFDRR/rdl-standard#13 as well: what is metadata and what is data (= what is indexed and what is not).

stufraser1 commented 3 years ago

We need to convert to the taxonomy the best we can. GEM building taxonomy offers the option to use a 'full taxonomy', 'short taxonomy', or a balance: 'taxonomy with unknowns omitted' A residential single storey unreinforced masonry building (all other attributes unknown) would be MUR/RES+RES1 in short tax and DX/MUR//DY/MUR//RES+RES1/////// in tax with unknowns omitted. Full tax would be DX+D99/MUR+MUN99+MO99/L99/DY+D99/MUR+MUN99+MO99/L99/H99/Y99/RES+RES1/BP99/PLF99/IR99/EW99/RSH99+RMT99+R99+RWC99/F99+FWC99/FOS99 It has been proposed in the past to use short tax. I think we continue with that - it serves the needs of the data we handle, while offering extension if data improves in long term to include further terms.

We generally know in a vector building dataset whether a dataset is 100% residential, 100% commercial, etc, or in more granular dataset we have an occupancy given for each building. For these we can use the taxonomy generator TAXTWEB, or this very basic excel lookup of only the main characteristics, that we generally see https://docs.google.com/spreadsheets/d/1Z8sVBr_-MCxhhuywu5vfTK1z4MBTLbI-/edit#gid=2116605931 This is an example and should be replaced with a coded tool later (possible to do this with the IDF open exposure mapping eventually; https://github.com/OasisLMF/OasisDataConverter; I am working on development of that in another contract). For now though, this tool will help us to gradually build-up a lookup table of the characteristics we most often encounter, to assist data conversion into RDL

For aggregated building distribution datasets (as raster, and e.g. R5) it is more problematic. These cases often give a total replacement cost as the raster value. In the past we have received a separate document providing the occupancy breakdown which has been dependent on dominant landuse class per cell. One landuse class might have 80% residential single storey, versus 90% residential high-rise, or 60/70 split of commercial and industrial. For these datasets, I think the only option is to record the taxonomy as mixed used ('MIX', in the short taxonomy) in the taxonomy string, and provide a link to the supporting document in the data record.

matamadio commented 3 years ago

Let's take a practical example with the exposure layers from disasterrisk.af to see how that works.

There are 55 rasters and 73 vectors.

Of 55 rasters, 29 refer explicitly to buildings and have names containing a couple of info that can be somehow related to some main TAX attributes: stone, clay, rural, urban (which is not strictly a TAX category), residential, non residential (but occupancy attribute has already its own common field in the schema).

All non-residential buildings (People-USD) Clay Rural Structure (USD) Clay Urban Structure (USD) Non residential (USD) Non-residential school buildings (People) Non-residential school buildings (USD) Rural Buildings (count) Stone Rural Structure (US$) Stone Urban Structure (US$) Urban Building (count) 19 types of residential capital stock

Others refer to general infrastructures ("roads", "airports") and have no info about the taxonomy, nor it can be inferred. But the TAX tool does not cover these anyway (GED4all does).

Of 73 vectors, majority represent aggregated values of residential and non residential capital stock, and do not include any taxonomy info in the table, apart from the very specific stock code (not covered by taxonomy).

Only 2 vector layers actually represent building footprints:

OSM buildings at national scale; the table preview promises some TAX fields but in reality the file downloaded has none, and only covers a fraction of the country. Then I downloaded the full updated dataset from geofabrik OSM which covers the whole country, but still misses any TAX information
OSM buildings for Kabul; has "type" which however doesn't seem complete or consistent (most values are "yes")

TL;DR For AFG exposure dataset, the GEM building taxonomy (via TAXIWEB tool) could be applied "properly" (individual footprints) to 2 layers over 128, but there are no information included in the original OSM file to actually do that. A short taxonomy string covering 1) material and 2) occupancy (already covered by exp schema attribute) can be attributed as unique value to whole layer for some of the aggregated buildings layers (grid); rural-non rural categories don't seem to be accounted in the TAXIWEB tool.

That's why I think we need a less strict approach to the taxonomy, or to just avoid mentioning it while investigating more about other standard-based approaches.

stufraser1 commented 3 years ago

Where no information is provided, we can define unknown usage in the taxonomy. We need to be able to decide which. Agree rural-urban is not accounted for in the tax, and it is probably one case we can't resolve. But our processes and documentation of proposed Short Taxonomy strings for buildings and for other infrastructure should guide users to use a defined taxonomy for it to be a standard, rather than allow multiple definitions. By increasing the no. of datasets with more comprehensive definition of attributes, we improve exposure data as a whole. The proposed taxonomy handles most cases we come across; we just need to define interoperability tools to help get existing data into that form.

Where exposure is defined as urban or rural, and used in modelling, we should really be pushing for a taxonomy string to be defined as well.

stufraser1 commented 1 year ago

Specific guidance and examples on adding occ/con taxonomies to be added to new version of docs under exposure section

pzwsk commented 10 months ago

Should we close that one?