GFDRR / rdl-standard

The Risk Data Library Standard (RDLS) is an open data standard to make it easier to work with disaster and climate risk data. It provides a common description of the data used and produced in risk assessments, including hazard, exposure, vulnerability, and modelled loss, or impact, data.
https://docs.riskdatalibrary.org/
Creative Commons Attribution Share Alike 4.0 International
16 stars · 1 fork

No matter how many imt we propose, the datasets will use a different one #5

Closed matamadio closed 1 year ago

matamadio commented 3 years ago

We have a big list of possible imt (intensity measures) for hazard processes. However, we already have some datasets that are not compliant:

http://jkan.riskdatalibrary.org/datasets/hzd-afg-ls-lav/ This one measures snow avalanche pressures in kilopascals (kPa) (I had to put Debris-flow intensity index instead)

http://jkan.riskdatalibrary.org/datasets/hzd-afg-dr/ This one measures water availability as percentage over total demand (I had to put SPI instead)

In general, I suggest thinking of some other approach for this; trying to anticipate every possible unit of measure looks utopian.

pzwsk commented 3 years ago

Moving to standard repo.

pzwsk commented 3 years ago

My suggestion is to use an open codelist for the field imt.

See https://standard.open-contracting.org/infrastructure/latest/en/reference/codelists/

stufraser1 commented 3 years ago

This issue was anticipated and discussed at length when defining IMTs; however, we perhaps did not anticipate it becoming such a problem so soon. The choice was made to define a list of the most common IMTs to ensure vulnerability and hazard could match. With a single common reference to IMT, the potential mismatch in codes becomes less of an issue, but we should be aware that we may find more hazard data than anticipated with no matching V curve. In the envisaged function of matching E/H/V curves that are compliant (i.e. use the same asset/hazard coding) this will become clear, but we may also want to communicate the availability of V curves for the same types of hazard that do not exactly match the IMT (people can then choose to apply a conversion if applicable).

matamadio commented 3 years ago

Exactly, in some cases a V function could be applied to a hazard even if the IMT does not match, using unit conversions. So IMT should not be the attribute that links the V and E schemas; rather hazard type/process.

stufraser1 commented 1 year ago

@odscrachel @duncandewhurst advice here please, and let's include any change in the next version of the schema

duncandewhurst commented 1 year ago

Nice to see the Open Contracting for Infrastructure Data Standard make an appearance here!

This sounds like a scenario where an open (representative) codelist would make more sense than a closed (comprehensive) codelist. If using an open codelist for intensity measure, I agree with @matamadio that it should not be the attribute on which to link vulnerability and event schemas. If the codelists for hazard type and process are closed (i.e. publishers can't add their own values), then linking based on those sounds good.
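The open-versus-closed distinction can be sketched in a few lines of code. This is a minimal sketch assuming a simple string-valued imt field; the codes, helper names, and flagging behaviour below are hypothetical, not part of the RDLS schema.

```python
# Hypothetical contrast between closed and open codelist validation.
# STANDARD_IMT_CODES and the helper names are illustrative only.

STANDARD_IMT_CODES = {"PGA:g", "PGA:m/s2", "SA(0.3):g", "fl_wd:m"}

def validate_closed(imt):
    """Closed codelist: an unknown code makes the metadata invalid."""
    return imt in STANDARD_IMT_CODES

def validate_open(imt):
    """Open codelist: any non-empty string validates, but unknown codes
    are flagged so they can be considered for standardisation."""
    if not imt:
        return False, "imt must not be empty"
    if imt not in STANDARD_IMT_CODES:
        return True, "non-standard code %r: consider proposing it" % imt
    return True, None

# A snow-avalanche pressure measure missing from the list:
assert validate_closed("av_pr:kPa") is False   # publisher must omit the field
assert validate_open("av_pr:kPa")[0] is True   # accepted, but flagged
```

Under an open codelist the publisher keeps the field populated, which is what later lets users filter on it at all.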

stufraser1 commented 1 year ago

Propose to use an open list. Hazard and vulnerability curves that align can be filtered on hazard and process type rather than imt, in the same way that they are filtered on occupancy category rather than the full taxonomy string.

pslh commented 1 year ago

Propose to use an open list. Hazard and vulnerability curves that align can be filtered on hazard and process type rather than imt, in the same way that they are filtered on occupancy category rather than the full taxonomy string.

I understand the motivation for this approach, but this further complicates the problem of finding appropriate vulnerability models for hazard + exposure. If I have earthquake hazard data with IMT PGA (g) that does not mean I can easily use a vulnerability curve using an IMT SA(0.3). Hazard and process type are necessary but not sufficient to select vulnerability models.

stufraser1 commented 1 year ago

Hazard and process type are necessary but not sufficient to select vulnerability models. True - filtering on hazard type and process would present a wide selection of curves for the user to select from and put the task of imt selection on them. Providing an optional imt filter may suffice?

So we have the following options?

  1. Allow an open imt codelist with tooling that filters on hazard type and hazard process. This would show all vulnerability/fragility curves for a hazard type/process, so the user would see curves with multiple imts, e.g. PGA(g), SA(0.3), etc. In this case we're giving users a wide range of results that they have to sift themselves. This may be an issue with lots of curves in a catalog, so users may need an additional optional filter on imt to address https://github.com/GFDRR/rdl-standard/issues/5#issuecomment-822301000.

The big concern previously discussed in the project with @pslh and others is that we have a complicated codelist combining a label and unit which could be written in any number of ways. Data providers could reference seismic data with a pga metric as 'PGA(g)', 'PGA-g', 'pga_g', etc., which is why we defined the codelist in the first place. We've tried to define the vast majority of imts in that codelist but admit that there can be others - and snow avalanche wasn't one that the original project team really looked at, so while that was a gap, for FL, EQ, TS, etc. the codelist should be much more complete.

  2. Define a closed imt codelist with tooling that filters on hazard type, hazard process and imt. This would allow users to select directly a smaller set of vulnerability/fragility curves appropriate for use with a given hazard dataset than in option 1 (showing curves for the selected hazard imt only). The imt filter in tooling could still be optional so users can choose to see a broader range of curves if they want to.

In either option, the onus would be on us to add imt codes as we uncover new ones in the 'data upload sprint', to do targeted research to get as close as possible to 100% coverage, and to guide users on how to structure new ones (it is likely naive to expect all users will follow this, but admins should). If we're doing this anyway to prevent later problems, do we just commit to this approach and use a closed list?
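The two options above differ only in whether the imt filter is mandatory, optional, or absent in tooling; a minimal sketch, with made-up curve records and field names, assuming a flat list of curve metadata:

```python
# Illustrative curve catalog; hazard/process/imt values are made up
# for the example and are not RDLS codelist entries.
curves = [
    {"hazard": "EQ", "process": "QGM", "imt": "PGA:g"},
    {"hazard": "EQ", "process": "QGM", "imt": "SA(0.3):g"},
    {"hazard": "FL", "process": "FLU", "imt": "fl_wd:m"},
]

def find_curves(hazard, process, imt=None):
    """Filter on hazard type and process; imt is an optional extra
    filter (option 1 behaviour with imt=None, option 2 with imt set)."""
    hits = [c for c in curves
            if c["hazard"] == hazard and c["process"] == process]
    if imt is not None:
        hits = [c for c in hits if c["imt"] == imt]
    return hits

assert len(find_curves("EQ", "QGM")) == 2              # broad: user sifts imts
assert len(find_curves("EQ", "QGM", imt="PGA:g")) == 1  # narrowed by imt
```

Making `imt` an optional argument gives both behaviours from one filter, which is why the choice is largely one for the consuming application rather than the standard.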

matamadio commented 1 year ago

Option 2 makes more sense to me; it is true that you can't use a V model with a different imt than the hazard. Sometimes this can be easily solved by a simple hazard metric unit conversion (e.g. cm to m); other times it is not possible. Giving the user the larger range of applicable datasets - with optional imt filtering - seems the safest option. But this is more related to the consuming application (filtering and sorting) than to the standard itself.

duncandewhurst commented 1 year ago

If you aren't confident that the IMT codelist is comprehensive, I would recommend going with an open codelist. Tooling can still implement a filter based on an open codelist and, as you suggest, guidance can be provided to publishers on how to structure new codes and on how to flag them to be considered for addition to the standardised codelist.

The problem with a closed codelist that isn't comprehensive is that if a publisher has a dataset that uses an IMT that isn't in the codelist, they have no option other than to omit the field, otherwise their RDLS metadata will be invalid. That means there will be some datasets without an IMT so users filtering on IMT will either miss those or have to actually open the datasets for which no IMT is listed in order to determine whether they are relevant.

Edit: correct typo

pslh commented 1 year ago

If you aren't confident that the IMT codelist is comprehensive, I would recommend going with an open codelist. [...]

The problem with a closed codelist that isn't comprehensive is that if a publisher has a dataset that uses an IMT that isn't in the codelist, they have no option other than to omit the field, otherwise their RDLS metadata will be invalid. That means there will be some datasets without an IMT so users filtering on IMT will either miss those or have to actually open the datasets for which no IMT is listed in order to determine whether they are relevant.

In the event that someone wished to contribute a dataset for which the existing IMT codelist was really not applicable, I would prefer to have a new IMT value that at least gives us some clue as to what they think they need, rather than the alternatives (a NULL IMT, a random choice of wrong IMT, or no contribution). I think this means I agree with the open codelist approach. It also means we are going to have to think about how we help the community match e.g. hazard to vulnerability: I wonder if Tiziana and others have ideas or guidance for this problem.

Edit by @duncandewhurst: correcting a typo in the quote. Looks like Paul understood what I meant anyway :-)

stufraser1 commented 1 year ago

Let us continue with the open codelist approach then. Noted to canvass views on guiding users to match hazard and vulnerability. Moving to agreed and ready.

johcarter commented 1 year ago

Did you consider splitting the im code into intensity measure and unit as separate fields? It seems to me there is a natural division there, which would give people the freedom to specify an alternative unit for the same IM without having to make up a new code.

pslh commented 1 year ago

Did you consider splitting the im code into intensity measure and unit as separate fields? It seems to me there is a natural division there, which would give people the freedom to specify an alternative unit for the same IM without having to make up a new code.

There was some discussion around this, I think at the time we were concerned about e.g. matching a vuln curve in m/s2 (or ft/s2 or whatever) to hazard in g, and there being existing contributions in both. The Unified Challenge Fund DB requires users to select from a (fixed) list of IMT codes, but the common IMT table does indeed separate im_code from units:

 process_code | hazard_code |     im_code     |                         description                          | units 
--------------+-------------+-----------------+--------------------------------------------------------------+-------
 QGM          | EQ          | PGA:g           | Peak ground acceleration in g                                | g
 QGM          | EQ          | PGA:m/s2        | Peak ground acceleration in m/s2 (meters per second squared) | m/s2
... 
 QGM          | EQ          | SA(0.2):g       | Spectral acceleration with 0.2s period                       | g
 QGM          | EQ          | SA(0.2):m/s2    | Spectral acceleration with 0.2s period                       | m/s2

This seems to me to be once again a case of "vulnerability matching" vs "flexibility in contribution"; if we are prioritizing ease of contribution perhaps splitting units makes sense.
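The g versus m/s2 case above is one where a conversion is well defined (1 g = 9.80665 m/s2, standard gravity), which is what makes matching across units feasible at all. A sketch of such a conversion; the helper name and its narrow scope are hypothetical:

```python
# Hypothetical helper converting PGA between g and m/s2 so a hazard
# dataset in one unit can be matched to a vulnerability curve in the
# other. Only this one pair is handled; it is not a general converter.

STANDARD_GRAVITY = 9.80665  # m/s2 per g

def convert_pga(value, from_unit, to_unit):
    """Convert a PGA value between 'g' and 'm/s2'."""
    if from_unit == to_unit:
        return value
    if (from_unit, to_unit) == ("g", "m/s2"):
        return value * STANDARD_GRAVITY
    if (from_unit, to_unit) == ("m/s2", "g"):
        return value / STANDARD_GRAVITY
    raise ValueError("no conversion from %s to %s" % (from_unit, to_unit))

assert abs(convert_pga(0.5, "g", "m/s2") - 4.903325) < 1e-9
assert abs(convert_pga(9.80665, "m/s2", "g") - 1.0) < 1e-12
```

By contrast, PGA(g) to SA(0.3) has no unit conversion at all, which is why hazard and process type alone cannot pick the right curve.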

duncandewhurst commented 1 year ago

No objections to splitting out the unit from my perspective. @johcarter's point is a good one!

johcarter commented 1 year ago

The thought occurred when I was trying to fit PiWind into the hazard data JSON format: when describing the resource file I realised that windspeed was measured in knots, not km/h, and there was no suitable im_code in the list.

  "resources": [
    {
      "name": "footprint.bin.z",
      "path": "footprint.bin.z",
      "title": "Event footprint file",
      "mediatype": "custom binary",
      "imt": "v_ect(1m):km/h",
      "data_uncertainty": ""
    },

I would support splitting out unit into a separate field to add flexibility.
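As a sketch of the proposed split, existing combined codes such as the `v_ect(1m):km/h` string above could be parsed mechanically into measure and unit; the helper below is hypothetical, assuming the ':' separator convention used in the common IMT table:

```python
# Hypothetical parser for combined 'measure:unit' imt codes.
# Splits on the LAST ':' so measures that themselves contain ':' (none
# known, but defensively) and parenthesised periods survive intact.

def split_imt(code):
    """Split a combined code like 'PGA:g' into (measure, unit);
    returns (code, None) when there is no unit component."""
    measure, sep, unit = code.rpartition(":")
    if not sep:
        return code, None
    return measure, unit

assert split_imt("PGA:g") == ("PGA", "g")
assert split_imt("v_ect(1m):km/h") == ("v_ect(1m)", "km/h")
assert split_imt("MMI") == ("MMI", None)  # unitless intensity scales
```

With separate fields, swapping km/h for knots is just a different unit value rather than a whole new code.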

odscjen commented 1 year ago

To summarise, it seems like we're agreeing on this suggestion:

Title             | Field name        | Description | Type   | Codelist
Intensity measure | intensity_measure | The measurement used in the dataset. This is typically a measurement of intensity but can also take other forms, e.g. spectral velocity or flood water depth. | object |
Measurement       | measurement       | The type of measurement. | string | im_code from IMT.csv
Intensity unit    | unit              | The unit of measurement. | string | unit from measurement_units.csv

And we update IMT.csv to e.g.

im_code | label                    | definition
PGA     | Peak ground acceleration | The maximum ground acceleration that occurred during earthquake shaking at the location.
SA      | Spectral acceleration    | The maximum acceleration in an earthquake on an object – specifically a damped, harmonic oscillator moving in one physical dimension.

(I've taken the definitions from Wikipedia; we'll need to work on definitions for all the codes, as IMT.csv is currently missing them.)

And create measurement_units.csv e.g.

unit | label                       | definition
g    | Acceleration due to gravity | The acceleration due to Earth's gravity.
m/s2 | Meters per second squared   | Acceleration in SI units, meters per second squared.

The only potential problem I see is that only certain units are appropriate for certain measurements which will make this object tricky to validate.

stufraser1 commented 1 year ago

Why are im_code and unit not in the same table? Then we could include the suitable combinations and they could be validated from that?

odscjen commented 1 year ago

Discussed this with @stufraser1 and agreed on the following:

Title             | Field name        | Description | Type   | Codelist
Intensity measure | intensity_measure | The measurement used in the dataset. This is typically a measurement of intensity but can also take other forms, e.g. spectral velocity or flood water depth. | string | IMT.csv

With IMT.csv being updated to have both the metric and unit in separate columns, e.g.

code     | metric | unit | title                    | definition
PGA:g    | PGA    | g    | Peak ground acceleration | The maximum ground acceleration that occurred during earthquake shaking at the location, in g.
PGA:m/s2 | PGA    | m/s2 | Peak ground acceleration | The maximum ground acceleration that occurred during earthquake shaking at the location, in meters per second squared.
fl_wd:m  | fl_wd  | m    | Flood water depth        | The maximum depth of flood waters in meters.

This will be an open codelist, and the documentation shall make clear that if users need to create their own code they should use a metric and a unit from the existing codelist as appropriate and follow the same pattern, i.e. 'metric:unit'.
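The agreed approach can be sketched end to end: load the (illustrative) two-column IMT.csv and classify an incoming code as standard, pattern-following, or entirely new. The rows and helper below are an assumption-laden example, not the real codelist:

```python
# Sketch of the agreed open-codelist approach: a combined IMT.csv with
# separate metric and unit columns, plus a classifier for new codes.
# The CSV rows here are illustrative examples only.

import csv
import io

IMT_CSV = """code,metric,unit,title
PGA:g,PGA,g,Peak ground acceleration
PGA:m/s2,PGA,m/s2,Peak ground acceleration
fl_wd:m,fl_wd,m,Flood water depth
"""

rows = list(csv.DictReader(io.StringIO(IMT_CSV)))
known_codes = {r["code"] for r in rows}
known_metrics = {r["metric"] for r in rows}
known_units = {r["unit"] for r in rows}

def classify(code):
    """'standard' if in IMT.csv; 'pattern' if it follows 'metric:unit'
    and reuses at least one known component; otherwise 'new'."""
    if code in known_codes:
        return "standard"
    metric, sep, unit = code.rpartition(":")
    if sep and (metric in known_metrics or unit in known_units):
        return "pattern"
    return "new"

assert classify("PGA:g") == "standard"
assert classify("fl_wd:cm") == "pattern"  # known metric, new unit
assert classify("av_pr:kPa") == "new"     # entirely new, still accepted
```

All three classes validate under the open codelist; the classification just tells maintainers which codes to consider adding during the data upload sprint.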