Closed matamadio closed 1 year ago
Moving to standard repo.
My suggestion is to use an open codelist for the field imt.
See https://standard.open-contracting.org/infrastructure/latest/en/reference/codelists/
This issue was anticipated and discussed at length in defining IMTs - however we perhaps did not anticipate the issue becoming such a problem so soon. The choice was made to define a list of most common IMTs to ensure vulnerability and hazard could match. With the single common reference to IMT the potential mismatch in codes becomes less of an issue, but we should be aware we may find more hazard data than anticipated having no matching V curve. In the envisaged function of matching E/H/V curves that may be compliant (i.e. use the same asset/hazard coding) this will become clear, but we may also want to communicate the availability of V curve for the same type of hazards, which do not exactly match the IMT (people can then choose to apply a conversion if applicable).
Exactly. In some cases a V function could be applied to a hazard even if the IMT does not match, using unit conversions. IMT should then not be the attribute that links the V and E schemas; hazard type/process should be.
@odscrachel @duncandewhurst advice here please, and let's include any change in the next version of the schema.
Nice to see the Open Contracting for Infrastructure Data Standard make an appearance here!
This sounds like a scenario where an open (representative) codelist would make more sense than a closed (comprehensive) codelist. If using an open codelist for intensity measure, I agree with @matamadio that it should not be the attribute on which to link vulnerability and event schemas. If the codelists for hazard type and process are closed (i.e. publishers can't add their own values), then linking based on those sounds good.
Propose to use an open list. Hazard and vulnerability datasets that align can be filtered on hazard and process type rather than imt, in the same way that they are filtered on occupancy category rather than the full taxonomy string.
I understand the motivation for this approach, but this further complicates the problem of finding appropriate vulnerability models for hazard + exposure. If I have earthquake hazard data with IMT PGA (g) that does not mean I can easily use a vulnerability curve using an IMT SA(0.3). Hazard and process type are necessary but not sufficient to select vulnerability models.
"Hazard and process type are necessary but not sufficient to select vulnerability models." True. The suggestion to filter on hazard type and process would present a wide selection of curves for the user to select from and put the task of imt selection on them. Providing an optional selection on imt may suffice?
So we have the following options?

1. imt codelist with tooling that filters on hazard type and hazard process.
   This would show all vulnerability/fragility curves for a hazard type/process, so the user would see curves with multiple imt, e.g. PGA(g), SA(0.3), etc.
   In this case we're giving users a wide range of results that they have to sift themselves. This may be an issue with lots of curves in a catalog, so the user may need an additional optional filter on imt to address https://github.com/GFDRR/rdl-standard/issues/5#issuecomment-822301000. The big concern previously discussed in the project with @pslh and others is that we have a complicated codelist combining a label and unit, which could be written in any number of ways. Data providers could reference seismic data with the pga metric as 'PGA(g)', 'PGA-g', 'pga_g', etc., which is why we defined the codelist in the first place.
   We've tried to define the vast majority of imt in that codelist but admit that there can be others. Snow avalanche wasn't one that the original project team really looked at, so while that was a gap, for FL, EQ, TS etc. the codelist should be much more complete.
2. imt codelist with tooling that filters on hazard type, hazard process and imt.
   This would let users directly select a smaller set of vulnerability/fragility curves appropriate for use with a given hazard dataset than in option 1 (showing curves for the selected hazard imt only).
   The imt filter in tooling could still be optional so users can choose to see a broader range of curves if they want to.

In either option, the onus would be on us to add imt codes as we uncover new ones in the 'data upload sprint' and to do targeted research to get as close as possible to 100%, and to guide users on how to structure new ones (it is likely naive to expect all users will follow this, but admins should follow it).

If we're doing this anyway to prevent later problems, do we just commit to this approach and use a closed list?
Option 2 makes more sense to me. It is true that you can't use a V model with a different imt than the hazard; sometimes this can be solved by a simple unit conversion of the hazard metric (e.g. cm to m), other times it is not possible. Giving the user the larger range of applicable datasets, with optional imt filtering, seems the safest option. But this is more related to the consuming application (filtering and sorting) than to the standard itself.
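To make the two tooling options concrete, here is a minimal sketch of a filter where imt is an optional extra criterion. The `Curve` record, the `filter_curves` function and the catalog entries are hypothetical and for illustration only; they are not part of the standard or of any existing tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Curve:
    """Hypothetical, simplified record for a vulnerability/fragility curve."""
    name: str
    hazard_type: str
    process_type: str
    imt: str


def filter_curves(
    curves: List[Curve],
    hazard_type: str,
    process_type: str,
    imt: Optional[str] = None,
) -> List[Curve]:
    """Match on hazard type and process; narrow by imt only when one is given."""
    matches = [
        c for c in curves
        if c.hazard_type == hazard_type and c.process_type == process_type
    ]
    if imt is not None:
        matches = [c for c in matches if c.imt == imt]
    return matches


# Illustrative catalog; names and the flood process code are placeholders.
catalog = [
    Curve("curve_a", "EQ", "QGM", "PGA:g"),
    Curve("curve_b", "EQ", "QGM", "SA(0.3):g"),
    Curve("curve_c", "FL", "FFF", "fl_wd:m"),
]

broad = filter_curves(catalog, "EQ", "QGM")                # option 1 behaviour
narrow = filter_curves(catalog, "EQ", "QGM", imt="PGA:g")  # option 2 behaviour
```

Leaving `imt` unset gives the broad result set of option 1, while passing it gives the narrower set of option 2, so a single filter can serve both.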
If you aren't confident that the IMT codelist is comprehensive, I would recommend going with an open codelist. Tooling can still implement a filter based on an open codelist and, as you suggest, guidance can be provided to publishers on how to structure new codes and on how to flag them to be considered for addition to the standardised codelist.
The problem with a closed codelist that isn't comprehensive is that if a publisher has a dataset that uses an IMT that isn't in the codelist, they have no option other than to omit the field, otherwise their RDLS metadata will be invalid. That means there will be some datasets without an IMT so users filtering on IMT will either miss those or have to actually open the datasets for which no IMT is listed in order to determine whether they are relevant.
Edit: correct typo
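A minimal sketch of the difference in validation behaviour, assuming a hypothetical `check_imt` helper and a small subset of codes. The value `snow_pressure:kPa` is made up here to stand in for a publisher-defined code; it is not in any codelist.

```python
# Subset of IMT codes for illustration; the real codelist is much longer.
IMT_CODES = {"PGA:g", "PGA:m/s2", "SA(0.2):g", "SA(0.2):m/s2"}


def check_imt(value, codelist=IMT_CODES, closed=True):
    """Return (is_valid, note) for an imt value under a closed or open codelist."""
    if value in codelist:
        return True, "standard code"
    if closed:
        # Closed codelist: the publisher's only valid choice is to omit the field.
        return False, "not in closed codelist"
    # Open codelist: accept the value, but flag it for possible standardisation.
    return True, "non-standard code; flag for review"


closed_result = check_imt("snow_pressure:kPa", closed=True)
open_result = check_imt("snow_pressure:kPa", closed=False)
```

Under the closed list the made-up code is rejected, while under the open list it is kept and can still be used for filtering.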
If you aren't confident that the IMT codelist is comprehensive, I would recommend going with an open codelist. [...]
The problem with a closed codelist that isn't comprehensive is that if a publisher has a dataset that uses an IMT that isn't in the codelist, they have no option other than to omit the field, otherwise their RDLS metadata will be invalid. That means there will be some datasets without an IMT so users filtering on IMT will either miss those or have to actually open the datasets for which no IMT is listed in order to determine whether they are relevant.
In the event that someone wished to contribute a dataset for which the existing IMT codelist was really not applicable I would prefer to have a new IMT value that at least gives us some clue as to what they think they need rather than the alternatives (NULL IMT, a random choice of wrong IMT or no contribution). I think this means I agree with the open codelist approach. I think this also means we are going to have to think about how we help the community match e.g. hazard to vulnerability: I wonder if Tiziana and others have ideas, guidance for this problem.
Edit by @duncandewhurst: correcting a typo in the quote. Looks like Paul understood what I meant anyway :-)
Let us continue with the open codelist approach then. Noted to canvass views on guiding users to match hazard and vulnerability. Moving to agreed and ready.
Did you consider splitting im code into intensity measure and unit as separate fields? It seems to me there is a natural division there which would give people the freedom to specify an alternative unit for the same IM without having to make up a new code.
There was some discussion around this. I think at the time we were concerned about e.g. matching a vuln curve in m/s2 (or ft/s2 or whatever) to hazard in g, and there being existing contributions in both. The Unified Challenge Fund DB requires users to select from a (fixed) list of IMT codes, but the common IMT table does indeed separate im_code from units:
 process_code | hazard_code | im_code      | description                                                  | units
--------------+-------------+--------------+--------------------------------------------------------------+-------
 QGM          | EQ          | PGA:g        | Peak ground acceleration in g                                | g
 QGM          | EQ          | PGA:m/s2     | Peak ground acceleration in m/s2 (meters per second squared) | m/s2
 ...
 QGM          | EQ          | SA(0.2):g    | Spectral acceleration with 0.2s period                       | g
 QGM          | EQ          | SA(0.2):m/s2 | Spectral acceleration with 0.2s period                       | m/s2
This seems to me to be once again a case of "vulnerability matching" vs "flexibility in contribution"; if we are prioritizing ease of contribution perhaps splitting units makes sense.
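As a rough illustration of the matching concern, the sketch below converts a hazard intensity into the unit a curve expects when the measure is the same and only the unit differs. The function names and the conversion table are assumptions made for this example; the only physical constant used is standard gravity (9.80665 m/s2).

```python
STANDARD_GRAVITY = 9.80665  # m/s2 per g

# Hypothetical conversion factors from each acceleration unit to m/s2.
TO_M_S2 = {"g": STANDARD_GRAVITY, "m/s2": 1.0}


def convert_acceleration(value, from_unit, to_unit):
    """Convert an acceleration between the units listed in TO_M_S2."""
    return value * TO_M_S2[from_unit] / TO_M_S2[to_unit]


def align_intensity(value, hazard_code, curve_code):
    """Express a hazard intensity in the curve's unit if the measures match.

    Codes are assumed to follow the 'measure:unit' shape of the im_code column.
    """
    hazard_measure, hazard_unit = hazard_code.split(":", 1)
    curve_measure, curve_unit = curve_code.split(":", 1)
    if hazard_measure != curve_measure:
        # e.g. PGA vs SA(0.2): a unit conversion alone cannot bridge these.
        raise ValueError(f"{hazard_measure} cannot be converted to {curve_measure}")
    return convert_acceleration(value, hazard_unit, curve_unit)


pga_m_s2 = align_intensity(0.5, "PGA:g", "PGA:m/s2")
```

Here a hazard value of 0.5 g is expressed in m/s2 for a curve defined on `PGA:m/s2`, whereas pairing `PGA:g` with `SA(0.2):g` raises an error because the measures differ.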
No objections to splitting out the unit from my perspective. @johcarter's point is a good one!
The thought occurred when I was trying to fit PiWind into the hazard data JSON format. When describing the resource file I realised that windspeed was measured in knots, not km/h, and there was no suitable im_code in the list.
    "resources": [
        {
            "name": "footprint.bin.z",
            "path": "footprint.bin.z",
            "title": "Event footprint file",
            "mediatype": "custom binary",
            "imt": "v_ect(1m):km/h",
            "data_uncertainty": ""
        },
I would support splitting out unit into a separate field to add flexibility.
To summarise, it seems like we're agreeing on this suggestion:
Title | Field name | Description | Type | codelist |
---|---|---|---|---|
Intensity measure | intensity_measure | The measurement used in the dataset. This is typically a measurement of intensity but can also take other forms, e.g. spectral velocity or flood water depth. | object | |
Measurement | measurement | The type of measurement. | string | im_code from IMT.csv |
Intensity unit | unit | The unit of measurement. | string | unit from measurement_units.csv |
And we update IMT.csv to e.g.
im_code | label | definition |
---|---|---|
PGA | Peak ground acceleration | The maximum ground acceleration that occurred during earthquake shaking at the location. |
SA | Spectral acceleration | The maximum acceleration in an earthquake on an object – specifically a damped, harmonic oscillator moving in one physical dimension. |
(I've taken the definitions from wikipedia, we'll need to work on definitions for all the codes as IMT.csv is currently missing them.)
And create measurement_units.csv, e.g.
unit | label | definition |
---|---|---|
g | Acceleration due to gravity | The acceleration due to Earth's gravity. |
m/s2 | Meters per second squared | Acceleration in SI units, meters per second squared. |
The only potential problem I see is that only certain units are appropriate for certain measurements which will make this object tricky to validate.
Why are im_code and unit not in the same table? Then we could include the suitable combinations and they could be validated from that?
Discussed this with @stufraser1 and agreed on the following:
Title | Field name | Description | Type | codelist |
---|---|---|---|---|
Intensity measure | intensity_measure | The measurement used in the dataset. This is typically a measurement of intensity but can also take other forms, e.g. spectral velocity or flood water depth. | string | IMT.csv |
With IMT.csv being updated to have both the metric and unit in separate columns, e.g.
code | metric | unit | title | definition |
---|---|---|---|---|
PGA:g | PGA | g | Peak ground acceleration | The maximum ground acceleration that occurred during earthquake shaking at the location in g. |
PGA:m/s2 | PGA | m/s2 | Peak ground acceleration | The maximum ground acceleration that occurred during earthquake shaking at the location in meters per second squared. |
fl_wd:m | fl_wd | m | Flood water depth | The maximum depth of flood waters in meters. |
This will be an open codelist, and the documentation shall make clear that if a user needs to create their own code they should use a metric and a unit from the existing codelist as appropriate and follow the same pattern, i.e. 'metric:unit'.
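A minimal sketch of how tooling might check a publisher-supplied code against the 'metric:unit' pattern. The `classify_code` function and its return shape are hypothetical, and the `CODELIST` mapping holds only the three example rows from the table above.

```python
# Example rows only: code -> (metric, unit), mirroring the IMT.csv columns.
CODELIST = {
    "PGA:g": ("PGA", "g"),
    "PGA:m/s2": ("PGA", "m/s2"),
    "fl_wd:m": ("fl_wd", "m"),
}


def classify_code(code, codelist=CODELIST):
    """Return (status, metric_known, unit_known) for an intensity measure code.

    status is 'standard' for a code in the codelist, 'new' for a well-formed
    'metric:unit' code that is not in the codelist, and 'malformed' otherwise.
    """
    if code in codelist:
        return "standard", True, True
    metric, sep, unit = code.partition(":")
    if not sep or not metric or not unit:
        return "malformed", False, False
    known_metrics = {m for m, _ in codelist.values()}
    known_units = {u for _, u in codelist.values()}
    return "new", metric in known_metrics, unit in known_units


print(classify_code("fl_wd:cm"))  # reuses a known metric with a new unit
print(classify_code("pga_g"))     # does not follow 'metric:unit'
```

Note that a check like this only looks at the shape of the code and at whether its parts are already known. A code such as `PGA:m` would pass as 'new' with both parts known, so it does not address the earlier point that only certain units are appropriate for certain measurements.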
We have a big list of possible imt (intensity measures) for hazard processes. However, we already have some datasets that are not compliant:
http://jkan.riskdatalibrary.org/datasets/hzd-afg-ls-lav/ This one measures snow avalanche pressure in kilopascals (kPa) (I had to put Debris-flow intensity index instead)
http://jkan.riskdatalibrary.org/datasets/hzd-afg-dr/ This one measures water availability as a percentage of total demand (I had to put SPI instead)
In general, I suggest we think of some other approach for this; trying to anticipate all possible units of measure looks utopian.