[Proposal] Resource file for clustering/seasonality

GFDRR / rdl-standard

The Risk Data Library Standard (RDLS) is an open data standard to make it easier to work with disaster and climate risk data. It provides a common description of the data used and produced in risk assessments, including hazard, exposure, vulnerability, and modelled loss, or impact, data.

https://docs.riskdatalibrary.org/

Creative Commons Attribution Share Alike 4.0 International


[Proposal] Resource file for clustering/seasonality #81

Closed stufraser1 closed 1 year ago

stufraser1 commented 1 year ago

Consider adding a metadata object that describes the seasonality/clustering of events.

Seasonality and the clustering of multiple events in time are important in event frequency distributions, and the return period / event rate info does not capture them. One of my suggestions for capturing this in the upcoming ODS/RDL alignment project I am working on with Stu and co will be an extra resource file: a list of event occurrences across a span of years. This captures the seasonality and clustering aspect of event frequency within each year. Also, stochastic event catalogues in cat models are too large to be listed in metadata.

Originally posted by @johcarter in https://github.com/GFDRR/rdl-standard/issues/59#issuecomment-1559690009

johcarter commented 1 year ago

Attached is the standard format of an 'Occurrence' file in ODS which specifies a list of event occurrences with assigned Period (an integer representing a year) and date fields.

ODS field names: EventId, Period, Year, Month, Day
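A minimal sketch of how such a file could be read to summarise seasonality (events per month) and clustering (events per period). The sample rows are made up for illustration and are not taken from the attached file; the function name `summarise_occurrences` is hypothetical, and only the five ODS field names listed above are assumed.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows using the ODS field names listed above.
SAMPLE = """EventId,Period,Year,Month,Day
101,1,1991,8,14
102,1,1991,9,2
103,3,1993,9,20
104,7,1997,1,5
"""


def summarise_occurrences(text):
    """Return (events per month, events per period) from occurrence CSV text."""
    by_month = Counter()
    by_period = Counter()
    for row in csv.DictReader(io.StringIO(text)):
        by_month[int(row["Month"])] += 1    # seasonality within the year
        by_period[int(row["Period"])] += 1  # clustering within a period
    return by_month, by_period


by_month, by_period = summarise_occurrences(SAMPLE)
print(dict(by_month))
print(dict(by_period))
```

Periods with no events (for example Period 2 in the sample) simply do not appear in the counts.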

There may be more than one resource file per event set, each representing a different scenario of event frequency/clustering/seasonality. Therefore an id and description field for each resource file would be useful in metadata.

The total number of periods per occurrence file is also needed in metadata in order to derive loss metrics. This is because periods with no event occurrences do not appear in the file, so the overall range of periods covered is not clear from the file alone. In ODS this is a metadata field called 'NumberOfPeriods'.

E.g.:
For stochastic event sets, with Period in the range 1 to 10000, NumberOfPeriods = 10000.
For historical event sets, with Period in the range 1951 to 2000, NumberOfPeriods = 50. (The Period range for historical event sets may also be 1 to 50, with the 'Year' field holding the real year; what matters is that the correct span of years is represented for annual loss metrics.)

occurrence_lt.csv
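To illustrate why NumberOfPeriods cannot be recovered from the file alone, here is a small sketch with made-up period values. The function names are hypothetical and the metrics are simple frequency estimates, not the ODS loss calculations.

```python
def annual_event_rate(periods, number_of_periods):
    """Mean number of events per period, counting empty periods too."""
    if number_of_periods <= 0:
        raise ValueError("number_of_periods must be positive")
    return len(periods) / number_of_periods


def prob_any_event(periods, number_of_periods):
    """Fraction of periods that contain at least one event."""
    if number_of_periods <= 0:
        raise ValueError("number_of_periods must be positive")
    return len(set(periods)) / number_of_periods


# Hypothetical Period column: four events falling in three distinct periods.
periods = [1, 1, 3, 7]

rate = annual_event_rate(periods, number_of_periods=10)
prob = prob_any_event(periods, number_of_periods=10)

# Dividing by the periods that happen to appear in the file ignores the
# empty periods and overstates the rate.
naive_rate = len(periods) / len(set(periods))

print(rate, prob, naive_rate)
```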

stufraser1 commented 1 year ago

This is covered already by event_set.time.span:

Title	Field name	Description	Type
Event set time	event_set.time	The modelled scenario may have a known start date, end date, duration, or reference year to which it refers. In some cases, not all of these fields will have known or relevant values.	object
Event set start time	event_set.time.start	The earliest event start time covered by the modelled scenario(s) contained in the event set.	date-time
Event set end time	event_set.time.end	The latest event end time covered by the modelled scenario(s) contained in the event set.	date-time
Event set time span	event_set.time.span	The time period covered by the modelled scenario(s) included in the event set.	string
Event set reference year	event_set.time.year	A general reference year to which the modelled scenario(s) refers (e.g. '2050').	string

It is a valid question whether event_set.time.span should be renamed event_set.time.period:

Title	Field name	Description	Type
Event set time period	event_set.time.period	The time period covered by the modelled scenario(s) included in the event set.	string

johcarter commented 1 year ago

Yes this would work and I'm indifferent to time.span versus time.period.

The ODS 'NumberOfPeriods' would go into time.span, and for a stochastic event set, the time.year could be 1 indicating the earliest Period in the occurrence file.

duncandewhurst commented 1 year ago

To align with https://github.com/GFDRR/rdl-standard/issues/54, https://github.com/GFDRR/rdl-standard/issues/67 and DCAT, I think the field should be named 'temporal'. If possible, we should reuse the modelling too, although we can add fields if needed.

> for a stochastic event set, the time.year could be 1 indicating the earliest Period in the occurrence file.

Is this the case where the earliest period in the occurrence file is actually 1AD? If not, what does '1' represent?

A couple more questions:

1. In occurrence_lt.csv, is the year column intended to represent a calendar year (e.g. 2023) and it is populated with 1, 2, 3 etc. because it is dummy data? Or is this what stochastic data actually looks like, i.e. the year column is populated with a count of years without reference to an actual calendar year?
2. What is the distinction between event_set.time.span (The time period covered by the modelled scenario(s) included in the event set.) and event_set.time.start and event_set.time.end (The earliest event start time and latest event end time in the event set)? It seems like they are semantically equivalent.
3. Should event_set.time.year be an array of years to allow for event sets that span more than one calendar year?

johcarter commented 1 year ago

| Is this the case where the earliest period in the occurrence file is actually 1AD? If not, what does '1' represent?

For stochastic event sets, each period represents a possible sequence of events representing the near-term risk, i.e. what could happen over the next year. It's therefore not appropriate to relate them to a historical date, or to start from today's date and extend into the future. And there can be hundreds of thousands of years covered. The time span is needed to specify the total number of periods covered in order to calculate relative frequency/likelihood for outputs, but the start period is simply

1. Yes, Year is intended for the real calendar year (and there is none in this case), whereas Period is the index in the range 1 to N.
2. time.span is needed when it is not appropriate to assign real dates to the period, as in the case of a stochastic event set. Otherwise time.span can indeed be derived from time.end - time.start, although that might not be an integer, whereas the number of periods in ODS is an integer.
3. In my view it is useful as currently described and I can't see a case for it being an array.
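A quick sketch of the point about deriving time.span from time.end - time.start: with real dates the difference is generally a fractional number of years, while a period count is an integer. The dates below are made up, the 365.25-day year is a simplifying assumption, and the function names are hypothetical.

```python
from datetime import date


def fractional_years(start, end):
    """Elapsed time between two dates in years, assuming a 365.25-day year."""
    return (end - start).days / 365.25


def calendar_years_covered(start, end):
    """Number of calendar years touched by the interval, inclusive."""
    return end.year - start.year + 1


# Made-up first and last event dates for a historical event set.
start = date(1991, 4, 29)
end = date(2019, 11, 9)

elapsed = fractional_years(start, end)
covered = calendar_years_covered(start, end)

print(round(elapsed, 2))  # a non-integer number of years
print(covered)            # 29
```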

johcarter commented 1 year ago

Here is an example for historical cyclones in Bangladesh since 1991, which does have real calendar dates.

hc_oasis_occurrence.csv

duncandewhurst commented 1 year ago

Thanks for the clarifications!

Based on that, we can reuse the modelling proposed in #67. However, I would replace span with duration and use the ISO 8601 duration format, e.g. P50Y for a stochastic event set covering 50 years without reference to specific calendar dates.
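As a sketch of how an integer period count (such as the ODS NumberOfPeriods) could round-trip through a years-only ISO 8601 duration string: the helper names are hypothetical, and only the `PnY` form is handled, not the full ISO 8601 duration grammar.

```python
import re

# Matches years-only durations such as "P50Y"; other ISO 8601 forms
# (months, days, time components) are deliberately out of scope here.
_YEARS_ONLY = re.compile(r"^P(\d+)Y$")


def years_to_duration(number_of_periods):
    """Format an integer number of years as a duration string, e.g. 50 -> 'P50Y'."""
    if number_of_periods < 0:
        raise ValueError("number_of_periods must not be negative")
    return f"P{number_of_periods}Y"


def duration_to_years(duration):
    """Parse a years-only duration string back to an integer."""
    match = _YEARS_ONLY.match(duration)
    if match is None:
        raise ValueError(f"not a years-only duration: {duration!r}")
    return int(match.group(1))


print(years_to_duration(50))         # P50Y
print(duration_to_years("P10000Y"))  # 10000
```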

> 3. In my view it is useful as currently described and can't see a case for it being an array.

How would you populate year for the Bangladesh example, which covers 1991 to 2019?

Follow up question on the Bangladesh example to make sure I'm understanding things correctly: Why do the early rows conform to Period being an integer representing a year (per https://github.com/GFDRR/rdl-standard/issues/81#issuecomment-1583332508) but later rows don't?

PERIOD_NO	OCC_YEAR
1	1991
5	1995
7	1997
17	2007
17	2007
18	2008
19	2009

PERIOD_NO	OCC_YEAR
87	2019
88	1991
92	1995
94	1997
104	2007

johcarter commented 1 year ago

> How would you populate year for the Bangladesh example, which covers 1991 to 2019?

I would use the values 1991 to 2019 in the Year field, but 1 to 29 in the Period field.

> Follow up question on the Bangladesh example to make sure I'm understanding things correctly: Why do the early rows conform to Period being an integer representing a year (per https://github.com/GFDRR/rdl-standard/issues/81#issuecomment-1583332508) but later rows don't?

(Sorry for the formatting, I don't know all the shortcuts.) Sorry, the file provided was a bad example. It is a historical ensemble: an extended set of scenarios of how the historical events might have played out differently had they started in different sea conditions (9 different versions of each). Hence we turned a 29-year historical period into 261 years to explore those different potential outcomes. A purely historical event occurrence set would look how you would expect, with Period starting at 1 and ending at 29. Please see the attached example.

hc_oasis_occurrence_historical.csv
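A rough sketch of how the ensemble numbering appears to work, inferred only from the PERIOD_NO / OCC_YEAR rows quoted earlier in the thread: 29 years times 9 versions gives 261 periods, and the year seems to repeat every 29 periods. The modular layout and the function names are assumptions, not something confirmed by the file's authors.

```python
START_YEAR = 1991
YEARS_PER_MEMBER = 29  # 1991 to 2019 inclusive
MEMBERS = 9            # versions of each historical event
TOTAL_PERIODS = YEARS_PER_MEMBER * MEMBERS  # 261


def period_to_year(period):
    """Map an ensemble period number to its calendar year (assumed layout)."""
    if not 1 <= period <= TOTAL_PERIODS:
        raise ValueError(f"period out of range: {period}")
    return START_YEAR + (period - 1) % YEARS_PER_MEMBER


def period_to_member(period):
    """Map an ensemble period number to its 1-based ensemble member (assumed)."""
    if not 1 <= period <= TOTAL_PERIODS:
        raise ValueError(f"period out of range: {period}")
    return (period - 1) // YEARS_PER_MEMBER + 1


# Rows quoted in the thread, as (PERIOD_NO, OCC_YEAR) pairs.
quoted_rows = [
    (1, 1991), (5, 1995), (7, 1997), (17, 2007), (18, 2008), (19, 2009),
    (87, 2019), (88, 1991), (92, 1995), (94, 1997), (104, 2007),
]
consistent = all(period_to_year(p) == y for p, y in quoted_rows)
print(TOTAL_PERIODS, consistent)
```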

odscjen commented 1 year ago

> I would use the values 1991 to 2019 in the Year field, but 1 to 29 in the Period field

It wouldn't be appropriate to use year in this case as there isn't a single reference year in the event_set.

Based on the suggestion in https://github.com/GFDRR/rdl-standard/issues/81#issuecomment-1598049005 (and incorporating other accepted changes from other issues) this would actually be:

{
  "event_set": {
    "temporal": {
      "start": "1991",
      "end": "2019",
      "duration": "P29Y"
    }
  }
}

For a stochastic event_set with no 'real' dates, duration would be the only field used from temporal.
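A minimal sketch of how the two cases could be populated, assuming the temporal object has only the start, end and duration fields shown above. The function name `build_temporal` is hypothetical and this is not part of the RDLS schema or tooling.

```python
import json


def build_temporal(number_of_periods, start_year=None, end_year=None):
    """Build a temporal object; start/end are included only when both are given."""
    temporal = {}
    if start_year is not None and end_year is not None:
        temporal["start"] = str(start_year)
        temporal["end"] = str(end_year)
    # Duration as a years-only ISO 8601 string, e.g. 29 -> "P29Y".
    temporal["duration"] = f"P{number_of_periods}Y"
    return temporal


# Historical event set with real calendar years (the Bangladesh example).
historical = {"event_set": {"temporal": build_temporal(29, 1991, 2019)}}

# Stochastic event set: no real dates, so duration is the only field.
stochastic = {"event_set": {"temporal": build_temporal(10000)}}

print(json.dumps(historical, indent=2))
print(json.dumps(stochastic, indent=2))
```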

@johcarter @stufraser1 are you both happy with using event_set.temporal, as detailed in https://github.com/GFDRR/rdl-standard/issues/67#issuecomment-1596845345, to address this at the event_set level? Note that the same object also appears in resources.temporal to cover the period information for individual resources, addressing:

> There may be more than one resource file representing different scenarios of event frequency/clustering/seasonality per event set. Therefore an id and description field for each resource file would be useful in meta data.