cf-convention / vocabularies

Issues and source files for CF controlled vocabularies

Standard names: *_threshold, allow for percentile based thresholds #19

Open zklaus opened 3 years ago

zklaus commented 3 years ago

Introduction

This issue describes a proposed change to the description text of existing, threshold-based standard names. It is the result of a number of discussions, most recently at the 2021 CF Conventions, climate index breakout group.

To allow for concrete discussions, the proposed change is first discussed as a concrete example. As such, it is based on the following current definition.

Changelog

This changelog is intended to allow for quickly catching up. If you are new to the issue or are coming back to it after some time, this summary should give you the most important information and you need to start reading only after the last comment mentioned in the following table.

Please let me know if you feel the table does not reflect the consensus appropriately!

| Date of update | Discussion up to and including | Main changes |
| --- | --- | --- |
| 2021-10-11 | https://github.com/cf-convention/vocabularies/issues/19 | Replace percentile(_of_X) with cumulative_probability_of_X |

Current Definition

number_of_days_with_air_temperature_above_threshold

Air temperature is the bulk temperature of the air, not the surface (skin) temperature. A variable whose standard name has the form number_of_days_with_X_below|above_threshold is a count of the number of days on which the condition X_below|above_threshold is satisfied. It must have a coordinate variable or scalar coordinate variable with the standard name of X to supply the threshold(s). It must have a climatological time variable, and a cell_methods entry for within days which describes the processing of quantity X before the threshold is applied. A number_of_days is an extensive quantity in time, and the cell_methods entry for over days should be "sum".

Proposed Definition

In the following proposed definition, the first paragraph is unchanged except for the removal of the sentence about the threshold coordinate variable, which is in its modified form in the second paragraph.

number_of_days_with_air_temperature_above_threshold

Air temperature is the bulk temperature of the air, not the surface (skin) temperature. A variable whose standard name has the form number_of_days_with_X_below|above_threshold is a count of the number of days on which the condition X_below|above_threshold is satisfied. It must have a climatological time variable, and a cell_methods entry for within days which describes the processing of quantity X before the threshold is applied. A number_of_days is an extensive quantity in time, and the cell_methods entry for over days should be "sum".

It must give information about the threshold in one or both of the following two ways: with an explicit threshold in a coordinate variable or scalar coordinate variable with the standard name of X, or with a percentile threshold given in a scalar coordinate variable with the standard name cumulative_probability_of_X.
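
As a rough illustration of the counting this definition describes, the following Python sketch first reduces each day to a single value ("maximum within days") and then counts the days that exceed the threshold ("sum over days"). The function name `count_days_above`, the data layout, and the sample values are assumptions made up for this sketch; they are not part of CF.

```python
def count_days_above(samples, threshold):
    """Count the days whose daily maximum exceeds the threshold.

    samples: iterable of (day_label, value) pairs, several per day.
    """
    daily_max = {}
    for day, value in samples:
        # "maximum within days": reduce each day to one value first
        if day not in daily_max or value > daily_max[day]:
            daily_max[day] = value
    # "sum over days": add 1 for every day that satisfies the condition
    return sum(1 for value in daily_max.values() if value > threshold)


# Made-up sub-daily air temperatures in K, two samples per day.
samples = [
    ("1951-01-01", 271.0), ("1951-01-01", 275.5),
    ("1951-01-02", 272.0), ("1951-01-02", 273.0),
    ("1951-01-03", 276.0), ("1951-01-03", 270.0),
]
n_days = count_days_above(samples, threshold=274.0)
print(n_days)  # 2
```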

Implied Changes

The proposed definition given above requires the addition of new standard names, cumulative_probability_of_X. The proposed standard for this is:

- Term: cumulative_probability_of_air_temperature
- Description: A probability percentile.
- Units: % (canonical units: 1)
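
As a small illustration of the units entry above, the same cumulative probability can be written either in % or in the canonical unit 1. The helper names in this Python sketch are hypothetical and do not come from CF or any library; the sketch only shows the arithmetic.

```python
def percent_to_fraction(value_percent):
    """Convert a cumulative probability given in % to the canonical unit 1."""
    if not 0.0 <= value_percent <= 100.0:
        raise ValueError("a cumulative probability in % must lie in [0, 100]")
    return value_percent / 100.0


def fraction_to_percent(value_fraction):
    """Convert a cumulative probability given in unit 1 to %."""
    if not 0.0 <= value_fraction <= 1.0:
        raise ValueError("a cumulative probability in unit 1 must lie in [0, 1]")
    return value_fraction * 100.0


threshold_percent = 95.0  # a value such as percentile_threshold=95. with units="%"
threshold_fraction = percent_to_fraction(threshold_percent)
```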

Examples

Example 1: Only percentile threshold

This example aims to be as close to CF Conventions 1.9, Example 7.12 as possible, while still introducing the concept of percentile threshold.

It differs in the following ways:

Example 2: Only percentile threshold, timeseries

This example follows on the heels of Example 1. The only change is that here we are talking about a longer timeseries, where we are giving the number of days above a threshold per year for several years running.

variables:
  float n1(time, lat, lon);
    n1:standard_name="number_of_days_with_air_temperature_above_threshold";
    n1:coordinates="percentile_threshold";
    n1:cell_methods="time: maximum within days time: sum over days";
  float percentile_threshold;
    percentile_threshold:standard_name="cumulative_probability_of_air_temperature";
    percentile_threshold:units="%";
  double time(time);
    time:bounds="time_bounds";
    time:units="days since 1951-6-1";
  double time_bounds(time, nv);

data: // time coordinates translated to date/time format
  percentile_threshold=95.;
  time="1951-7-1",
       "1952-7-1",
       ...,
       "2000-7-1",
       "2001-7-1";
  time_bounds="1951-1-1", "1952-1-1",
             "1952-1-1", "1953-1-1",
             ...,
             "2000-1-1", "2001-1-1",
             "2001-1-1", "2002-1-1";

Example 3: Percentile and numerical threshold

The following example contains data that has been computed for a threshold derived from the percentile of a climatology.

data: // time coordinates translated to date/time format
  percentile_threshold=95.;
  time="1951-7-1",
       "1952-7-1",
       ...,
       "2000-7-1",
       "2001-7-1";
  time_bounds="1951-1-1", "1952-1-1",
              "1952-1-1", "1953-1-1",
              ...,
              "2000-1-1", "2001-1-1",
              "2001-1-1", "2002-1-1";
  reference_time="1985-1-1",
                 "1985-1-2",
                 "1985-1-3",
                 ....
                 "1985-12-29",
                 "1985-12-30",
                 "1985-12-31";
  reference_time_bounds="1960-12-30", "1990-1-3",
                        "1960-12-31", "1990-1-4",
                        "1961-1-1", "1990-1-5",
                        ....
                        "1961-12-27", "1990-12-31",
                        "1961-12-28", "1991-1-1",
                        "1961-12-29", "1991-1-2",



Date: 2021-09-23
zklaus commented 3 years ago

Pinging participants of the CF 2021 Conventions Discussion that expressed interest: @bzah, @jesusff, @japamment, @larsbarring.

JonathanGregory commented 3 years ago

Dear @zklaus and others

Thanks for this. It seems fine to me to allow a threshold to be specified as a percentile, but I wonder whether there should be a different standard name percentile_of_X for each X. It seems asymmetric that you specify X when the threshold is in the units of X but not when it's a percentile. I realise that all percentiles are in the same unit, i.e. dimensionless, but that in itself isn't a reason to give them all the same standard name, because

Best wishes

Jonathan

zklaus commented 3 years ago

Dear @JonathanGregory, thanks for your comments. I am certainly open to the introduction of more specific percentile_of_X standard names. I see no downside and it would indeed provide the only encoding of the link between X and the corresponding percentile. I don't feel strongly about this, so if others disagree, let's discuss. For now, I will adapt the proposal accordingly.

larsbarring commented 3 years ago

Dear all,

First, with reference back to the conversation during the CF2021 workshop breakout discussion, I would like to clarify that by percentile we here mean the numeric values in the range [0, 100] (inclusive), sometimes called "percentile probability".

Now over to @JonathanGregory's comment: I wonder if we actually do need to have a different standard name for each X, i.e. percentile_of_X?

There is nothing at all specific about a percentile_of_air_temperature that sets it apart from a percentile_of_lwe_thickness_of_precipitation_amount; both are just numeric values between 0 and 100. A percentile variable is just an auxiliary coordinate used as a "helper" -- here as threshold -- to produce the main data variable. As such it is different from both probability_distribution_of_X and histogram_of_X, which both can be seen as the main information-carrying variable or end product. In addition, with few exceptions it is not meaningful to compare (or otherwise relate) probability_distribution_of_X to probability_distribution_of_Y if X and Y are different standard names (variables).

Finally, with reference to your third point, indeed there are use-cases for having multiple percentiles associated to different variables. Sometimes (often in fact) it is the same percentile value (e.g. 25 or 75), in which case one percentile variable will be enough (often actually meaningful in its context, else at least succinct). If different percentile values are needed, two auxiliary coordinates are required, each one having its own [well chosen] variable name. But they can still have the same standard name percentile without risk of confusion.

JonathanGregory commented 3 years ago

Dear @larsbarring

Regarding:

There are use-cases for having multiple percentiles associated to different variables. ... If different percentile values are needed two auxiliary coordinates are required, each one having its own [well chosen] variable name. But they can still have the same standard name percentile without risk of confusion.

I think there would be risk of confusion. Variable names are arbitrary and meaningless in CF; some of the ways in which CF data are stored do not preserve variable names. I think that if you present a program with two coordinate variables that have the same standard name and units, it will cause some problem; disambiguating them would rely on some other non-standardised attribute like long_name. Why not distinguish them by standard name? They evidently must have different meanings.

In your use-cases, the percentile is a coordinate variable, but it might become a data variable. It is conceivable that you could have a latitude-longitude field of temperature percentile corresponding to a specified temperature value, for example. I think you would want to identify the field as percentile_of_air_temperature in that case, not just percentile.

It's true that percentiles are just numbers between 0 and 100. But sea ice fraction and cloud fraction are just numbers between 0 and 1, yet they are not the same quantity.

Best wishes

Jonathan

larsbarring commented 3 years ago

Dear @JonathanGregory

Regarding your first point, I both agree and don't. I agree that variable names are arbitrary and meaningless in CF, and that relying on long_name or other attributes is likely to be fragile. But what I meant with "variable name" is more accurately described by the data model and specifically the arrow and constructs that link the data variable to its coordinate variables. In netCDF it is the variable names that do this. I do not think that the standard name necessarily fulfills the criterion to uniquely link the data variable to its coordinates. Just consider a file holding climate index data (as per Klaus's example above) based on two input variables, e.g. 2m air temperature and air temperature at pressure levels. The index for the 2m temperature is based on the 10th percentile, and the upper air index is based on the 5th percentile. Same standard name for input data air_temperature, same standard name for the thresholds percentile_of_air_temperature and same standard name for the indices. This [theoretical] example shows that we cannot rely on the standard name either.

Regarding your second point, it is a different use for this standard name compared to what we are now discussing, and I do not know of such a use case. But I take your point and agree that in principle this could happen and be of interest.

On your third point, I think that it depends on what you mean by quantity. One way to look at it is that the quantity is the fraction of the area covered by X, and what X is is another matter.

But, finally, having now aired some arguments against percentile_of_X and indirectly for percentiles I do not want to turn this into something that delays or stalls progress, so I can accept your view in favour of the former.

Kind regards, Lars

sethmcg commented 3 years ago

I have a use case for percentiles as a data variable! We are currently working on a project looking at fire indexes calculated from climate model outputs, and we have found that looking at the percentiles of the fire indexes as a spatial field is useful. If we want to publish that data in CF-compliant format, we will need a percentile_of_X type standard name. We've run into similar cases working with snowpack, heatwaves, drought, extreme precipitation, and other such variables important to climate impacts.

(I do think it would be valuable to consider a mechanism for defining standard names that adhere to a formula like X_of_Y_in_Z in an automatic and implicit way rather than doing it explicitly, but that's a separate topic.)

larsbarring commented 3 years ago

@sethmcg Thanks for your use case. Then percentile_of_X it is.

JonathanGregory commented 3 years ago

Dear @larsbarring

Thanks for your flexibility. You are right that the main point is that I support percentiles! I also agree with you that the standard name is not always sufficient metadata, but I think it helps to provide whatever metadata we can conveniently do, within the framework of the conventions.

Best wishes

Jonathan

larsbarring commented 3 years ago

OK, I will open a new issue for discussing in more detail the new standard names percentile_of_X, and we can here continue the main discussion on adding percentile thresholds to the existing standard names.

jesusff commented 3 years ago

Dear all, I think we are introducing some confusion here due to the use of the word "percentile" to refer to the probability associated to the percentile (@larsbarring mentioned this already but the discussion went on; we also discussed this during the CF workshop BOG). The percentile_of_X would have the units of X. I'm afraid the use-case put forward by @sethmcg refers to quantities such as FWI90 (the 90th percentile of the Fire Weather Index), which is a FWI value. A completely different scenario was mentioned by @JonathanGregory above:

It is conceivable that you could have a latitude-longitude field of temperature percentile corresponding to a specified temperature value, for example. I think you would want to identify the field as percentile_of_air_temperature in that case, not just percentile.

Here, a value of temperature would be fixed (say 20 degC) and the field would be percentile probabilities corresponding to that value in each point. This is why I think we should re-consider including the word probability in the standard name. I would just name it "percentile_probability" or, better, "quantile_probability" (to avoid future requests to have standard names for other particular quantile names such as quartiles or terciles, often used in other applications). Also, this would better comply with other standard name definitions for non-dimensional quantities, which have canonical units of 1. It would be weird to define a percentile number as %. The number in the range [1,99] refers to an integer quantity counting the position of the percentile. It is not a probability, despite abuses such as referring to the 99.9th percentile. This becomes clearer for other quantiles, such as the 3rd quartile, or the 8th decile.
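
The distinction drawn above can be made concrete with a short Python sketch. One function maps a probability to a data value (a percentile in the units of X, such as FWI90); the other maps a data value to a probability (dimensionless). The function names, the nearest-rank definition of the percentile, and the sample values are all assumptions chosen for illustration; other percentile definitions would give slightly different numbers.

```python
import math
from bisect import bisect_right


def value_at_probability(data, probability_percent):
    """Nearest-rank percentile: the data value (units of X) at a given probability."""
    ordered = sorted(data)
    # smallest value with at least probability_percent % of the data at or below it
    rank = max(1, math.ceil(probability_percent * len(ordered) / 100))
    return ordered[rank - 1]


def probability_at_value(data, value):
    """Empirical cumulative probability (in %) of a given value of X."""
    ordered = sorted(data)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Made-up index values, for illustration only.
fwi = [2.0, 5.0, 7.0, 11.0, 13.0, 17.0, 19.0, 23.0, 29.0, 31.0]

fwi90 = value_at_probability(fwi, 90)      # an FWI value: 29.0
p_of_20 = probability_at_value(fwi, 20.0)  # a probability in %: 70.0
```

The first result carries the units of the input data, while the second is a number in [0, 100] regardless of what X is.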

JonathanGregory commented 3 years ago

Dear @jesusff

Oh yes, you're right. Sorry I didn't notice that I had slipped into the confusion. I agree it's confusing that percentile in the sense Lars intended (a probability expressed in %), which is also the sense of my fictitious example of percentile_of_X as a data variable, is not the same as what @sethmcg meant in his real example of a percentile of FWI.

In that case, I think Lars's percentile should be called cumulative_probability_of_X and its canonical unit should be 1, as you say. Of course, it could still be given with units of %. Is that correct?

Best wishes

Jonathan

larsbarring commented 3 years ago

@jesusff, indeed, thank you.

@JonathanGregory I like the first part of the standard name you suggest, cumulative_probability. But coming back to my earlier comment, I do not see how this part is somehow related to (..._of_...) X. It is just a prescribed numeric value. As such it is not comparable to probability_distribution_of_X and histogram_of_X, which both are calculated from X.

But to complicate things, and build on @sethmcg's use case, I can imagine a situation where one would like to start with a cumulative probability value (percentile probability) and then calculate the corresponding percentile of a variable for some reference period, and then how the percentile probability of that particular value may change in some other period. While these two percentile probabilities are in principle similar they are used quite differently: the first is simply prescribed, the second is calculated from data.

I think that it might be useful to distinguish between these two uses. Could the prescribed one have standard name cumulative_probability_point and the latter one have cumulative_probability_of_X?

JonathanGregory commented 3 years ago

Dear @larsbarring

I agree with you that it's debatable whether we should refer to a coordinate variable (for a data variable of frequency of extremes, for instance) as cumulative_probability_of_X or as just cumulative_probability, regardless of X (which might be air_temperature, precipitation_amount, etc.) You might argue that we don't have coordinate variables for latitude_of_air_temperature and latitude_of_precipitation_amount, just latitude. Here are a few arguments in favour of including X in the standard name:

In fact most of those are arguments that we do have uses for cumulative_probability_of_X as a standard name. Those are different use-cases from yours. But if we have this standard name anyway, why not use it for coordinate variables always, as in your use-case, simply because it's more informative? If the absolute threshold for your derived statistic has a coordinate variable which identifies X, isn't it helpful that the cumulative probability (or percentile) coordinate should also identify X?

Best wishes

Jonathan

larsbarring commented 3 years ago

Dear @JonathanGregory,

I think that my point of view could be described as focussing on the 'fundamental nature' of the entity at hand, cumulative_probability. As I concluded above I do not want to prolong this debate and accept your points in favour of cumulative_probability_of_X.

Kind regards, Lars

JonathanGregory commented 3 years ago

Dear @larsbarring

Thanks for the discussion. It is a good exercise to work out the reasons. We shouldn't make things any more complicated than is useful. Regarding your interest in "fundamental nature", it has been commented before (not by me) that CF is all about "the essence of things". :-)

Best wishes

Jonathan

bzah commented 3 years ago

If I may bounce back on @larsbarring's and @sethmcg's example:

But to complicate things, and build on @sethmcg's use case, I can imagine a situation where one would like to start with a cumulative probability value (percentile probability) and then calculate the corresponding percentile of a variable for some reference period, and then how the percentile probability of that particular value may change in some other period. While these two percentile probabilities are in principle similar they are used quite differently: the first is simply prescribed, the second is calculated from data.

It seems very common to compute the percentile values on a reference period instead of the whole period.

How can we link this reference period to cumulative_probability_of_X ?

Illustrating with an example. Let's say:

1. I want the climate index **Rx90p**, which would translate to (if I understood the above discussion correctly) `number_of_days_with_air_temperature_above_threshold` with a coordinate variable `cumulative_probability_of_air_temperature = 90`.
2. My studied sample goes from 1950 to 2100.
3. I want to compute the percentile values only for the reference period 1950-1980.

The output would be something like (I'm stealing Klaus's example):

```
variables:
  float n1(lat,lon);
    n1:standard_name="number_of_days_with_air_temperature_above_threshold";
    n1:coordinates="percentile_threshold time";
    n1:cell_methods="time: maximum within days time: sum over days";
  float percentile_threshold;
    percentile_threshold:standard_name="percentile_of_air_temperature";
    percentile_threshold:units="%";
  double time;
    time:climatology="climatology_bounds";
    time:units="days since 1950-01-01";
  double climatology_bounds(time,nv);
data: // time coordinates translated to date/time format
  time="1950-01-01 6:00", ... ,"2100-12-31 6:00",;
  climatology_bounds="1950-01-01 6:00", "1950-01-05 6:00" ... ;
  percentile_threshold=90.;
```

But the reference period is missing. Thus the user cannot understand that the threshold used to retrieve `n1` was only computed on a part of the original data (1950-1980 in my example). I think it changes the analysis of `n1` a lot. Can we use something like `cumulative_probability_bounds`, somewhat similar to `climatology_bounds`? It could look like:

```
dimensions:
  ...
  bounds = 2
  ...
  float percentile_threshold;
    percentile_threshold:standard_name="percentile_of_air_temperature";
    percentile_threshold:units="%";
    percentile_threshold:bounds="percentile_threshold_bounds"
  double percentile_threshold_bounds(percentile_threshold, bounds)
data:
  ...
  percentile_threshold_bounds="1950-01-01", "1980-01-01"
```
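
To show numerically why the reference period matters for interpreting the counts, here is a hedged Python sketch. It derives the threshold from the reference years only, counts exceedances per year over the whole record, and compares that threshold with one derived from all years. The function names, the nearest-rank percentile definition, and the tiny made-up dataset are assumptions for illustration and say nothing about how the period should be encoded in CF.

```python
import math


def nearest_rank(data, probability_percent):
    """Nearest-rank percentile of data at the given cumulative probability (%)."""
    ordered = sorted(data)
    rank = max(1, math.ceil(probability_percent * len(ordered) / 100))
    return ordered[rank - 1]


def yearly_exceedance_counts(daily_max_by_year, probability_percent, reference_years):
    """Count days per year above a threshold taken from the reference years only."""
    reference_values = [
        value
        for year in reference_years
        for value in daily_max_by_year[year]
    ]
    threshold = nearest_rank(reference_values, probability_percent)
    counts = {
        year: sum(1 for value in values if value > threshold)
        for year, values in daily_max_by_year.items()
    }
    return threshold, counts


# Made-up daily maxima in K, five "days" per year to keep the example short.
daily_max_by_year = {
    1950: [280.0, 282.0, 284.0, 286.0, 288.0],
    1951: [281.0, 283.0, 285.0, 287.0, 289.0],
    2100: [286.0, 288.0, 290.0, 292.0, 294.0],
}

threshold, counts = yearly_exceedance_counts(
    daily_max_by_year, 90, reference_years=[1950, 1951]
)

# For comparison: the threshold obtained when every year is used.
all_values = [v for values in daily_max_by_year.values() for v in values]
threshold_all = nearest_rank(all_values, 90)
```

With these made-up numbers the reference-period threshold is 288.0 K, while the all-years threshold is 292.0 K, so the same cumulative probability of 90 % leads to different counts depending on the period used.
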
larsbarring commented 3 years ago

Hello Abel @bzah

I think that we need to distinguish between the two use cases:

With respect to your illustrative example, I guess that you intend the standard name for the percentile_threshold variable to be cumulative_probability_of_X, which is consistent with percentile_threshold:units="%" (there was a slight confusion earlier in the thread; see this comment, and this). This variable cannot have time bounds.

bzah commented 3 years ago

I read all the comments and still was confused. Thank you Lars for clarifying things for me once again!

zklaus commented 3 years ago

I think we have reached consensus on the use of specific cumulative_probability_of_X standard names, and I have updated the description at the top of this issue accordingly.

I have also added more examples that @larsbarring and I have been developing independently from @bzah, but coincidentally very much in the same spirit.

Following those examples, particularly Example 3, I would like to turn the discussion to the encoding of the reference period that is often used for the derivation of thresholds from time-series and given cumulative probability values. I would be very grateful if you could have a look at the examples and comment on whether you agree with the approach, whether you would like to see a change or clarification in the explanatory text, or whether you think some aspects of the reference period should be encoded in a different way.

JonathanGregory commented 3 years ago

Dear @zklaus

I agree with your intentions and values for the variables in Example 3 - thanks. I have a reservation about the status of the threshold variable. You have listed it in the coordinates attribute of the data variable, but it doesn't qualify as an auxiliary coordinate variable because its dimension reference_time is not a dimension of the data variable. I think this variable is a bit "more" than an auxiliary coordinate variable in function. It's more like a data variable containing a particular statistic of air temperature (95th percentile). Therefore I think it should be stored as an ancillary variable, which means it could have other dimensions than those of the data variable.

Also threshold could have cell_methods to describe how it was derived. If we were using the 50th percentile to define the threshold, its cell_methods would be reference_time: maximum within years time: median over years. We don't have a cell method for an arbitrary percentile, however. This might be a new thing we need to propose.
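
The remark that the 50th percentile coincides with the median can be checked with a few lines of Python. This sketch uses `statistics.quantiles` from the standard library with `method="inclusive"` (linear interpolation between data points), which is only one of several common percentile definitions and is not something CF prescribes; the sample values are made up.

```python
import statistics

# Made-up yearly maxima in K, for illustration only.
yearly_maxima = [301.0, 299.0, 304.0, 300.0, 302.0, 298.0]

median = statistics.median(yearly_maxima)

# With n=100, quantiles() returns the 99 cut points between percentiles,
# so index 49 is the 50th percentile and index 94 is the 95th.
cut_points = statistics.quantiles(yearly_maxima, n=100, method="inclusive")
p50 = cut_points[49]
p95 = cut_points[94]
```

Here `p50` agrees with `median`, whereas `p95` is a statistic that the existing "median" cell method cannot express.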

Best wishes

Jonathan

zklaus commented 3 years ago

Dear @JonathanGregory,

thanks for your comments. I agree that it makes a lot of sense to treat the threshold as an ancillary variable. Indeed, that makes me think that perhaps threshold in the original standard name should also be an ancillary variable instead of a scalar coordinate, backward compatibility notwithstanding. I am not sure this is really mandated by the conventions since they are a bit fuzzy on the details of auxiliary coordinates, and, for completeness' sake, I think it is conceivable to amend the concept of auxiliary coordinates to allow for it, but I like the formulation with ancillary variables better.

On cell methods, I also completely agree. Let's side-step the issue here by adopting the 50th percentile and going with median in the example and discuss the possible addition of a more complete set of cell methods in a separate issue.

Cheers Klaus

github-actions[bot] commented 1 year ago

This issue has had no activity in the last 30 days. This is a reminder to please comment on standard name requests to assist with agreement and acceptance. Standard name moderators are also reminded to review @feggleton @japamment

larsbarring commented 1 year ago

Before moving these suggested standard names towards acceptance, I would like to refer to the problems of having canonical units 1 for these number_of_days_... standard names. This has been discussed elsewhere (#110, as well as in cf-convention/discuss#190). Thus I suggest that the problem of finding an adequate and user-friendly canonical unit for the number_of_days_... standard names is coordinated between this issue and cf-convention/vocabularies#14.

davidhassell commented 1 year ago

Thanks, @larsbarring, that makes sense.

github-actions[bot] commented 2 months ago

This issue has had no activity in the last 30 days. Accordingly:

Standard name moderators are also reminded to review @feggleton @japamment @efisher008