Open zklaus opened 3 years ago
Pinging participants of the CF 2021 Conventions Discussion that expressed interest: @bzah, @jesusff, @japamment, @larsbarring.
Dear @zklaus and others
Thanks for this. It seems fine to me to allow a threshold to be specified as a percentile, but I wonder whether there should be a different standard name percentile_of_X for each X. It seems asymmetric that you specify X when the threshold is in the units of X but not when it's a percentile. I realise that all percentiles are in the same unit, i.e. dimensionless, but that in itself isn't a reason to give them all the same standard name, because:
- There are many groups of standard names with the same canonical unit. For example, we don't have just one standard name for "temperature", and all direction_of standard names are angles.
- The standard name guidelines specify rules for probability_distribution_of_X and histogram_of_X, which are dimensionless, like percentile, although they contain X. There are a couple of histogram names in the standard name table and one probability_distribution.
- In some circumstances it could be useful to have different percentile names for different X. For example, you could imagine a quantity which depended on both percentile_of_air_temperature and percentile_of_lwe_precipitation_rate.
Best wishes
Jonathan
Dear @JonathanGregory,
thanks for your comments. I am certainly open to the introduction of more specific percentile_of_X standard names. I see no downside and it would indeed provide the only encoding of the link between X and the corresponding percentile. I don't feel strongly about this, so if others disagree, let's discuss. For now, I will adapt the proposal accordingly.
Dear all,
First, with reference back to the conversation during the CF2021 workshop breakout discussion, I would like to clarify that by percentile we here mean the numeric values in the range [0, 100] (inclusive), sometimes called "percentile probability".
Now over to @JonathanGregory's comment:
I wonder if we actually do need to have a different standard name for each X, i.e. percentile_of_X?
There is nothing at all specific about a percentile_of_air_temperature that sets it apart from a percentile_of_lwe_thickness_of_precipitation_amount; both are just numeric values between 0 and 100. A percentile variable is just an auxiliary coordinate used as a "helper" (here as a threshold) to produce the main data variable. As such it is different from both probability_distribution_of_X and histogram_of_X, which can both be seen as the main information-carrying variable or end product. In addition, with few exceptions it is not meaningful to compare (or otherwise relate) probability_distribution_of_X to probability_distribution_of_Y if X and Y are different standard names (variables).
Finally, with reference to your third point, indeed there are use-cases for having multiple percentiles associated to different variables. Sometimes (often in fact) it is the same percentile value (e.g. 25 or 75), in which case one percentile variable will be enough (often actually meaningful in its context, else at least succinct). If different percentile values are needed two auxiliary coordinates are required, each one having its own [well chosen] variable name. But they can still have the same standard name percentile without risk of confusion.
Dear @larsbarring
Regarding:
There are use-cases for having multiple percentiles associated to different variables. ... If different percentile values are needed two auxiliary coordinates are required, each one having its own [well chosen] variable name. But they can still have the same standard name percentile without risk of confusion.
I think there would be risk of confusion. Variable names are arbitrary and meaningless in CF; some of the ways in which CF data are stored do not preserve variable names. I think that if you present a program with two coordinate variables that have the same standard name and units, it will cause some problem; disambiguating them would rely on some other non-standardised attribute like long_name. Why not distinguish them by standard name? They evidently must have different meanings.
In your use-cases, the percentile is a coordinate variable, but it might become a data variable. It is conceivable that you could have a latitude-longitude field of temperature percentile corresponding to a specified temperature value, for example. I think you would want to identify the field as percentile_of_air_temperature in that case, not just percentile.
It's true that percentiles are just numbers between 0 and 100. But sea ice fraction and cloud fraction are just numbers between 0 and 1, yet they are not the same quantity.
Best wishes
Jonathan
Dear @JonathanGregory
Regarding your first point, I both agree and don't. I agree that variable names are arbitrary and meaningless in CF, and that relying on long_name or other attributes is likely to be fragile. But what I meant by "variable name" is more accurately described by the data model, and specifically by the arrows and constructs that link the data variable to its coordinate variables. In netCDF it is the variable names that do this. I do not think that the standard name necessarily fulfils the criterion of uniquely linking the data variable to its coordinates. Just consider a file holding climate index data (as per Klaus's example above) based on two input variables, e.g. 2m air temperature and air temperature at pressure levels. The index for the 2m temperature is based on the 10th percentile, and the upper air index is based on the 5th percentile. Same standard name for the input data (air_temperature), same standard name for the thresholds (percentile_of_air_temperature), and same standard name for the indices. This [theoretical] example shows that we cannot rely on the standard name either.
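To make the linking argument concrete, here is a minimal sketch that uses plain Python dictionaries as a stand-in for the contents of a netCDF file. The variable names (index_2m, pctl_upper, and so on) and the choice of number_of_days_with_air_temperature_below_threshold as the index standard name are assumptions made purely for illustration. The sketch only shows that a lookup by standard name returns both threshold variables, while the coordinates attribute resolves each link.

```python
# Hypothetical in-memory stand-in for a file with two climate indices.
# All variable names and values here are illustrative assumptions.
INDEX_NAME = "number_of_days_with_air_temperature_below_threshold"

variables = {
    "index_2m": {"standard_name": INDEX_NAME, "coordinates": "pctl_2m"},
    "index_upper": {"standard_name": INDEX_NAME, "coordinates": "pctl_upper"},
    "pctl_2m": {"standard_name": "percentile_of_air_temperature", "value": 10.0},
    "pctl_upper": {"standard_name": "percentile_of_air_temperature", "value": 5.0},
}


def find_by_standard_name(standard_name):
    """Return the names of all variables carrying the given standard name."""
    return sorted(
        name
        for name, attrs in variables.items()
        if attrs.get("standard_name") == standard_name
    )


def coordinates_of(name):
    """Return the variable names listed in the coordinates attribute."""
    return variables[name].get("coordinates", "").split()


# The standard name alone matches both thresholds ...
ambiguous = find_by_standard_name("percentile_of_air_temperature")
# ... whereas the link by variable name picks out exactly one of them.
upper_threshold = variables[coordinates_of("index_upper")[0]]["value"]
```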
Regarding your second point, it is a different use of this standard name from the one we are now discussing, and I do not know of such a use case. But I take your point and agree that in principle this could happen and be of interest.
On your third point, I think that it depends on what you mean by quantity. One way to look at it is that the quantity is the fraction of the area covered by X, and what X is is another matter.
But, finally, having now aired some arguments against percentile_of_X and indirectly for percentiles, I do not want to turn this into something that delays or stalls progress, so I can accept your view in favour of the former.
Kind regards, Lars
I have a use case for percentiles as a data variable! We are currently working on a project looking at fire indexes calculated from climate model outputs, and we have found that looking at the percentiles of the fire indexes as a spatial field is useful. If we want to publish that data in CF-compliant format, we will need a percentile_of_X type standard name. We've run into similar cases working with snowpack, heatwaves, drought, extreme precipitation, and other such variables important to climate impacts.
(I do think it would be valuable to consider a mechanism for defining standard names that adhere to a formula like X_of_Y_in_Z in an automatic and implicit way rather than doing it explicitly, but that's a separate topic.)
@sethmcg Thanks for your use case. Then percentile_of_X it is.
Dear @larsbarring
Thanks for your flexibility. You are right that the main point is that I support percentiles! I also agree with you that the standard name is not always sufficient metadata, but I think it helps to provide whatever metadata we can conveniently do, within the framework of the conventions.
Best wishes
Jonathan
OK, I will open a new issue for discussing in more detail the new standard names percentile_of_X, and we can here continue the main discussion on adding percentile thresholds to the existing standard names.
Dear all, I think we are introducing some confusion here due to the use of the word "percentile" to refer to the probability associated to the percentile (@larsbarring mentioned this already but the discussion went on; we also discussed this during the CF workshop BOG). The percentile_of_X would have the units of X. I'm afraid the use-case put forward by @sethmcg refers to quantities such as FWI90 (the 90th percentile of the Fire Weather Index), which is a FWI value. A completely different scenario was mentioned by @JonathanGregory above:
It is conceivable that you could have a latitude-longitude field of temperature percentile corresponding to a specified temperature value, for example. I think you would want to identify the field as percentile_of_air_temperature in that case, not just percentile.
Here, a value of temperature would be fixed (say 20 degC) and the field would be percentile probabilities corresponding to that value in each point. This is why I think we should re-consider including the word probability in the standard name. I would just name it "percentile_probability" or, better, "quantile_probability" (to avoid future requests to have standard names for other particular quantile names such as quartiles or terciles, often used in other applications). Also, this would better comply with other standard name definitions for non-dimensional quantities, which have canonical units of 1. It would be weird to define a percentile number as %. The number in the range [1,99] refers to an integer quantity counting the position of the percentile. It is not a probability, despite abuses such as referring to the 99.9th percentile. This becomes clearer for other quantiles, such as the 3rd quartile, or the 8th decile.
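The two scenarios can be illustrated with a small sketch, under stated assumptions: the grid points and temperature samples are made up, and a simple nearest-rank rule is used for the percentile (real tools may interpolate differently). The first field holds percentile values in the units of X (the FWI90-like case); the second holds the cumulative probability, in %, of a fixed value of 20 degC at each point.

```python
import math


def value_at_probability(samples, probability_percent):
    """Value of X at a prescribed cumulative probability (nearest-rank rule)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(probability_percent * len(ordered) / 100))
    return ordered[rank - 1]


def probability_at_value(samples, value):
    """Percentage of samples that are less than or equal to a fixed value of X."""
    return 100 * sum(1 for s in samples if s <= value) / len(samples)


# Hypothetical air temperature samples (degC) at two (lat, lon) grid points.
samples_by_point = {
    (50.0, 10.0): [12.0, 15.0, 17.0, 18.0, 19.0, 20.0, 21.0, 23.0, 25.0, 30.0],
    (60.0, 10.0): [5.0, 8.0, 10.0, 11.0, 12.0, 13.0, 15.0, 16.0, 18.0, 22.0],
}

# Field in units of X: the 90th percentile of temperature at each point.
t90_field = {p: value_at_probability(s, 90) for p, s in samples_by_point.items()}

# Dimensionless field: cumulative probability (%) of 20 degC at each point.
p20_field = {p: probability_at_value(s, 20.0) for p, s in samples_by_point.items()}
```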
Dear @jesusff
Oh yes, you're right. Sorry I didn't notice that I had slipped into the confusion. I agree it's confusing that percentile (in the sense Lars intended, which means a probability as %, and as in my fictitious example of percentile_of_X in the data variable) is not the same as what @sethmcg meant in his real example of percentile of FWI.
In that case, I think Lars's percentile should be called cumulative_probability_of_X and its canonical unit should be 1, as you say. Of course, it could still be given with units of %. Is that correct?
Best wishes
Jonathan
@jesusff, indeed, thank you.
@JonathanGregory I like the first part of the standard name you suggest, cumulative_probability. But coming back to my earlier comment, I do not see how this part is somehow related to (..._of_...) X. It is just a prescribed numeric value. As such it is not comparable to probability_distribution_of_X and histogram_of_X, which both are calculated from X.
But to complicate things, and build on @sethmcg's use case, I can imagine a situation where one would like to start with a cumulative probability value (percentile probability) and then calculate the corresponding percentile of a variable for some reference period, and then how the percentile probability of that particular value may change in some other period. While these two percentile probabilities are in principle similar they are used quite differently: the first is simply prescribed, the second is calculated from data.
I think that it might be useful to distinguish between these two uses. Could the prescribed one have standard name cumulative_probability_point and the latter one have cumulative_probability_of_X?
Dear @larsbarring
I agree with you that it's debatable whether we should refer to a coordinate variable (for a data variable of frequency of extremes, for instance) as cumulative_probability_of_X or as just cumulative_probability, regardless of X (which might be air_temperature, precipitation_amount, etc.). You might argue that we don't have coordinate variables for latitude_of_air_temperature and latitude_of_precipitation_amount, just latitude. Here are a few arguments in favour of including X in the standard name:
- Unlike latitude, longitude, time, etc., cumulative probability is not a geolocated coordinate. It doesn't have an absolute meaning. That explains why it may need X to be self-explanatory whereas those others don't.
- As you say yourself, the cumulative probability might really have been calculated from data, for which we would want cumulative_probability_of_X for the coordinate variable. My second argument in https://github.com/cf-convention/vocabularies/issues/19, where the cumulative probability is a data variable, is another reason for wanting cumulative_probability_of_X as a standard name.
- My first argument in https://github.com/cf-convention/vocabularies/issues/19: you might have a data variable which depended simultaneously on the cumulative probability of X and Y, and you want to distinguish those two coordinate variables. You countered this by saying that X and Y might have the same standard name, so this might not solve the problem. I agree that's true, but I think that they're more likely to be different, in which case it would help to have distinct standard names for them.
- Simpler than that argument, we might consider the case where you have a data variable of something that is a function of X but not a derived statistic of X. For example, you might consider the precipitation_amount (data variable) as a function of the cumulative_probability_of_air_temperature (coordinate variable). If you drew it as a graph, you'd label the axes like that. That's yet another reason for needing cumulative_probability_of_X.
In fact most of those are arguments that we do have uses for cumulative_probability_of_X as a standard name. Those are different use-cases from yours. But if we have this standard name anyway, why not use it for coordinate variables always, as in your use-case, simply because it's more informative? If the absolute threshold for your derived statistic has a coordinate variable which identifies X, isn't it helpful that the cumulative probability (or percentile) coordinate should also identify X?
Best wishes
Jonathan
Dear @JonathanGregory,
I think that my point of view could be described as focussing on the 'fundamental nature' of the entity at hand, cumulative_probability. As I concluded above, I do not want to prolong this debate and accept your points in favour of cumulative_probability_of_X.
Kind regards, Lars
Dear @larsbarring
Thanks for the discussion. It is a good exercise to work out the reasons. We shouldn't make things any more complicated than is useful. Regarding your interest in "fundamental nature", it has been commented before (not by me) that CF is all about "the essence of things". :-)
Best wishes
Jonathan
If I may bounce back on @larsbarring and @sethmcg example:
But to complicate things, and build on @sethmcg's use case, I can imagine a situation where one would like to start with a cumulative probability value (percentile probability) and then calculate the corresponding percentile of a variable for some reference period, and then how the percentile probability of that particular value may change in some other period. While these two percentile probabilities are in principle similar they are used quite differently: the first is simply prescribed, the second is calculated from data.
It seems very common to compute the percentile values on a reference period instead of the whole period.
How can we link this reference period to cumulative_probability_of_X?
Hello Abel @bzah
I think that we need to distinguish between the two use cases:
- In your use case the cumulative_probability_of_X is a constant (or possibly a list of constants) that is just set by the analyst. It has nothing specific to do with any particular reference period. But it is used to calculate a climatology of X for some reference period. The reference climatology thus 'depends' on the cumulative_probability_of_X and on the reference period. The latter can be recorded in a variable having standard name reference_epoch. Both of these could be linked as coordinates of the variable storing the reference climatology. This is the typical situation for the case of climate indices. If, however, you do not want to store the reference climatology (which would hold the actual percentile threshold values), then I think that the reference period information could be linked as a coordinate variable to the main data variable (the one having standard name number_of_days_with_X_...).
- In the other use case the situation for cumulative_probability_of_X is somewhat reversed. Here the data for X is given and from that the probability is somehow calculated. Often, the calculation is based on some period of time, which might be shorter than the full data period covered by X. And the resulting variable would often be a 2d-field (x, y), or possibly 3d (n, x, y) if several thresholds or reference epochs are used.
With respect to your illustrative example I guess that you intend the standard name for the percentile_threshold variable to be cumulative_probability_of_X, which is consistent with percentile_threshold:units="%" (there was a slight confusion earlier in the thread; see this comment, and this). This variable cannot have time bounds.
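The two use cases can be sketched in a few lines of plain Python, assuming made-up yearly samples of X, a reference period of 1961 to 1962, and a nearest-rank percentile rule (all of these are illustrative assumptions, not part of the proposal). In the first step the cumulative probability is prescribed and yields a threshold in the units of X for the reference period; in the second step the probability is calculated from data for a later period.

```python
import math
from bisect import bisect_right


def pooled_samples(values_by_year, years):
    """Sort all samples of X that fall within the given years."""
    return sorted(v for year in years for v in values_by_year[year])


def threshold_from_reference(values_by_year, reference_years, probability_percent):
    """Use case 1: prescribed probability -> threshold in the units of X."""
    pooled = pooled_samples(values_by_year, reference_years)
    rank = max(1, math.ceil(probability_percent * len(pooled) / 100))
    return pooled[rank - 1]


def probability_of_value(values_by_year, years, value):
    """Use case 2: given X data and a value -> calculated probability in %."""
    pooled = pooled_samples(values_by_year, years)
    return 100 * bisect_right(pooled, value) / len(pooled)


# Hypothetical samples of X, five per year.
x_by_year = {
    1961: [1, 3, 5, 7, 9],
    1962: [2, 4, 6, 8, 10],
    1991: [3, 5, 7, 9, 11],
    1992: [4, 6, 8, 10, 12],
}

prescribed_probability = 80  # in %, simply set by the analyst
threshold = threshold_from_reference(x_by_year, [1961, 1962], prescribed_probability)
later_probability = probability_of_value(x_by_year, [1991, 1992], threshold)
```

In this toy data the threshold that sits at 80 % in the reference period sits at only 60 % in the later period, which is the kind of change the second use case would record.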
I read all the comments and still was confused. Thank you Lars for clarifying things for me once again!
I think we have reached consensus on the use of specific cumulative_probability_of_X standard names, and I have updated the description at the top of this issue accordingly.
I have also added more examples that @larsbarring and I have been developing independently from @bzah, but coincidentally very much in the same spirit.
Following those examples, particularly Example 3, I would like to turn the discussion to the encoding of the reference period that is often used for the derivation of thresholds from time-series and given cumulative probability values. I would be very grateful if you could have a look at the examples and comment on whether you agree with the approach, whether you would like to see a change or clarification in the explanatory text, or whether you think some aspects of the reference period should be encoded in a different way.
Dear @zklaus
I agree with your intentions and values for the variables in Example 3 - thanks. I have a reservation about the status of the threshold variable. You have listed it in the coordinates attribute of the data variable, but it doesn't qualify as an auxiliary coordinate variable because its dimension reference_time is not a dimension of the data variable. I think this variable is a bit "more" than an auxiliary coordinate variable in function. It's more like a data variable containing a particular statistic of air temperature (95th percentile). Therefore I think it should be stored as an ancillary variable, which means it could have other dimensions than those of the data variable.
Also, threshold could have cell_methods to describe how it was derived. If we were using the 50th percentile to define the threshold, its cell_methods would be reference_time: maximum within years time: median over years. We don't have a cell method for an arbitrary percentile, however. This might be a new thing we need to propose.
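The dimension argument can be written down as a small check. This is a simplified sketch: it only encodes the rule that the dimensions of an auxiliary coordinate variable must be a subset of those of the data variable, and ignores special cases such as string-valued coordinates. The dimension names for the data variable (time, lat, lon) are an assumption based on the description of Example 3.

```python
def can_be_auxiliary_coordinate(candidate_dims, data_dims):
    """True if every dimension of the candidate is also a dimension of the data variable."""
    return set(candidate_dims) <= set(data_dims)


# Assumed dimensions, following the description of Example 3.
n1_dims = ("time", "lat", "lon")                  # data variable
threshold_dims = ("reference_time", "lat", "lon")  # field of thresholds
percentile_threshold_dims = ()                    # scalar

# reference_time is not a dimension of n1, so threshold fails the check
# and would need another role, for example an ancillary variable.
threshold_ok = can_be_auxiliary_coordinate(threshold_dims, n1_dims)
percentile_ok = can_be_auxiliary_coordinate(percentile_threshold_dims, n1_dims)
```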
Best wishes
Jonathan
Dear @JonathanGregory,
thanks for your comments. I agree that it makes a lot of sense to treat the threshold as an ancillary variable. Indeed, that makes me think that perhaps threshold in the original standard name should also be an ancillary variable instead of a scalar coordinate, backward compatibility notwithstanding. I am not sure this is really mandated by the conventions, since they are a bit fuzzy on the details of auxiliary coordinates. For completeness' sake, I think it is conceivable to amend the concept of auxiliary coordinates to allow for it, but I like the formulation with ancillary variables better.
On cell methods, I also completely agree. Let's side-step the issue here by adopting the 50th percentile and going with median in the example and discuss the possible addition of a more complete set of cell methods in a separate issue.
Cheers Klaus
This issue has had no activity in the last 30 days. This is a reminder to please comment on standard name requests to assist with agreement and acceptance. Standard name moderators are also reminded to review @feggleton @japamment
Before moving these suggested standard names towards acceptance, I would like to refer to the problems of having canonical units 1 for these number_of_days_... standard names. This has been discussed elsewhere (#110, as well as in cf-convention/discuss#190). Thus I suggest that the problem of finding an adequate and user-friendly canonical unit for the number_of_days_... standard names is coordinated between this issue and cf-convention/vocabularies#14.
Thanks, @larsbarring, that makes sense.
This issue has had no activity in the last 30 days. Accordingly:
Standard name moderators are also reminded to review @feggleton @japamment @efisher008
Introduction
This issue describes a proposed change to the description text of existing, threshold-based standard names. It is the result of a number of discussions, most recently at the 2021 CF Conventions, climate index breakout group.
To allow for concrete discussions, the proposed change is first discussed as a concrete example. As such, it is based on the following current definition.
Changelog
This changelog is intended to allow for quickly catching up. If you are new to the issue or are coming back to it after some time, this summary should give you the most important information and you need to start reading only after the last comment mentioned in the following table.
Please let me know if you feel the table does not reflect the consensus appropriately!
and including
percentile(_of_X) with cumulative_probability_of_X
Current Definition
number_of_days_with_air_temperature_above_threshold
Air temperature is the bulk temperature of the air, not the surface (skin) temperature. A variable whose standard name has the form number_of_days_with_X_below|above_threshold is a count of the number of days on which the condition X_below|above_threshold is satisfied. It must have a coordinate variable or scalar coordinate variable with the standard name of X to supply the threshold(s). It must have a climatological time variable, and a cell_methods entry for within days which describes the processing of quantity X before the threshold is applied. A number_of_days is an extensive quantity in time, and the cell_methods entry for over days should be "sum".
Proposed Definition
In the following proposed definition, the first paragraph is unchanged except for the removal of the sentence about the threshold coordinate variable, which is in its modified form in the second paragraph.
number_of_days_with_air_temperature_above_threshold
Air temperature is the bulk temperature of the air, not the surface (skin) temperature. A variable whose standard name has the form number_of_days_with_X_below|above_threshold is a count of the number of days on which the condition X_below|above_threshold is satisfied. It must have a climatological time variable, and a cell_methods entry for within days which describes the processing of quantity X before the threshold is applied. A number_of_days is an extensive quantity in time, and the cell_methods entry for over days should be "sum".
It must give information about the threshold in one or both of the following two ways: with an explicit threshold in a coordinate variable or scalar coordinate variable with the standard name of X, or with a percentile threshold given in a scalar coordinate variable with the standard name cumulative_probability_of_X.
Implied Changes
The proposed definition given above requires the addition of new standard names, cumulative_probability_of_X. The proposed standard for this is:
- Term: cumulative_probability_of_air_temperature
- Description: A probability percentile.
- Units: % (canonical units: 1)
Examples
Example 1: Only percentile threshold
This example aims to be as close to CF Conventions 1.9, Example 7.12 as possible, while still introducing the concept of percentile threshold.
It differs in the following ways:
- n2 (spell length) has been removed for simplification.
- ..._below_... to ..._above_... to follow this issue.
Example 2: Only percentile threshold, timeseries
This example follows on the heels of Example 1. The only change is that here we are talking about a longer timeseries, where we are giving the number of days above a threshold per year for several years running.
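As a rough sketch of the computation behind such a timeseries, the snippet below counts, for each year, the days on which a made-up temperature series exceeds a threshold derived from a prescribed percentile. The data, the choice of the 75th percentile, the pooling of all years, and the nearest-rank rule are assumptions for illustration only.

```python
import math

# Hypothetical daily temperatures (degC), four "days" per year for brevity.
daily_by_year = {
    1951: [18.0, 22.0, 25.0, 19.0],
    1952: [21.0, 27.0, 26.0, 20.0],
    1953: [17.0, 16.0, 23.0, 24.0],
}
percentile_threshold = 75.0  # cumulative probability in %

# Derive the threshold in degC from all days pooled (nearest-rank rule).
pooled = sorted(v for values in daily_by_year.values() for v in values)
rank = max(1, math.ceil(percentile_threshold * len(pooled) / 100))
threshold = pooled[rank - 1]

# Within each year, sum the days on which the condition "above threshold" holds.
n1 = {
    year: sum(1 for v in values if v > threshold)
    for year, values in daily_by_year.items()
}
```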
Example 3: Percentile and numerical threshold
The following example contains data that has been computed for a threshold derived from the percentile of a climatology.
- n1 contains the number of days per year above that threshold. Note that the time coordinate is a dimensional coordinate and not climatological.
- percentile_threshold is the scalar that gives the percentile that underpins the threshold.
- threshold is the field of thresholds over space and reference_time, meaning essentially day-of-year, but see reference_time below for details.
- reference_time gives the reference period that was used for the calculation of the threshold from the percentile. In this case, it is derived from a 5 day window centered on each day of year over a 30 year climatology.

data: // time coordinates translated to date/time format
  percentile_threshold = 95.;
  time = "1951-7-1", "1952-7-1", ..., "2000-7-1", "2001-7-1";
  time_bounds = "1951-1-1", "1952-1-1",
                "1952-1-1", "1953-1-1",
                ...,
                "2000-1-1", "2001-1-1",
                "2001-1-1", "2002-1-1";
  reference_time = "1985-1-1", "1985-1-2", "1985-1-3", ....
                   "1985-12-29", "1985-12-30", "1985-12-31";
  reference_time_bounds = "1960-12-30", "1990-1-3",
                          "1960-12-31", "1990-1-4",
                          "1961-1-1", "1990-1-5",
                          ....
                          "1961-12-27", "1990-12-31",
                          "1961-12-28", "1991-1-1",
                          "1961-12-29", "1991-1-2",