Closed sethmcg closed 3 months ago
No thoughts as yet, but it reminded me of cf-convention/vocabularies#188 and cf-convention/vocabularies#27, which I have only skimmed to remind me that they were talking about similar themes.
Dear @davidhassell and @sethmcg
David, thanks for reminding us about https://github.com/cf-convention/vocabularies/issues/188, which does seem to be the answer to the question - would you agree, @sethmcg, or have I misunderstood? In fact, https://github.com/cf-convention/vocabularies/issues/27 mentions air_temperature_anomaly as being one of the quantities which might use reference_epoch. The suggestion is to record the bounds of the climatology as the bounds of a size-1 or scalar coordinate variable with standard name of reference_epoch, which is now in the standard table. Its description is:
The period of time over which a parameter has been summarised (usually by averaging) in order to provide a reference (baseline) against which data has been compared. When a coordinate, scalar coordinate, or auxiliary coordinate variable with this standard name has bounds, then the bounds specify the beginning and end of the time period over which the reference was determined. If the reference represents an instant in time, rather than a period, then bounds may be omitted. It is not the time for which the actual measurements are valid; the standard name of time should be used for that.
For example
variables:
float delta_tas(time);
delta_tas:standard_name="air_temperature_anomaly";
delta_tas:units="degC";
delta_tas:coordinates="climatology";
delta_tas:cell_methods="time: maximum";
double time(time);
time:standard_name="time";
time:units="days since 2023-7-16";
time:bounds="time_bounds";
double time_bounds(time,two);
double climatology;
climatology:standard_name="reference_epoch";
climatology:units="days since 1990-1-1";
climatology:bounds="climatology_bounds";
double climatology_bounds(two);
data:
time_bounds=0,1, 1,2, 2,3, 3,4;
climatology_bounds=0,10957;
indicates that delta_tas contains daily maximum temperatures for 16th-19th July 2023 expressed as anomalies with respect to the climatology of 1990-2019. Is that what we need?
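As a quick check of the numbers in the example above, here is a minimal Python sketch that decodes the two sets of bounds. It assumes the standard (Gregorian) calendar and uses only the standard library; the helper name decode_days is made up for illustration and is not part of CF or any CF tool.

```python
from datetime import date, timedelta

def decode_days(offset_days, origin):
    """Convert a 'days since <origin>' value to a date (standard calendar assumed)."""
    return origin + timedelta(days=offset_days)

# climatology:units = "days since 1990-1-1", climatology_bounds = 0, 10957
climatology_origin = date(1990, 1, 1)
start, end = (decode_days(b, climatology_origin) for b in (0, 10957))
print(start, end)  # 1990-01-01 2020-01-01

# time:units = "days since 2023-7-16", time_bounds = 0,1, 1,2, 2,3, 3,4
time_origin = date(2023, 7, 16)
time_bounds = [(0, 1), (1, 2), (2, 3), (3, 4)]
days = [decode_days(lower, time_origin) for lower, _ in time_bounds]
print(days[0], days[-1])  # 2023-07-16 2023-07-19
```

So the reference epoch bounds span exactly the 30 years 1990-2019, and the four cells of the time axis are the days 16th-19th July 2023.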
It occurs to me that it's not recorded what sort of climatology this was: is it annual means, monthly means, monthly maxima, maxima for the day of the year? This would be apparent from the cell_methods of the climatology, which has not been preserved.
Best wishes
Jonathan
Ah, perfect! Yes, that's exactly what I was looking for. Thanks much @davidhassell for the pointer to those issues and @JonathanGregory for the excellent example usage!
I note that cf-convention/vocabularies#27 is still open, so I could go jump in there and add a couple more standard names for consideration to have that reference epoch sentence added to their description. That might also jumpstart the conversation and move the issue forward.
For the specific use case that prompted this issue (reference period for calculating KBDI), I can follow Jonathan's example and we're all set. (And that applies regardless of whether and when we update the standard_name description for KBDI to mention reference_epoch.)
However, I have thought of two use cases for other data that I work with that have me wondering whether we need to add something to the Conventions about reference epochs in general, and not just have it mentioned in the description of certain standard names.
First: bias-correction. When you apply a bias-correction to model output, you're doing some kind of transformation (maybe simple, maybe hideously complex) to make your model dataset look (statistically) more like your observational dataset. That observational dataset has some finite temporal coverage, and reference_epoch seems like just the thing to capture that information. However, this could be applied to any variable, not just the anomaly variables discussed in cf-convention/vocabularies#27. So that seems to warrant some mention somewhere outside the Standard Name Table. This is also a case where you might have a reference_epoch that was an array of values (e.g., if you've done the bias-correction on a monthly basis), which is something @martinjuckes mentions in cf-convention/vocabularies#27, and which probably deserves more fleshing-out than the description of reference_epoch in the SNT.
Second -- and I've actually been doing this recently -- if you're looking at something like the change in average zg500 anomaly on rainy days, you want to track two reference periods: one for the past and one for the future. So do you need two such variables with different names? Do you need to add a categorical dimension (past, future) to the variable? Something else entirely? In any case, I think it warrants some consideration.
What's the best way to proceed from here? Close this ticket and bring these issues up in cf-convention/vocabularies#27? Open a new ticket with a better name? Keep the discussion here going and cross-reference it over there?
Hi @sethmcg, I think it would be most welcome if you would be able to update and move cf-convention/vocabularies#27 forward. And I agree that @martinjuckes's comment on having multiple reference periods is a relevant one.
However, for your second use case, do you really need two reference periods? Wouldn't it be enough to make the first period the reference_period, and then use the bounds of the anomaly variable to record the extent of the second?
Regarding your last question, would it be possible to split your ideas and suggestions so that what is directly relevant for cf-convention/vocabularies#27 goes there, and then we can continue the broader conversation here?
Regarding your first use case (on bias adjustment), I have been thinking about this for some time now and the reference period (reference_epoch) is one component, but there are other equally important ones, like what dataset is used as reference.
I've added the FAQ label to this issue to mark it for inclusion in the FAQ when we have finished answering it.
@larsbarring I like your suggestion to move the issues of the standard names to cf-convention/vocabularies#27 and continue the general discussion here. I will do that.
You're right that for the second use case, a reference_period combined with climatology bounds on the anomaly variable captures the two periods, so I think we can regard that one as covered. Thanks!
On the more complex topic of bias-correction, one point we could start with is whether the different components can be separated, or whether a solution should encompass all of them. I think they can and should be separated, so that e.g., we address the issue of the reference_epoch independently of how we address the issue of referring to the reference dataset, which is in turn addressed separately from the question of how data was divided up by month/season/moving window. That allows you to use only the necessary elements and omit those that don't apply (and may not even make sense), and also makes it easy to extend the convention in the future as new requirements come on the scene. What that implies, though, is that we need some way to tie all of those elements together and identify them as capturing information about the bias-correction (or whatever other higher-level processing applies). The first solution that springs to mind is something like grid_mapping, where you have a dummy variable that you attach all the attributes to but that has no actual data values, and that is referred to by an attribute on the main data variable. Would it make sense to try and extend that pattern? Thoughts?
@JonathanGregory Thanks. The FAQ does seem like the right place to put that explanation. Is there anything in particular we should make sure to address to generate a good Frequently Provided Answer?
Dear @sethmcg
The FAQ does seem like the right place to put that explanation. Is there anything in particular we should make sure to address to generate a good Frequently Provided Answer?
No, there are no guidelines for this. Anything which you, as a questioner, might find useful in the answer! We haven't been adding to the FAQ, and I feel that it would be useful to do so.
Best wishes and thanks
Jonathan
Dear @JonathanGregory ,
In your answer Jul 19th you note:
It occurs to me that it's not recorded what sort of climatology this was: is it annual means, monthly means, monthly maxima, maxima for the day of the year? This would be apparent from the cell_methods of the climatology, which has not been preserved.
Isn't the issue that the reference_epoch variable is tasked to record the time period, and not how the reference value (aka climatology) was computed?
If we look at the way CF allows climatologies to be stored, we see that the :cell_methods attribute describing how the climatology was computed is on the variable storing the climatological value, while the time period over which the climatology is computed is encoded in the time variable with the attribute :climatology.
Should the variable holding the anomaly also store how the climatology it refers to was prepared? For example a reference_cell_methods, in addition to the reference_epoch?
At a more general level, I am surprised that the way to store an anomaly (standard_name x_anomaly + reference_epoch) and the way to store climatologies (attribute :climatology) have so little in common. Anomalies and climatologies are not the same thing, but their descriptions in the CF world could probably be more streamlined.
Best wishes, Thomas
Dear Thomas @TomLav
Thanks for your comment. It might be helpful to distinguish two meanings of "climatology". (1) The reference data variable that was used to compute the anomalies stored in another data variable. (2) A data variable which has a climatological time dimension.
The climatology attribute (instead of the bounds attribute) is used to name the bounds variable of a climatological time variable, in sense (2). It should be used if and only if the cell_methods entry for that dimension shows that it's climatological time, as described in Sect 7.4, i.e. cell_methods has within and over in it. This is to describe statistics for variation within the diurnal or the annual cycle, over many repetitions of the cycles.
The climatology in sense (1) doesn't have to be a climatology in sense (2), but it can be. For instance, you could calculate anomalies for monthly means wrt a 30-year mean or wrt 30-year monthly means. The latter is a climatology in sense (2) and the former is not.
The solution we've discussed in this issue proposes to use the reference_epoch standard name for a scalar time dimension, to record the time bounds of a climatology in sense (1). That's useful information, but it isn't necessarily enough. For example, if you have a 20th September anomaly wrt a reference_epoch of 1990-2019, does that mean it's a difference from the 30-year mean, from the 30-year September mean, or from the 30-year mean of 20th September? In the cell_methods of the reference climatology, the first one is not climatological time in sense (2). The cell_methods would have just time: mean for the 30-year mean. The other two cases are both time: mean within years time: mean over years, and the reference epoch time bounds are the same for both, namely midnight on 1st January 1990 and midnight on 1st January 2020.
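To make the ambiguity concrete, here is a small Python sketch that computes the three candidate reference values over the same reference epoch and shows that they give different anomalies. The daily temperature series is entirely made up (the function synthetic_temperature is an assumption for illustration only), and the standard calendar is assumed.

```python
from datetime import date, timedelta
from math import cos, pi
from statistics import mean

def synthetic_temperature(day):
    """Made-up daily temperature (degC): annual cycle peaking in mid-July plus a slow trend."""
    day_of_year = day.timetuple().tm_yday
    return 10.0 + 10.0 * cos(2 * pi * (day_of_year - 196) / 365.25) + 0.03 * (day.year - 1990)

# Reference epoch: 1990-01-01 up to (not including) 2020-01-01.
start, end = date(1990, 1, 1), date(2020, 1, 1)
days = [start + timedelta(days=n) for n in range((end - start).days)]

# Three different climatologies that all share the same reference epoch bounds.
ref_all = mean(synthetic_temperature(d) for d in days)
ref_sep = mean(synthetic_temperature(d) for d in days if d.month == 9)
ref_sep20 = mean(synthetic_temperature(d) for d in days if (d.month, d.day) == (9, 20))

observed = synthetic_temperature(date(2023, 9, 20))
for label, ref in [("30-year mean", ref_all),
                   ("30-year September mean", ref_sep),
                   ("30-year mean of 20th September", ref_sep20)]:
    print(f"{label}: anomaly = {observed - ref:+.2f} degC")
```

The three anomalies differ even though reference_epoch and its bounds would be identical in all three cases.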
If the above is correct (I'm not sure it is!) then to be clear about the climatology that was used to calculate the anomalies, we need the time bounds and the cell_methods of the climatology (in sense 1). We could store them as attributes of the anomaly data variable, but I think it would be neater to put them in a dummy variable, which stands for the climatology, but without the data. Then it could have all its metadata, which might be useful. For instance:
variables:
float delta_tas(time);
delta_tas:standard_name="air_temperature_anomaly";
delta_tas:units="degC";
delta_tas:coordinates="reference_epoch";
delta_tas:cell_methods="time: maximum";
delta_tas:ancillary_variables="climatological_tas";
double time(time);
time:standard_name="time";
time:units="days since 2023-7-16";
time:bounds="time_bounds";
double time_bounds(time,two);
double reference_epoch;
reference_epoch:standard_name="reference_epoch";
reference_epoch:units="days since 1990-1-1";
reference_epoch:bounds="reference_epoch_bounds";
double reference_epoch_bounds(two);
float climatological_tas;
climatological_tas:standard_name="air_temperature";
climatological_tas:units="degC";
climatological_tas:cell_methods="climatology_time: mean within years climatology_time: mean over years";
double climatology_time(climatology_time);
climatology_time:climatology="climatology_time_bounds";
climatology_time:units="days since 1990-1-1";
double climatology_time_bounds(climatology_time,two);
data:
time_bounds=0,1, 1,2, 2,3, 3,4;
reference_epoch_bounds=0,10957;
climatology_time_bounds=0,10623, 31,10651, ...
In this example, I've kept the reference_epoch variable but it's renamed. I've introduced climatological_tas as an ancillary variable of delta_tas. It is a dummy climatology variable which has the cell_methods of the climatology but no data. It stands for a data variable which actually had dimensions, something like climatology_tas(climatology_time,latitude,longitude). The time axis of the climatology is climatology_time, which is identified by being named in the cell_methods attribute. Its bounds climatology_time_bounds are climatological (in sense 2). (0,10623) is 1st January 1990 to 1st February 2019, (31,10651) is 1st February 1990 to 1st March 2019, etc. From this we know that the daily maximum temperature anomalies have been calculated wrt monthly mean climatological temperature.
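For anyone wanting to check or extend the bounds in this example, the following Python sketch regenerates the monthly climatological bounds for 1990-2019 in "days since 1990-1-1". It assumes the standard calendar; the function name monthly_climatology_bounds is hypothetical.

```python
from datetime import date

origin = date(1990, 1, 1)          # climatology_time:units = "days since 1990-1-1"
first_year, last_year = 1990, 2019

def days_since_origin(d):
    return (d - origin).days

def monthly_climatology_bounds(month):
    """Start of the month in the first year to the end of that month in the last year."""
    lower = date(first_year, month, 1)
    if month == 12:
        upper = date(last_year + 1, 1, 1)
    else:
        upper = date(last_year, month + 1, 1)
    return days_since_origin(lower), days_since_origin(upper)

bounds = [monthly_climatology_bounds(m) for m in range(1, 13)]
print(bounds[0], bounds[1])  # (0, 10623) (31, 10651)
print(bounds[11])            # (334, 10957)
```

The first two pairs match the values quoted above, and the upper bound for December coincides with the upper bound of reference_epoch_bounds.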
Does this make sense and would it be sufficient? It's just a suggestion.
Best wishes
Jonathan
I hate to muddy the waters, but the more I think about this problem, the messier it gets.
I started working on an example to move cf-convention/vocabularies#27 forward (before @JonathanGregory beat me to it above) and realized that there are a number of different cases we need to consider, and I'm not sure whether or not the proposed solution can handle all of them. (Plus, as @TomLav points out, there's a bit of a disconnect between how we store the anomaly / derived value and the reference it's relative to.) So maybe it will be useful to spell them out, so that we can at least be clear on the cases we need to address. I have actual use cases for all of them, so this is not a purely theoretical exercise.
You can calculate statistics relative to:
Okay, so far so good. Cases 1, 2, and 4 are what the climatology attribute is meant to handle.
I think Case 5 is also covered, although it requires quite a lot of metadata to record something that can be expressed very simply as "±15 days". (A 2xN array the full length of the original time coordinate.)
Cases 3 and 6 point out the ugly wrinkle, which is that the operations used to construct the 'climatology' may not all be the same and can be compounded. If I want to look at the change from past to future in the standard deviation of the monthly average daily maximum temperature, which is not a tremendously exotic quantity to consider, that's an anomaly using 2 reference periods that have been aggregated in 3 different ways at 3 different frequencies. We have examples doing complex aggregations with the existing machinery in section 7.4 (though some of them are kind of hard to understand), but they're only ever for 2 aggregations.
Can we handle 3 or more aggregations, plus an anomaly-type operation where you're calculating against two different periods, and still retain all the relevant info?
Upon further thought, I think this can almost be handled by the existing machinery, but that it needs one or two additional pieces.
Consider a quantity like the standard deviation of monthly average daily maximum temperatures. I think we can clearly record that in a way that's self-contained and easy to interpret (by both humans and computers) if we add the frequency/spacing that we're aggregating to. In other words, in addition to the aggregation method, cell_methods should also include the scale or periodicity of the result.
Currently, if you encounter time: max, the only way you can tell whether it's a daily or a monthly maximum is by calculating the spacing of the time axis. If we add the target interval, we could record the example as something along the lines of:
cell_methods = "time: max day; time: mean month; time: stdev over 30 years"
(That syntax could also be straightforwardly extended to include moving windows, like: time: mean 31-day window, and likewise with aggregations like quantiles.)
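For comparison, this is roughly what a reader has to do today to find out the interval behind a bare time: max, namely inspect the cell widths of the time bounds. It is only a sketch in Python: the thresholds and labels are assumptions, and the bounds are taken from the earlier delta_tas example (units of days).

```python
def cell_widths(bounds):
    """Width of each cell, in the units of the time coordinate."""
    return [upper - lower for lower, upper in bounds]

time_bounds = [(0, 1), (1, 2), (2, 3), (3, 4)]  # days
widths = cell_widths(time_bounds)

if all(w == 1 for w in widths):
    interval = "daily"
elif all(28 <= w <= 31 for w in widths):
    interval = "monthly"
else:
    interval = "unknown"

print(interval)  # daily
```

Recording the target interval directly in cell_methods, as suggested above, would make this kind of guesswork unnecessary.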
However, there's still one piece missing, which is combining cell_methods with climatology, because they are recorded separately, so you don't have the order of application. For example: suppose you had a climatology where the last element of cell_methods was model: stdev. Is that the standard deviation across models of the climatology, or the climatology of the standard deviation across models?
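The order matters numerically, not just semantically. Here is a toy Python illustration with made-up numbers (two models, three years), using the population standard deviation from the standard library; the variable names are for illustration only.

```python
from statistics import mean, pstdev

# Made-up values, one list per model, one entry per year.
values = {
    "model_a": [1.0, 2.0, 3.0],
    "model_b": [3.0, 2.0, 1.0],
}
n_years = 3

# (a) Climatology first (time mean per model), then stdev across models.
climatology_per_model = [mean(series) for series in values.values()]
stdev_of_climatology = pstdev(climatology_per_model)

# (b) Stdev across models for each year, then climatology (time mean).
stdev_per_year = [pstdev([series[y] for series in values.values()])
                  for y in range(n_years)]
climatology_of_stdev = mean(stdev_per_year)

print(stdev_of_climatology, climatology_of_stdev)
```

Here (a) gives exactly 0.0, because both models have the same time mean, while (b) gives about 0.67, so the metadata has to record which order was used.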
Would it make sense to consider the possibility of uniting climatology and cell_methods under a single umbrella, maybe something like aggregation? Does what I've suggested here actually cover everything, or am I missing something?
Hello,
I would like to re-open this issue / question as I do not see that we concluded, and I do need an answer for some datasets we are preparing. I will first try to summarize what I understand the status is, then propose a way forward and ask questions.
The thread started with @sethmcg asking how to store the reference period an anomaly value was computed against. @JonathanGregory answered that the current mechanism in CF is to use the reference_epoch standard name. He then noted that the current mechanism does not allow recording what the climatology was (how the climatology has been computed), only the period over which the climatology has been computed. This answer seemed to cover @sethmcg's use case, although he notes that other more complex cases (e.g. bias correction) might not be well covered. Seth then took part of his issue/request (i.e. how to specify the reference period for calculating KBDI) to cf-convention/vocabularies#27, which discusses the reference_epoch solution. @TomLav then asked Jonathan if the problem with the reference_epoch solution is that it does not allow recording the cell_methods of the climatology (the cell_methods is a key aspect of defining Climatological Statistics). Thomas noted that he was surprised that the way to store an anomaly (standard_name x_anomaly + reference_epoch) and the way to store climatologies (attribute :climatology) have so little in common in the CF world. Jonathan's answer was two-fold: 1) he noted that "climatology" could have two meanings in CF (the main distinction being the content of their cell_methods), 2) he wrote down an example of a possible way to extend the convention where a dummy climatology variable could be used to store all relevant information about the climatology: not only the climatology period but also the method used to compute the climatology value, noting that this was only a suggestion. Thomas did not follow up (sorry!). Seth then posted two comments asking (and partly answering) whether a whole new concept (of aggregation) would be needed to encompass both the current uses of climatology/anomaly and other cases that are related (e.g. bias correction and calibration). The thread then stalled.
This thread identified the issue at hand: the current solution (reference_epoch) for recording the climatology against which an anomaly variable is computed is not fully satisfactory, because it does not allow recording what the climatology was (how it has been computed).
Think of a 20th September maximum temperature anomaly wrt a reference_epoch of 1990-2019: is it a difference from the 30-year (yearly) mean, from the 30-year September mean, from the 30-year mean of 20th September, or even from the 20th September maximum daily temperature over 30 years, etc.?
Although I do have questions, I think the solution suggested by Jonathan of introducing a dummy climatology variable is an elegant one, particularly because it brings the definitions of climatology and anomaly closer together in the CF world, and reuses the concept of dummy variables (e.g. thinking of grid_mapping). I would like to explore this more, and see if we can develop it to the point where it is documented in future versions of the convention, e.g. as an addition to 7.4 Climatological Statistics.
My approach would be to start from file structures where the anomaly variable and the climatology variable are both present (in the same file). Once we decide how to link the anomaly variable to the corresponding climatology variable, we would think about how to "empty" the climatology variable to only keep its definition (and turn it into a dummy climatology variable). As far as I am aware, even the starting situation (the anomaly and climatology variables both in the file and the anomaly pointing to the climatology variable) is not described in the convention today.
I admit I do not understand the implications of your final posts in the thread above, @sethmcg. It seems to me that they could lead to a much larger overhaul of 7.4 and cell methods, but correct me if I am wrong. I have no idea how much effort it would take to solve all the cases you raise in your two final posts. I would however be interested in first fixing the issue identified above (a fix for the reference_epoch solution), rather than covering all aspects at once.
Thanks for the excellent summary, @TomLav. I'd like to comment, but will wait until we decide where. I vote for starting a new thread in "Discussions", seeded with this summary (and linking back to here). But I don't mind.
Thanks @davidhassell. I like the idea of starting / continuing in a Discussion. Others: any other views or answers to my questions above?
Dear @TomLav
I'm also grateful for your clear and comprehensive summary. I'm sorry I didn't answer @sethmcg's two most recent contributions, because of lack of time to think about it. I agree that it is of interest to devise a solution for definite use-cases that you have, and that we don't currently have a way to distinguish the cases you describe.
I agree with @davidhassell that it would be sensible to continue this in a new Discussion.
Best wishes
Jonathan
This issue is considered to have been converted into a Discussion (although not reflected by GitHub as a new standalone Discussion post was created for the purpose). Therefore this issue will now be closed, as conversation is continuing on at the following link: https://github.com/orgs/cf-convention/discussions/305. The issue history and contributions will still be visible.
There are some variables, like air_temperature_anomaly and keetch_byram_drought_index, that use a climatological variable (like monthly average air temperature or average annual precip) as an input.
The climatology itself has its reference period recorded in a bounds variable referenced by the climatology attribute, but that information gets lost when you calculate these derived variables.
Does CF have a mechanism for recording these climatological reference periods associated with non-climatological variables? And if not, do we need one?
I think the answers are "no" and "yes". Can anyone think of a way to do it within the existing rules? And if not, are there other use cases to consider?
The only thing I've been able to come up with is that if you had an anomaly-type variable with both bounds and climatology attributes, you could assume that bounds applies to the anomaly variable and climatology to its input, but that's confusing, and I think CF forbids having both attributes on a single variable anyway.
Thoughts?