cf-convention / discuss

A forum for proposing standard names; and any discussion about interpretation, clarification, and proposals for changes or extensions to the CF conventions.

Reference periods for variables derived from climatology #252

Open sethmcg opened 1 year ago

sethmcg commented 1 year ago

There are some variables, like air_temperature_anomaly and keetch_byram_drought_index that use a climatological variable (like monthly average air temperature or average annual precip) as an input.

The climatology itself has its reference period recorded in a bounds variable referenced by the climatology attribute, but that information gets lost when you calculate these derived variables.

Does CF have a mechanism for recording these climatological reference periods associated with non-climatological variables? And if not, do we need one?

I think the answers are "no" and "yes". Can anyone think of a way to do it within the existing rules? And if not, are there other use cases to consider?

The only thing I've been able to come up with is that if you had an anomaly-type variable with both bounds and climatology attributes, you could assume that bounds applies to the anomaly variable and climatology to its input, but that's confusing, and I think CF forbids having both attributes on a single variable anyway.

Thoughts?

davidhassell commented 1 year ago

No thoughts as yet, but it reminded me of #64 and #82, which I have only skimmed to remind me that they were talking about similar themes.

JonathanGregory commented 1 year ago

Dear @davidhassell and @sethmcg

David, thanks for reminding us about issue 64, which does seem to be the answer to the question - would you agree, @sethmcg, or have I misunderstood? In fact, issue 82 mentions air_temperature_anomaly as being one of the quantities which might use reference_epoch. The suggestion is to record the bounds of the climatology as the bounds of a size-1 or scalar coordinate variable with standard name of reference_epoch, which is now in the standard table. Its description is

The period of time over which a parameter has been summarised (usually by averaging) in order to provide a reference (baseline) against which data has been compared. When a coordinate, scalar coordinate, or auxiliary coordinate variable with this standard name has bounds, then the bounds specify the beginning and end of the time period over which the reference was determined. If the reference represents an instant in time, rather than a period, then bounds may be omitted. It is not the time for which the actual measurements are valid; the standard name of time should be used for that.

For example

variables:
  float delta_tas(time);
    delta_tas:standard_name="air_temperature_anomaly";
    delta_tas:units="degC";
    delta_tas:coordinates="climatology";
    delta_tas:cell_methods="time: maximum";
  double time(time);
    time:standard_name="time";
    time:units="days since 2023-7-16";
    time:bounds="time_bounds";
  double time_bounds(time,two);
  double climatology;
    climatology:standard_name="reference_epoch";
    climatology:units="days since 1990-1-1";
    climatology:bounds="climatology_bounds";
  double climatology_bounds(two);
data:
  time_bounds=0,1, 1,2, 2,3, 3,4;
  climatology_bounds=0,10957;

indicates that delta_tas contains daily maximum temperatures for 16th-19th July 2023 expressed as anomalies with respect to the climatology of 1990-2019. Is that what we need?
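
As a quick check of the arithmetic in the example above, the following Python sketch decodes both sets of bounds using only the standard library. The helper name `decode_days` is hypothetical, and the sketch assumes the default (standard/Gregorian) calendar; a real workflow would normally rely on a CF-aware library instead.

```python
from datetime import date, timedelta

def decode_days(offset, reference):
    """Convert a 'days since <reference>' offset to a calendar date."""
    return reference + timedelta(days=offset)

# Values copied from the CDL example above.
time_ref = date(2023, 7, 16)   # time:units = "days since 2023-7-16"
clim_ref = date(1990, 1, 1)    # climatology:units = "days since 1990-1-1"
time_bounds = [(0, 1), (1, 2), (2, 3), (3, 4)]
climatology_bounds = (0, 10957)

# Decode each daily cell and the single reference-epoch cell.
cells = [(decode_days(a, time_ref), decode_days(b, time_ref)) for a, b in time_bounds]
epoch = tuple(decode_days(v, clim_ref) for v in climatology_bounds)

print("first cell:", cells[0])
print("last cell: ", cells[-1])
print("reference epoch:", epoch)
```

The four daily cells run from 16 July to the start of 20 July 2023, and the reference epoch runs from 1 January 1990 to 1 January 2020, i.e. the thirty years 1990-2019.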

It occurs to me that it's not recorded what sort of climatology this was: is it annual means, monthly means, monthly maxima, maxima for the day of the year? This would be apparent from the cell_methods of the climatology, which has not been preserved.

Best wishes

Jonathan

sethmcg commented 1 year ago

Ah, perfect! Yes, that's exactly what I was looking for. Thanks much @davidhassell for the pointer to those issues and @JonathanGregory for the excellent example usage!

I note that #82 is still open, so I could go jump in there and add a couple more standard names for consideration to have that reference epoch sentence added to their description. That might also jumpstart the conversation and move the issue forward.

For the specific use case that prompted this issue (reference period for calculating KBDI), I can follow Jonathan's example and we're all set. (And that applies regardless of whether and when we update the standard_name description for KBDI to mention reference_epoch.)

However, I have thought of two use cases for other data that I work with that have me wondering whether we need to add something to the Conventions about reference epochs in general, and not just have it mentioned in the description of certain standard names.

First: bias-correction. When you apply a bias-correction to model output, you're doing some kind of transformation (maybe simple, maybe hideously complex) to make your model dataset look (statistically) more like your observational dataset. That observational dataset has some finite temporal coverage, and reference_epoch seems like just the thing to capture that information. However, this could be applied to any variable, not just the anomaly variables discussed in #82. So that seems to warrant some mention somewhere outside the Standard Name Table. This is also a case where you might have a reference_epoch that was an array of values (e.g., if you've done the bias-correction on a monthly basis), which is something @martinjuckes mentions in #82, and which probably deserves more fleshing-out than the description of reference_epoch in the SNT.

Second -- and I've actually been doing this recently -- if you're looking at something like the change in average zg500 anomaly on rainy days, you want to track two reference periods: one for the past and one for the future. So do you need two such variables with different names? Do you need to add a categorical dimension (past, future) to the variable? Something else entirely? In any case, I think it warrants some consideration.

What's the best way to proceed from here? Close this ticket and bring these issues up in #82? Open a new ticket with a better name? Keep the discussion here going and cross-reference it over there?

larsbarring commented 12 months ago

Hi @sethmcg, I think it would be most welcome if you were able to update #82 and move it forward. And I agree that @martinjuckes's comment on having multiple reference periods is a relevant one.

However, for your second use case, do you really need two reference periods? Wouldn't it be enough to make the first period the reference_period, and then use the bounds of the anomaly variable to record the extent of the second?

Regarding your last question, would it be possible to split your ideas and suggestions so that what is directly relevant for #82 goes there, and then we can continue the broader conversation here?

Regarding your first use case (on bias adjustment), I have been thinking about this for some time now and the reference period (reference_epoch) is one component, but there are other equally important ones, like what dataset is used as reference.

JonathanGregory commented 12 months ago

I've added the FAQ label to this issue to mark it for inclusion in the FAQ when we have finished answering it.

sethmcg commented 12 months ago

@larsbarring I like your suggestion to move the issues of the standard names to #82 and continue the general discussion here. I will do that.

You're right that for the second use case, a reference_period combined with climatology bounds on the anomaly variable captures the two periods, so I think we can regard that one as covered. Thanks!

On the more complex topic of bias-correction, one point we could start with is whether the different components can be separated, or whether a solution should encompass all of them. I think they can and should be separated, so that e.g., we address the issue of the reference_epoch independently of how we address the issue of referring to the reference dataset, which is in turn addressed separately from the question of how data was divided up by month/season/moving window. That allows you to use only the necessary elements and omit those that don't apply (and may not even make sense), and also makes it easy to extend the convention in the future as new requirements come on the scene. What that implies, though, is that we need some way to tie all of those elements together and identify them as capturing information about the bias-correction (or whatever other higher-level processing applies). The first solution that springs to mind is something like grid_mapping, where you have a dummy variable that you attach all the attributes to but that has no actual data values, and that is referred to by an attribute on the main data variable. Would it make sense to try and extend that pattern? Thoughts?

@JonathanGregory Thanks. The FAQ does seem like the right place to put that explanation. Is there anything in particular we should make sure to address to generate a good Frequently Provided Answer?

JonathanGregory commented 12 months ago

Dear @sethmcg

> The FAQ does seem like the right place to put that explanation. Is there anything in particular we should make sure to address to generate a good Frequently Provided Answer?

No, there are no guidelines for this. Anything which you, as a questioner, might find useful in the answer! We haven't been adding to the FAQ, and I feel that it would be useful to do so.

Best wishes and thanks

Jonathan

TomLav commented 11 months ago

Dear @JonathanGregory ,

In your answer Jul 19th you note:

> It occurs to me that it's not recorded what sort of climatology this was: is it annual means, monthly means, monthly maxima, maxima for the day of the year? This would be apparent from the cell_methods of the climatology, which has not been preserved.

Isn't the issue that the reference_epoch variable is tasked with recording the time period, and not how the reference value (aka climatology) was computed?

If we look at the way CF allows climatologies to be stored, we see that the :cell_methods attribute describing how the climatology was computed is on the variable storing the climatological value, while the time period over which the climatology is computed is encoded in the time variable with the attribute :climatology.

Should the variable holding the anomaly also store how the climatology it refers to was prepared? For example a reference_cell_methods, in addition to the reference_epoch?

At a more general level, I am surprised that the way to store an anomaly (standard_name x_anomaly + reference_epoch) and the way to store climatologies (attribute :climatology) have so little in common. Anomalies and climatologies are not the same thing, but their descriptions in the CF world could probably be more streamlined.

Best wishes, Thomas

JonathanGregory commented 10 months ago

Dear Thomas @TomLav

Thanks for your comment. It might be helpful to distinguish two meanings of "climatology". (1) The reference data variable that was used to compute the anomalies stored in another data variable. (2) A data variable which has a climatological time dimension.

The climatology attribute (instead of the bounds attribute) is used to name the bounds variable of a climatological time variable, in sense (2). It should be used if and only if the cell_methods entry for that dimension shows that it is climatological time, as described in Sect 7.4, i.e. cell_methods has within and over in it. This is to describe statistics for variation within the diurnal or the annual cycle, over many repetitions of the cycles.
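
The "within and over" test described above can be sketched in a few lines of Python. This is a simplified illustration, not a full cell_methods parser: the function name `is_climatological` is hypothetical, and it assumes each entry has the plain form `name: method qualifier ...` with no parenthesised comments and only one dimension name per entry.

```python
def is_climatological(cell_methods, axis="time"):
    """Return True if `axis` has both a 'within' and an 'over' entry."""
    tokens = cell_methods.split()
    found = set()
    for i, token in enumerate(tokens):
        # An entry looks like "<axis>: <method> within|over <unit>".
        if token == f"{axis}:" and i + 2 < len(tokens):
            if tokens[i + 2] in ("within", "over"):
                found.add(tokens[i + 2])
    return {"within", "over"} <= found

print(is_climatological("time: mean within years time: mean over years"))
print(is_climatological("time: mean"))
```

Under these assumptions the first call reports climatological time and the second does not, which matches the distinction between sense (2) and an ordinary time mean.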

The climatology in sense (1) doesn't have to be a climatology in sense (2), but it can be. For instance, you could calculate anomalies for monthly means wrt a 30-year mean or wrt 30-year monthly means. The latter is a climatology in sense (2) and the former is not.

The solution we've discussed in this issue proposes to use the reference_epoch standard name for a scalar time dimension, to record the time bounds of a climatology in sense (1). That's useful information, but it isn't necessarily enough. For example, if you have a 20th September anomaly wrt a reference_epoch of 1990-2019, does that mean it's a difference from the 30-year mean, from the 30-year September mean, or from the 30-year mean of 20th September? In the cell_methods of the reference climatology, the first one is not climatological time in sense (2). The cell_methods would have just time: mean for the 30-year mean. The other two cases are both time: mean within years time: mean over years and the reference epoch time bounds are the same for both, namely midnight on 1st January 1990 and midnight on 1st January 2020.
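
To make the ambiguity concrete, here is a minimal Python sketch that computes the three candidate reference values from one daily series covering the same 1990-2019 epoch. The temperature series is entirely synthetic (a hypothetical smooth annual cycle invented for illustration), so only the fact that the three references differ is meaningful, not the values themselves.

```python
import math
from datetime import date, timedelta
from statistics import mean

def synthetic_temperature(day):
    """Hypothetical daily temperature in degC: a smooth annual cycle."""
    doy = day.timetuple().tm_yday
    return 10.0 + 10.0 * math.cos(2.0 * math.pi * (doy - 196) / 365.25)

# The reference_epoch bounds discussed above: 1990-01-01 to 2020-01-01.
start, end = date(1990, 1, 1), date(2020, 1, 1)
days = [start + timedelta(days=n) for n in range((end - start).days)]
temps = {d: synthetic_temperature(d) for d in days}

# Three different climatologies that share the same reference epoch.
ref_all = mean(temps.values())                                # 30-year mean
ref_sep = mean(t for d, t in temps.items() if d.month == 9)   # 30-year September mean
ref_20sep = mean(t for d, t in temps.items() if (d.month, d.day) == (9, 20))

target = synthetic_temperature(date(2023, 9, 20))
for label, ref in [("30-year mean", ref_all),
                   ("30-year September mean", ref_sep),
                   ("30-year mean of 20th September", ref_20sep)]:
    print(f"anomaly wrt {label}: {target - ref:+.2f} degC")
```

The September mean is pooled over all September days, which equals the mean of the yearly September means because every September has 30 days. All three anomalies carry identical reference_epoch bounds, which is why the bounds alone cannot tell them apart.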

If the above is correct (I'm not sure it is!) then to be clear about the climatology that was used to calculate the anomalies, we need the time bounds and the cell_methods of the climatology (in sense 1). We could store them as attributes of the anomaly data variable, but I think it would be neater to put them in a dummy variable, which stands for the climatology, but without the data. Then it could have all its metadata, which might be useful. For instance:

variables:
  float delta_tas(time);
    delta_tas:standard_name="air_temperature_anomaly";
    delta_tas:units="degC";
    delta_tas:coordinates="reference_epoch";
    delta_tas:cell_methods="time: maximum";
    delta_tas:ancillary_variables="climatological_tas";
  double time(time);
    time:standard_name="time";
    time:units="days since 2023-7-16";
    time:bounds="time_bounds";
  double time_bounds(time,two);
  double reference_epoch;
    reference_epoch:standard_name="reference_epoch";
    reference_epoch:units="days since 1990-1-1";
    reference_epoch:bounds="reference_epoch_bounds";
  double reference_epoch_bounds(two);
  float climatological_tas;
    climatological_tas:standard_name="air_temperature";
    climatological_tas:units="degC";
    climatological_tas:cell_methods="climatology_time: mean within years climatology_time: mean over years";
  double climatology_time(climatology_time);
    climatology_time:climatology="climatology_time_bounds";
    climatology_time:units="days since 1990-1-1";
  double climatology_time_bounds(climatology_time,two);
data:
  time_bounds=0,1, 1,2, 2,3, 3,4;
  reference_epoch_bounds=0,10957;
  climatology_time_bounds=0,10623, 31,10651, ...

In this example, I've kept the reference_epoch variable but it's renamed. I've introduced climatological_tas as an ancillary variable of delta_tas. It is a dummy climatology variable which has the cell_methods of the climatology but no data. It stands for a data variable which actually had dimensions, something like climatology_tas(climatology_time,latitude,longitude). The time axis of the climatology is climatology_time, which is identified by being named in the cell_methods attribute. Its bounds climatology_time_bounds are climatological (in sense 2). (0,10623) is 1st January 1990 to 1st February 2019, (31,10651) is 1st February 1990 to 1st March 2019, etc. From this we know that the daily maximum temperature anomalies have been calculated wrt monthly mean climatological temperature.

Does this make sense and would it be sufficient? It's just a suggestion.
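
The two pairs of climatological bounds quoted above can be reproduced with a short Python sketch. The function name `monthly_climatology_bounds` is hypothetical, and the sketch assumes each cell runs from the start of a month in the first year to the end of that month in the last year, with offsets in days since 1990-1-1 as in the example. Only the first two pairs come from the thread; the remaining ten are what this assumption implies.

```python
from datetime import date

def monthly_climatology_bounds(first_year, last_year, reference):
    """Return 12 (lower, upper) offsets in days since `reference`."""
    bounds = []
    for month in range(1, 13):
        lower = date(first_year, month, 1)
        # The end of a month is the first day of the following month.
        if month == 12:
            upper = date(last_year + 1, 1, 1)
        else:
            upper = date(last_year, month + 1, 1)
        bounds.append(((lower - reference).days, (upper - reference).days))
    return bounds

bounds = monthly_climatology_bounds(1990, 2019, date(1990, 1, 1))
print(bounds[:2])
```

The first two cells agree with the values given in the example, and the December cell ends at day 10957, the upper bound of the reference epoch.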

Best wishes

Jonathan

sethmcg commented 10 months ago

I hate to muddy the waters, but the more I think about this problem, the messier it gets.

I started working on an example to move #82 forward (before @JonathanGregory beat me to it above) and realized that there are a number of different cases we need to consider, and I'm not sure whether or not the proposed solution can handle all of them. (Plus, as @TomLav points out, there's a bit of a disconnect between how we store the anomaly / derived value and the reference it's relative to.) So maybe it will be useful to spell them out, so that we can at least be clear on the cases we need to address. I have actual use cases for all of them, so this is not a purely theoretical exercise.

You can calculate statistics relative to:

  1. a single global reference value. Example: global average temperature anomaly relative to the 1951-1980 global average.
  2. an "overall" climatology. This is like (1), but averaging only over time, not space: in each gridcell, you average over all values in the reference period. Example: decomposing data into climatology + anomaly.
  3. an annual climatology. Like (2), but summing (or applying some other aggregation to) all values in each year, and then averaging over all years in the reference period. Example: KBDI, which uses annual average precip as an input.
  4. a monthly climatology. You average over each month, then average the months across years, so that you have a climatology for each month. Example: a number of bias-correction methods.
  5. a moving window. Like (4), but instead of going by month, you use e.g. a 31-day moving window where each day is adjusted by pooling the 15 values on either side of that day across multiple years.
  6. As above, but everything is with regard to the daily cycle instead of the annual cycle.

Okay, so far so good. Cases 1, 2, and 4 are what the climatology attribute is meant to handle.

I think Case 5 is also covered, although it requires quite a lot of metadata to record something that can be expressed very simply as "±15 days". (A 2xN array the full length of the original time coordinate.)

Cases 3 and 6 point out the ugly wrinkle, which is that the operations used to construct the 'climatology' may not all be the same and can be compounded. If I want to look at the change from past to future in the standard deviation of the monthly average daily maximum temperature, which is not a tremendously exotic quantity to consider, that's an anomaly using 2 reference periods that have been aggregated in 3 different ways at 3 different frequencies. We have examples doing complex aggregations with the existing machinery in section 7.4 (though some of them are kind of hard to understand), but they're only ever for 2 aggregations.

Can we handle 3 or more aggregations, plus an anomaly-type operation where you're calculating against two different periods, and still retain all the relevant info?

sethmcg commented 9 months ago

Upon further thought, I think this can almost be handled by the existing machinery, but that it needs one or two additional pieces.

Consider a quantity like the standard deviation of monthly average daily maximum temperatures. I think we can clearly record that in a way that's self-contained and easy to interpret (by both humans and computers) if we add the frequency/spacing that we're aggregating to. In other words, in addition to the aggregation method, cell_methods should also include the scale or periodicity of the result.

Currently, if you encounter time: max, the only way you can tell whether it's a daily or a monthly maximum is by calculating the spacing of the time axis. If we add the target interval, we could record the example as something along the lines of: cell_methods = "time: max day; time: mean month; time: stdev over 30 years"

(That syntax could also be straightforwardly extended to include moving windows, like: time: mean 31-day window, and likewise with aggregations like quantiles.)
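
To show that the syntax floated above would be straightforward to read by machine, here is a small Python sketch that splits it into ordered steps. Note that this syntax is only the suggestion made in this comment and is not part of the CF conventions; the function name `parse_proposed_cell_methods` and the dictionary keys are hypothetical.

```python
def parse_proposed_cell_methods(text):
    """Split 'axis: method interval; ...' into an ordered list of steps."""
    steps = []
    for entry in text.split(";"):
        axis, _, rest = entry.strip().partition(":")
        method, _, interval = rest.strip().partition(" ")
        steps.append({"axis": axis, "method": method, "interval": interval.strip()})
    return steps

example = "time: max day; time: mean month; time: stdev over 30 years"
for step in parse_proposed_cell_methods(example):
    print(step)
```

Each step keeps its aggregation method together with the target interval, so the order of the three aggregations is preserved in the parsed list.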

However, there's still one piece missing, which is combining cell_methods with climatology, because they are recorded separately, so you don't have the order of application. For example: suppose you had a climatology where the last element of cell_methods was model: stdev. Is that the standard deviation across models of the climatology, or the climatology of the standard deviation across models?
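
The order of application matters numerically, as a toy Python example shows. The three model series below are made-up values chosen purely for illustration, and the population standard deviation (`pstdev`) is an arbitrary choice for this sketch.

```python
from statistics import mean, pstdev

# Hypothetical annual values (degC) for three models over four years.
models = {
    "model_a": [10.0, 11.0, 12.0, 13.0],
    "model_b": [12.0, 12.0, 12.0, 12.0],
    "model_c": [14.0, 13.0, 12.0, 11.0],
}

# Option 1: climatology of each model first, then stdev across models.
climatologies = [mean(series) for series in models.values()]
stdev_of_climatology = pstdev(climatologies)

# Option 2: stdev across models in each year, then climatology of that spread.
yearly_spread = [pstdev(values) for values in zip(*models.values())]
climatology_of_stdev = mean(yearly_spread)

print(f"stdev across models of the climatology: {stdev_of_climatology:.3f}")
print(f"climatology of the stdev across models: {climatology_of_stdev:.3f}")
```

With these numbers the second quantity is about twice the first, so metadata that does not record the order leaves the result ambiguous.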

Would it make sense to consider the possibility of uniting climatology and cell_methods under a single umbrella, maybe something like aggregation? Does what I've suggested here actually cover everything, or am I missing something?

TomLav commented 3 months ago

Hello,

I would like to re-open this issue / question as I do not see that we concluded, and I do need an answer for some datasets we are preparing. I will first try to summarize what I understand the status is, then propose a way forward and ask questions.

(Partial) summary

The thread started with @sethmcg asking how to store the reference period against which an anomaly value was computed. @JonathanGregory answered that the current mechanism in CF is to use the reference_epoch standard name. He then noted that this mechanism does not allow recording what the climatology was (how it was computed), only the period over which it was computed. This answer seemed to cover @sethmcg's use case, although he noted that other, more complex cases (e.g. bias correction) might not be well covered. Seth then took part of his request (i.e. how to specify the reference period for calculating KBDI) to #82, which discusses the reference_epoch solution.

@TomLav then asked Jonathan whether the problem with the reference_epoch solution is that it does not allow recording the cell_methods of the climatology (the cell_methods is a key aspect of defining Climatological Statistics). Thomas noted that he was surprised that the way to store an anomaly (standard_name x_anomaly + reference_epoch) and the way to store climatologies (attribute :climatology) have so little in common in the CF world.

Jonathan's answer was two-fold: (1) he noted that "climatology" could have two meanings in CF (the main distinction being the content of their cell_methods); (2) he wrote down an example of a possible way to extend the convention, where a dummy climatology variable could be used to store all relevant information about the climatology: not only the climatology period but also the method used to compute the climatology value, noting that this was only a suggestion. Thomas did not follow up (sorry!).

Seth then posted two comments asking (and partly answering) whether a whole new concept (of aggregation) would be needed to encompass both the current uses of climatology/anomaly and other related cases (e.g. bias correction and calibration). The thread then stalled.

(Suggested) way forward for this thread

This thread identified (the issue at hand) that the current solution (reference_epoch) for recording the climatology against which an anomaly variable is computed is not fully satisfactory, because it does not allow recording what the climatology was (how it was computed).

Think of a 20th September maximum temperature anomaly wrt a reference_epoch of 1990-2019: is it a difference from the 30-year (yearly) mean, from the 30-year September mean, from the 30-year mean of 20th September, or even from a 20th September maximum daily temperature over 30 years, etc.?

Although I do have questions, I think the solution suggested by Jonathan of introducing a dummy climatology variable is an elegant one. Particularly because it brings closer the definition of climatology and anomaly in the CF-world, and reuses the concept of dummy variables (e.g. thinking of grid_mapping). I would like to explore this more, and see if we can develop it to the point where it is documented in future versions of the convention, e.g. as an addition to 7.4 Climatological Statistics.

My approach would be to start from file structures where the anomaly variable and the climatology variable are both present (in the same file). Once we decide how to link the anomaly variable to the corresponding climatology variable, we would think about how to "empty" the climatology variable to only keep its definition (and turn it into a dummy climatology variable). As far as I am aware, even the starting situation (the anomaly and climatology variables both in the file and the anomaly pointing to the climatology variable) is not described in the convention today.

I admit I do not understand the implications of your final posts in the thread above, @sethmcg. It seems to me that they could lead to a much larger overhaul of 7.4 and cell_methods, but correct me if I am wrong. I have no idea how much effort it would take to solve all the cases you raise in your two final posts. I would however be interested in first fixing the issue identified above (a fix for the reference_epoch solution), rather than covering all aspects at once.

Questions:

  1. Do we have a shared understanding of this thread so far (my partial summary)?
  2. Do we agree that we have identified a shortcoming in the way CF currently handles anomalies and their link to climatologies?
  3. Would it be of interest to try and propose a solution for that particular issue?
  4. Would it be best to continue the discussion in this thread, or start in another one?
  5. @sethmcg do you want to move your questions forward in parallel? Or do we first try and "fix" the first issue at hand?

davidhassell commented 3 months ago

Thanks for the excellent summary, @TomLav. I'd like to comment, but will wait until we decide where. I vote for starting a new thread in "Discussions", seeded with this summary (and linking back to here). But I don't mind.

TomLav commented 3 months ago

Thanks @davidhassell. I like the idea of starting / continuing in a Discussion. Others: any other views or answers to my questions above?

JonathanGregory commented 3 months ago

Dear @TomLav

I'm also grateful for your clear and comprehensive summary. I'm sorry I didn't answer @sethmcg's two most recent contributions, because of lack of time to think about it. I agree that it is of interest to devise a solution for definite use-cases that you have, and that we don't currently have a way to distinguish the cases you describe.

I agree with @davidhassell that it would be sensible to continue this in a new Discussion.

Best wishes

Jonathan

TomLav commented 3 months ago

I have now opened a Discussion where we can follow this up: #305.