cf-convention / cf-conventions

AsciiDoc Source
http://cfconventions.org/cf-conventions/cf-conventions
Creative Commons Zero v1.0 Universal

Cell methods: "within"|"over" "days"|"months" and time axis (Section 7.4) #197

Open larsbarring opened 5 years ago

larsbarring commented 5 years ago

In section 7.4 the use of cell method constructs such as "within years" and "over days" is explained in the context of a climatological time axis. From this I get the impression that these constructs are only allowed with a climatological time axis. But I guess that this is not the correct interpretation?

I am asking because I have come across numerous CMIP5/CMIP6 files of monthly tasmin/tasmax suggesting that the construct can also be used in connection with a 'normal' time axis. Here is an example from a CMIP6 file:

netcdf tasmax_Amon_...__201501-210012 {
    float tasmax(time, lat, lon) ;
        tasmax:long_name = "Daily Maximum Near-Surface Air Temperature" ;
        tasmax:units = "K" ;
        tasmax:cell_methods = "area: mean time: maximum within days time: mean over days" ;
        tasmax:standard_name = "air_temperature" ;

    double time(time) ;
        time:long_name = "time" ;
        time:units = "days since 1850-01-01 00:00:00" ;
        time:axis = "T" ;
        time:bounds = "time_bnds" ;
        time:standard_name = "time" ;
    double time_bnds(time, bnds) ;
        time_bnds:long_name = "time axis boundaries" ;
        time_bnds:units = "days since 1850-01-01 00:00:00" ;

I suggest it would be useful to clarify when/where the cell method constructs within | over days | years can be used.

Lars

JimBiardCics commented 5 years ago

@larsbarring I may be wrong, but I think that usage is wrong. I think the proper cell_methods should be "time: maximum". If this is a regular time sequence where the bounds for each time step are the beginning and end of each day, then there is no mean over days and the maximum is assumed to be within the bounds.

taylor13 commented 5 years ago

Since these are monthly files, I think what is being requested is the mean over all days of a month of the maximum temperature reached each day. I think the cell_methods is correct, but probably bounds should be replaced by climatology, and for the month of January 2010 and units of "days since 2010-01-01", the climatology bounds should be: 0.0, 31.0 (i.e. extending from the beginning of the first day of January 2010 to midnight at the end of the last day of January 2010).

Note that these climatology bounds are the same as the bounds of the month itself, so I'm not absolutely sure that the bounds attribute couldn't be used (rather than climatology).

Also note that if cell_methods were set to "time: maximum", the user would expect that the value recorded would be the absolute maximum occurring during the month (rather than the mean of daily maxima), so this would be incorrect.
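The difference between the two statistics can be checked on a few numbers. This is only a minimal sketch: the three-day record and its temperature values are invented for illustration and do not come from any CMIP file.

```python
# Hypothetical temperatures (K): three days, four samples per day.
samples = [
    [271.0, 275.0, 280.0, 274.0],  # day 1
    [270.0, 276.0, 284.0, 277.0],  # day 2
    [269.0, 273.0, 278.0, 272.0],  # day 3
]

# "time: maximum within days time: mean over days"
daily_maxima = [max(day) for day in samples]
mean_of_daily_maxima = sum(daily_maxima) / len(daily_maxima)

# "time: maximum" applied to the whole period
absolute_maximum = max(max(day) for day in samples)

print(mean_of_daily_maxima)  # mean of 280, 284 and 278
print(absolute_maximum)      # largest single sample in the period
```

The mean of the daily maxima is lower than the absolute maximum here, so labelling the former as "time: maximum" would misdescribe the data.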

larsbarring commented 5 years ago

Well, it seems that the conventions text is not quite as clear as it should be, as both @JimBiardCics and @taylor13 think the example is wrong (in different ways?).

The example was a random pick that I downloaded from ESGF CMIP6 (it was not a 'local produce'), so I am afraid that I will have a hard time retracing my steps to find exactly which one. However, as I wrote, I have come across many CMIP5/6 files having the same metadata construct. To me it seems -- but I could be wrong -- that the example shows the typical usage in CMIP5/6. So the question is, how should the CF text be interpreted, and is the typical CMIP5/6 use in line with that interpretation?

For me the interpretation of the example is not difficult in itself:

    tasmax:cell_methods = "area: mean time: maximum within days time: mean over days" ;

tells me that the data is the mean -- over some period of time given by the time bounds variable -- of the maximum within days (days having its usual interpretation of hours in the interval [00..24[, as nothing else is explained in a comment within parentheses).

Freely admitting that I am a newcomer to some of the more advanced aspects of the CF Conventions, I do not immediately see the need to have a climatology time axis in this context, besides that it seems to be required by the text in Chapter 7.4.

JimBiardCics commented 5 years ago

@taylor13 you are right! This data is, in fact, monthly average Tmax values (the CMIP "Amon" in the file name, which indicates atmospheric monthly data, is what gives that away), so the cell_methods would be right if the bounds variable was indicated with an attribute named climatology. I agree with @larsbarring that the language in the document is not clear about this, but I've been down this road in the past and I recall being assured by a CF master (@JonathanGregory, I believe) that the "within" and "over" cell_methods terms are not for use except in climatologies. It is clear from the document that climatological bounds are indicated using the climatology attribute.

In my opinion, the CF document suffers from quite a lot of imprecise language and lack of specificity for a standard. We keep running into "but that's what we meant" in a number of different areas.

There is a Trac ticket #82 that was intended to address this very question. It's still open and hasn't progressed for three years.

larsbarring commented 5 years ago

Thanks @JimBiardCics! In fact I had an email [list] conversation with @JonathanGregory a couple of years back on a related issue where Trac ticket #82 was mentioned. For my purpose back then I came up with what felt a bit like an ad hoc solution. Nevertheless there are a couple of things that keep nagging me regarding all this:

JonathanGregory commented 5 years ago

Dear Lars, Jim, Karl

I agree that this example shows that we didn't think quite carefully enough about whether or when climatology bounds are truly needed. If we have cell_methods = "time: maximum within days time: mean over days" with climatology bounds of 1850-1-1 00:00 to 1850-2-1 00:00 we mean (according to section 7.4) that "maximum within days" is applied for the time interval 00:00 to 00:00 within each day i.e. the entire day, and the values are meaned over all the days within the interval i.e. all the days in Jan 1850. If the bounds are 1850-1-1 00:00 to 1850-1-31 06:00, we mean that the maximum is calculated for each day within the interval 00:00-06:00, and the maxima are meaned over all days of the month. The rule (which I can't see stated in the text) is that the first climatology bound is the beginning of the first interval, and the second is the end of the last interval.

In the first case, the entire month is considered in calculating the statistic. As Karl said, the climatology bounds therefore mean the same thing as ordinary time bounds would do. In the second case, the two elements of the climatology bounds imply that the statistic is calculated from 31 noncontiguous time intervals viz. 00:00-06:00 on 1850-1-1, 00:00-06:00 on 1850-1-2, etc. Ordinary time bounds describe the beginning and end of a single continuous interval of time. Climatology bounds may describe a set of discontinuous intervals.
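The rule described above (first climatology bound starts the first interval, second bound ends the last one) can be sketched in code. This is an illustration only: the function name `expand_within_days` is hypothetical, and Python's proleptic Gregorian `datetime` is assumed to stand in for the file's calendar.

```python
from datetime import datetime, timedelta

ONE_DAY = timedelta(days=1)

def expand_within_days(first_bound, last_bound):
    """List the daily sub-intervals implied by "within days ... over days".

    Assumes one interval per day, with first_bound starting the first
    interval and last_bound ending the last one.
    """
    length = (last_bound - first_bound) % ONE_DAY
    if length == timedelta(0):
        length = ONE_DAY  # same time of day at both ends: whole days
    intervals = []
    start = first_bound
    while start + length <= last_bound:
        intervals.append((start, start + length))
        start += ONE_DAY
    return intervals

# Bounds covering the whole month: 31 contiguous whole-day intervals.
whole_days = expand_within_days(datetime(1850, 1, 1), datetime(1850, 2, 1))
# Bounds ending at 06:00 on the last day: 31 noncontiguous 6-hour intervals.
mornings = expand_within_days(datetime(1850, 1, 1), datetime(1850, 1, 31, 6))

print(len(whole_days), whole_days[0])
print(len(mornings), mornings[1])
```

In the first case the intervals tile the month, which is why ordinary bounds would say the same thing. In the second case gaps remain between the intervals, which ordinary bounds cannot express.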

The noncontiguous case is maybe more common for the annual cycle e.g. cell_methods = "time: maximum within years time: mean over years" for 1850-1-1 to 1859-2-1. This says to calculate the maximum within the entire interval of each January from 1850 to 1859, then calculate the mean of these ten values.
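The same expansion can be sketched for the annual cycle. The helper `expand_within_years` is a hypothetical name, and the sketch assumes that the interval does not cross a year boundary and that neither bound falls on 29 February.

```python
from datetime import datetime

def expand_within_years(first_bound, last_bound):
    """List the yearly sub-intervals implied by "within years ... over years".

    Assumes one interval per year, each running from the month/day/time of
    first_bound to the month/day/time of last_bound in the same year.
    """
    return [
        (first_bound.replace(year=year), last_bound.replace(year=year))
        for year in range(first_bound.year, last_bound.year + 1)
    ]

januaries = expand_within_years(datetime(1850, 1, 1), datetime(1859, 2, 1))
print(len(januaries))
print(januaries[0])
print(januaries[-1])
```

For the bounds 1850-1-1 to 1859-2-1 this yields ten separate January intervals, one per year, which is the set of intervals the maximum is taken within before the ten values are averaged.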

Although the climatological bounds do have a different sort of meaning from ordinary time bounds in the noncontiguous case, perhaps we don't really need to use a different attribute for them. The possibility that the bounds might refer to a set of discontinuous intervals is implied by the cell_methods. Maybe we should use the presence of within/over days/years in the cell_methods as the flag for climatological time, and use the ordinary bounds attribute for climatological time bounds, since it's clear enough, as Lars says. That simplification would be backwards-compatible if we continued to allow the climatology attribute for climatological time, perhaps deprecated (so that the CF checker gives a warning).

Best wishes

Jonathan

larsbarring commented 5 years ago

Dear Jonathan,

Thanks for this explanation. I agree that the continuous and non-overlapping, or non-continuous time axis is a key here. In your first example with a continuous time axis: cell_methods = "time: maximum within days time: mean over days" and bounds 1850-1-1 00:00:00 1850-2-1 00:00:00 1850-1-2 00:00:00 1850-3-1 00:00:00 ... I (still) do not see the necessity to require a climatology attribute for the bounds. The interpretation will be the same anyway: maximum within day and then these maximum values are averaged over the time period stated by the bounds. I think we agree on this.

In the second example the presence of the climatology attribute, or not, will make the distinction between the situation when a cell method is applied to a set of noncontiguous time periods and the situation when the same cell method is applied to a sequence of overlapping time intervals. cell_methods = "time: maximum within days time: mean over days" and bounds 1850-1-1 00:00:00 1850-31-1 06:00:00 1850-1-1 00:06:00 1850-31-1 12:00:00 1850-1-1 00:12:00 1850-31-1 18:00:00 1850-1-1 00:18:00 1850-2-1 00:00:00 ... With the climatology attribute present it is clear that the bounds specify a set of noncontiguous intervals. If the climatology attribute is not present, one might interpret the bounds as specifying a sequence of overlapping intervals. In this case the cell_methods becomes difficult (impossible?) to interpret in a meaningful way. Even though the difficulty/impossibility of a meaningful interpretation of the cell method might serve to indicate noncontiguous time intervals and thus specify "climatology time bounds" even without the climatology attribute, I would suggest that it is [much] more clear and straightforward to require the climatology attribute.

Hence I would like to suggest that i/ the climatology attribute is used/required when dealing with noncontiguous intervals to create a climatology (as in Jonathan's second example), as is the case in the current CF (the text might be improved upon to become more specific and clear), and ii/ that within... and between ... are allowed without requiring the climatology attribute (as in Jonathan's first example), which would be in line with the straightforward interpretation that already is widely used for CMIP data.

This would clarify and clean up the meaning of the climatology attribute as indicator that the time bounds specify non-contiguous time intervals.

Kind regards, Lars

JonathanGregory commented 5 years ago

Dear Lars

I agree with your analysis of the distinction and that the noncontiguous implied intervals look like they're overlapping if you don't interpret them as climatological. Nonetheless, I would go further than you. I think we can rely on the presence of "within" and "over" in the cell_methods to tell us it's climatological time, and we could thus get rid of the need for the climatology attribute. Do you or others see any pitfall in this simplification?

Best wishes

Jonathan

taylor13 commented 5 years ago

Dear all,

Yes, I think this is how we should have originally indicated climatologies (no need for a "climatology" attribute with the bounds). And it would appear that implementing it now as the preferred method could be made backward compatible, but any software that identifies climatological data by looking for the "climatology" attribute would have to be updated. I hope @davidhassel will comment on whether this would be a problem for the data model.

best, Karl

JimBiardCics commented 5 years ago

I fear that the last few comments have left me a bit confused. Would someone please summarize the current understanding?

taylor13 commented 5 years ago

What is being suggested, I think, is: 1) Rely on the cell_methods attribute to indicate whether the time axis represents a climatology or not: if cell_methods includes for the "time" dimension the words "within" and "over", then the time axis must be a climatological time axis; otherwise the time axis represents a normal time axis. 2) Deprecate use of the climatology attribute and use the bounds attribute to point to the variable containing either the normal bounds for the cell or the climatological bounds for a cell.

larsbarring commented 4 years ago

Well, actually what I was suggesting was to keep the climatology attribute to clearly signal that the bounds define a climatology computed over a set of noncontiguous time periods. I agree with Jonathan's comment that in the particular example it is unambiguous what the bounds mean iff within and over are used to signal climatology time bounds.

To me there are two reasons for keeping the climatology attribute, but disconnecting it from the use of within and over. First and foremost, climatological time bounds are conceptually very different from ordinary [overlapping or non-overlapping] continuous time bounds. As such there is value in clearly identifying which type of bounds it is. And, as the CMIP example has shown, within and over are not per se related to climatologies but more generally to temporal aggregations (and aggregations more generally). Secondly, I am not sure that we have examined in enough detail the risk that overlapping ordinary time bounds may be the same as or similar to climatological time bounds. Without having a good example at hand, there might be examples related to moving averages, which for example were mentioned in the Requirements section of #82.

To sum up, the climatology attribute already exists and has a clear meaning (noncontiguous time bounds), which is then complicated by tightly connecting within and over to this attribute, with the result that otherwise perfectly ordinary time bounds have to be declared as climatology time bounds, as in Karl's issue over at github cmip6dr/CMIP6_DataRequest_VariableDefinitions. Would it not be simpler and cleaner to disconnect the use of within and over from the climatology attribute?

Kind regards, Lars

JimBiardCics commented 4 years ago

@larsbarring @taylor13 Thanks for your clarifications! I think Lars' suggestion about decoupling the use of within and over from climatology is well worth discussion, in particular if some CMIP datasets are already using the terms in cell_methods attributes for non-climatological purposes.

sebvi commented 4 years ago

Very interesting thread, thanks @larsbarring for bringing this up.

We, at ECMWF, use the construction within / over without climatology because we have use cases where it is convenient to use it.

One typical use case is "monthly means of daily means" where we define the following cell_methods attribute:

cell_methods = "leadtime: mean within days forecast_reference_time: mean over days within months"

taylor13 commented 4 years ago

From my reading of the conventions, "within" must precede "over" in cell_methods and only "days" or "years" may follow "within" or "over", so I don't think the illustration above (https://github.com/cf-convention/cf-conventions/issues/197#issuecomment-529540851) would be found in CF-compliant files.

We can discuss whether the conventions need to be extended to include the kind of description used in the ECMWF file.

It might not be obvious, but "climatology" as used in section 7.4 of the conventions extends the concept of climatology beyond the most common use case involving multiple years of data (e.g., 30-year climatology). In CF a climatology can refer to data from a single month, for example: cell_methods = "time: maximum within days time: mean over days" can apply to a single month, week, season, etc. defined by the bounds set to the beginning and end of the time-period.

larsbarring commented 4 years ago

@sebvi I can see what you are aiming at with the construct ...over days within months. On several occasions I have thought along the same lines but always come to the conclusion that, as Karl writes, it is not CF compliant and that my use cases could be handled by the time bounds (but I am not working with forecasts that may involve the additional complexity of different time axes, as with your leadtime and forecast_reference_time).

@taylor13 Yes, I absolutely agree with you, that a climatology can be taken over all sorts of periods and not necessarily over years. And precisely because of this I think that without the help of a climatology attribute there might be a risk for overlap and confusion between a climatology taken over a set of noncontiguous time intervals, and a normal time coordinate involving overlapping intervals.

JimBiardCics commented 4 years ago

@larsbarring You are probably well aware of this, but I want to make sure we are all clear that there is nothing in CF that says that non-climatological bounds cannot overlap.

It seems to me that the use case described by @sebvi was intended to be handled using comments within the cell_methods attribute. I think the cell_methods attribute was not originally intended to capture a precise description of any part of the processing performed to obtain the values in a data variable. It was intended to be broadly notional.

As near as I can tell, within and over were added because a climatology is largely useless if you don't know how it was produced, and the lower/upper bounds formalism is not sufficient to capture the details. The cell_methods attribute was a natural choice for the place to put this information, but it may have also been an unfortunate choice, as it leads to mixing notional and precise information in a single attribute string. It is tempting to use the within and over keywords to describe general time processing, but I'm not sure this should be allowed. If it is allowed, we need to be much clearer about how to properly use them. Either way, it seems we need to make the wording of the convention more prescriptive and precise.

larsbarring commented 4 years ago

@JimBiardCics Yes, overlapping normal (non-climatological) intervals are allowed. What concerns me is that if the climatology attribute is deprecated and climatology -- i.e. noncontiguous -- intervals are identified only by the presence of within / over in the cell_methods, then we might find a situation that is either ambiguous or precludes representation of certain datasets.

For example, if we have the following

cell_method = "time: mean within hours (5 minute interval) time: maximum over hours"

time bounds =
2019-1-1 00:00:00, 2019-1-2 12:00:00
2019-1-1 06:00:00, 2019-1-2 18:00:00
2019-1-1 12:00:00, 2019-1-3 00:00:00
2019-1-1 18:00:00, 2019-1-3 06:00:00
2019-1-2 00:00:00, 2019-1-3 12:00:00
2019-1-2 06:00:00, 2019-1-3 18:00:00
....

is it absolutely clear whether the data is to be interpreted as having a climatological time axis based on noncontiguous 6 hour intervals over a couple of days, or having a 'normal' time axis of overlapping 18 hour intervals?

The presence/absence of a climatology attribute would clearly identify which. I readily admit that this is a made up (and maybe even silly) example, and I may have overlooked some aspect, but I hope that it still helps to illustrate the point that I am trying to make.

taylor13 commented 4 years ago

My understanding is that if cell_methods includes "within" and "over", then if bounds are provided, the variable containing them must currently be given by the climatology attribute. And if there is a climatology attribute, then "within" and "over" should appear in cell_methods. If this is correct, then one can determine if a variable has a climatology axis by either: 1) Checking to see if climatology is defined, or 2) Scanning the cell_methods for the strings " within " and " over ".

This means that we could, if we wanted, deprecate climatology without losing any information. Codes that currently determine whether or not there is a climatology axis by looking for the climatology attribute would, however, have to be modified to instead examine the cell_methods and look for "within" and "over".
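The two checks can be sketched as below. This is not how any particular CF checker works: the function names are hypothetical, the attributes are passed as a plain dictionary rather than read from a file, and the scan does not verify which axis the keywords belong to.

```python
import re

def has_climatology_attribute(time_attrs):
    """Check 1: the time coordinate carries a climatology attribute."""
    return "climatology" in time_attrs

def has_within_and_over(cell_methods):
    """Check 2: cell_methods contains both keywords outside (comments)."""
    words = re.sub(r"\([^)]*\)", " ", cell_methods).split()
    return "within" in words and "over" in words

# Attributes copied from the CMIP6 example at the top of this issue.
time_attrs = {
    "units": "days since 1850-01-01 00:00:00",
    "bounds": "time_bnds",
}
cell_methods = "area: mean time: maximum within days time: mean over days"

print(has_climatology_attribute(time_attrs))
print(has_within_and_over(cell_methods))
```

For the CMIP6 file the two checks disagree: there is no climatology attribute, yet within and over are present. Under the current text that file would therefore not be consistent, which is the situation this issue started from.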

In the example given directly above, I think it is indeed "absolutely clear" that time is a climatological axis because of the specified cell_methods.

larsbarring commented 4 years ago

@taylor13 Yes, I understand your conclusion that it is "absolutely clear" that my example should be interpreted as having a climatology axis. But would not these two alternatives for determining a climatology time axis make it impossible to use the cell methods to represent datasets involving temporal statistics over overlapping time periods? Or, if we require climatology to be present (and skip your second alternative, which is how I interpret the current text in Section 7.4) would not my example be ambiguous (and thus not fully consistent with CF)?

JonathanGregory commented 4 years ago

Dear Lars

In your example, you have within hours and over hours. The syntax of climatological cell_methods doesn't allow hours - do you mean days?

Maybe I have missed your point, but I agree with Karl that your example is clearly indicated to be climatological time by the within and over in the cell_methods. You don't need the climatology attribute to indicate it. If the intention was to indicate statistics calculated over overlapping 18-hour intervals, the cell_methods would contain just time: method, where method is one of the Appendix E methods (mean, maximum etc.), and there would be no within or over for time.

Best wishes

Jonathan

neumannd commented 4 years ago

Although the within and over keywords are listed in section 7.4 on climatologies, I used them for non-climatological data similar to @sebvi. I understand that cell_methods was historically not meant to contain detailed processing information (Thanks, @JimBiardCics for the background). However, it seems reasonable to me to place processing information in a human- and machine-readable format. Additionally, I am not sure where else to place certain information.

Example: We have hourly model output and want to store monthly mean values. We could write it as

time: mean (1 hour interval)

and properly set time_bnds to the monthly intervals. However, we lose the information on whether hourly mean values or hourly point values (with respect to time) were written out by the model. Writing

time: point within hours   time: mean over hours

or

time: mean within hours   time: mean over hours

clarifies the situation. In trac ticket #82 more examples such as max 8-hourly average (for ozone) are listed.
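Why the distinction matters can be shown with a toy calculation. The values below are invented, 10-minute model steps are assumed, and "point" is taken to mean the instantaneous value at the start of each hour; the two cell_methods strings in the comments are the proposed notation from this comment, not forms currently listed in section 7.4.

```python
# Hypothetical model values at 10-minute steps, grouped by hour.
hours = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # hour 1
    [7.0, 5.0, 3.0, 1.0, 2.0, 3.0],  # hour 2
]

# "time: point within hours time: mean over hours"
point_values = [hour[0] for hour in hours]
mean_of_points = sum(point_values) / len(point_values)

# "time: mean within hours time: mean over hours"
hourly_means = [sum(hour) / len(hour) for hour in hours]
mean_of_means = sum(hourly_means) / len(hourly_means)

print(mean_of_points)
print(mean_of_means)
```

Both results would be stored with the same bounds, so without the extra wording a reader cannot tell which of the two numbers is in the file.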


The syntax of climatological cell_methods doesn't allow hours - do you mean days?

@JonathanGregory Why are hours not allowed? The text in 7.4 reads "In the descriptions that follow we use the abbreviations y, m, d, H, M, and S for year, month, day, hour, minute, and second respectively. The suffix 0 indicates the earlier bound and 1 the latter.". Three examples follow, which use only years and days. However, there is no statement disallowing the usage of hour.

JimBiardCics commented 4 years ago

It is clear that we have multiple questions that we are wrestling with. I'm going to try to summarize them.

I think it would be best to resolve the first question before trying to deal with any of the others. I think different assumptions about the answer are leading us to talk past each other to some degree.

larsbarring commented 4 years ago

@JimBiardCics Thanks, these questions nicely summarise the issue. And I agree that we should look at them one after another. I might have one that is even more basic than your first one.

@Jonathan What I was aiming at with the mockup example cell_method = "time: mean within hours (5 minute interval) time: maximum over hours" was a situation where the 5 min data (model or observation) was first averaged within each hour, and then the maximum of these hourly means was taken for each overlapping 18h time period.
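The processing intended here can be sketched as two steps. The function names and the synthetic series are made up for illustration, and the 6-hour step between successive 18-hour windows is an assumption.

```python
def hourly_means(five_minute_values):
    """Average consecutive groups of 12 five-minute values (one hour each)."""
    return [
        sum(five_minute_values[i:i + 12]) / 12
        for i in range(0, len(five_minute_values) - 11, 12)
    ]

def windowed_maxima(values, window, step):
    """Maximum of each window of `window` values, advancing by `step`."""
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, step)
    ]

# Hypothetical series: 48 hours of 5-minute data in which every value
# equals the index of the hour it belongs to.
series = [float(i // 12) for i in range(48 * 12)]

means = hourly_means(series)                         # one mean per hour
maxima = windowed_maxima(means, window=18, step=6)   # overlapping 18 h windows

print(len(means))
print(maxima)
```

Each output value belongs to one continuous 18-hour cell, and successive cells overlap by 12 hours. That is a different structure from a set of noncontiguous intervals, even though the bounds of the two cases can look alike.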

JonathanGregory commented 4 years ago

Dear all

Thanks for the summary, Jim. In response to Lars's more basic question, I would say that climatological time in CF means the time axis for climatological statistics, which are derived from corresponding portions of the annual cycle in a set of years, or corresponding portions of a range of days, or both at once (quoting text from the start of 7.4, where these ideas are introduced). Further down the same section, it says "Valid values of the cell_methods attribute must be in one of the forms from the following list". This is where the syntax using within and over is defined. There are three possible forms, referring to years, days and both, to describe the climatological annual or diurnal cycles or both together. I think this use of the word "climatological" is consistent with climate science literature.

Therefore I would say that the existing convention is already clear about what is allowed. It is not permissible to use within and over with non-climatological data, and they may only be used with years and days. If other things were permissible, the convention would include them in the list of allowed forms for cell_methods.
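A strict reading of that list can be sketched as a small validator. The names `ALLOWED_FORMS`, `qualifier_sequence` and `is_listed_form` are hypothetical, and the parser is deliberately simple: it drops parenthesised comments and assumes each entry names a single axis.

```python
import re

# The three within/over sequences listed in section 7.4.
ALLOWED_FORMS = {
    (("within", "years"), ("over", "years")),
    (("within", "days"), ("over", "days")),
    (("within", "days"), ("over", "days"), ("over", "years")),
}

def qualifier_sequence(cell_methods, axis="time"):
    """Return the (keyword, unit) pairs for `axis`, or None if malformed."""
    text = re.sub(r"\([^)]*\)", " ", cell_methods)  # drop (comments)
    entries = []
    for token in text.split():
        if token.endswith(":"):
            entries.append([token[:-1]])  # a new "name:" entry starts
        elif entries:
            entries[-1].append(token)
    sequence = []
    for name, *rest in entries:
        if name != axis or len(rest) == 1:  # other axis, or plain "time: mean"
            continue
        if len(rest) != 3 or rest[1] not in ("within", "over"):
            return None
        sequence.append((rest[1], rest[2]))
    return tuple(sequence)

def is_listed_form(cell_methods, axis="time"):
    return qualifier_sequence(cell_methods, axis) in ALLOWED_FORMS

cmip = "area: mean time: maximum within days time: mean over days"
lars = "time: mean within hours (5 minute interval) time: maximum over hours"
ecmwf = ("leadtime: mean within days "
         "forecast_reference_time: mean over days within months")

print(is_listed_form(cmip))
print(is_listed_form(lars))
print(is_listed_form(ecmwf, axis="forecast_reference_time"))
```

Under this reading only the CMIP string matches a listed form. The hours example and the ECMWF string would need the list to be extended, which is what ticket 82 proposed.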

However, of course we can change the convention. I am in favour of ticket 82 (mentioned above by Daniel @neumannd), which made some progress three years ago and then stalled. We could revive it. This was to generalise the multiple time processing supported for climatological time by cell_methods to other periods which don't correspond to the natural cycles of year and day, such as 18 hours, in Lars's example. I'm not sure whether we'd call this general concept "climatological time" still or give it another name.

The original question relates to the climatology bounds. I think this attribute is not needed because it's implied by multiple time processing. I can't remember why we included it, but I suspect it's because at an earlier stage we hadn't thought of the general need for multiple time processing described by cell_methods. Perhaps we were thinking only of the mean climatological cycle (e.g. the time-mean January temperature), which is supported by the COARDS convention in a less informative way, and is the commonest use of the concept in practice, I imagine.

Best wishes

Jonathan

larsbarring commented 4 years ago

Jonathan, thanks for answering my basic question. I now understand that "a set of years" can contain only one year, as is the case in the CMIP example above and also the example from ECMWF. While this is obvious in a mathematical context, this interpretation was not clear to me from the introductory text of section 7.4.

To penetrate the example of CMIP time: maximum within days time: mean over days just a little bit further: it is in CF considered to be a "climatological timeseries" (with or without the climatology attribute). On the other hand the conceptually rather similar CMIP monthly mean temperature, "time: mean within days time: mean over days", is according to CF expressed as time: mean. In both cases the monthly resolution is given by the time bounds. So, in the CF [attribute] sense, a 160 year timeseries of the former is a "climatology" but the latter is not a "climatology". Would it be fair to say that this distinction is mainly based on whether or not a sequence of statistical operations can be compacted into one operation? If so, I think that this has to be more clearly stated in the 7.4 text.
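Whether a two-step statistic collapses into a single operation can be checked numerically. The sketch below uses invented values and assumes every day contributes the same number of samples, which is the condition under which a mean of daily means equals the overall mean.

```python
import math

# Hypothetical values: two days, three samples per day.
samples = [
    [1.0, 2.0, 6.0],  # day 1
    [4.0, 5.0, 9.0],  # day 2
]
flat = [value for day in samples for value in day]

# mean within days, then mean over days, versus a single "time: mean"
mean_of_daily_means = sum(sum(day) / len(day) for day in samples) / len(samples)
overall_mean = sum(flat) / len(flat)

# maximum within days, then mean over days, versus a single "time: maximum"
mean_of_daily_maxima = sum(max(day) for day in samples) / len(samples)
overall_maximum = max(flat)

print(math.isclose(mean_of_daily_means, overall_mean))
print(mean_of_daily_maxima, overall_maximum)
```

The first pair agrees, so nothing is lost by writing time: mean. The second pair differs, so the two-step description carries information that a single method cannot.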

martinjuckes commented 4 years ago

Hello All, this area of the convention causes a lot of confusion, so I agree with suggestions that the explanations could be improved.

My provisional answers to Jim's 4 questions are:

(1) Use for non-climatological variables: yes, if you consider monthly mean Tmax as non-climatological;

(2 and 3) I’m not sure about the premise of these questions. I'll give more detail below, but I feel that the cell methods string should be used to give broad, notional information (using Jim’s words) and the climatology attribute can be used to add more detailed information.

(4) The status quo is, as Jonathan explained, clearly restricted to the 3 listed forms. I support retaining this restriction to an explicit list, but I can see the case for extending the list.

Jonathan has said that section 7.4 only applies to "climatological data", but I don't believe that monthly mean daily maximum temperature, which motivated Lars's query, is a variable that would be considered a climatological variable outside the CF Convention.

Jim has pointed out that the cell_methods string is not intended to convey precise information. This suggests to me that we might interpret a variable which has time: maximum within days time: mean over days in the cell_methods string without any further details provided in a climatology attribute as being a variable which is calculated using some form of maximum over time periods less than or equal to a day and averaged over time periods greater than a day. This might be used by someone who is using a well known climatology and doesn't understand how to calculate the required climatology variable.

Perhaps the section would be clearer if we stressed from the start that (currently) 3 forms are supported and turned these 3 supported forms of the cell methods string into 3 subsections:

7.4.1 Multi-year statistics of statistics calculated within each year

time: method1 within years time: method2 over years

Method1 is applied for portions of calendar years (the same portion in each year, the portion can be a whole year), method2 is applied over multiple calendar years.

If a climatology attribute is present, the precise periods over which method1 and method2 are applied can be obtained from the corresponding variable: method1 is applied to the time intervals (mdHMS0-mdHMS1) within individual years and method2 is applied over the range of years (y0-y1).

[examples]

7.4.2 Multi-day statistics of statistics calculated within each day

time: method1 within days time: method2 over days

[explanation and examples]

7.4.3 Nested multi-year and multi-day statistics

The redundant repetition in time: method1 within days time: method2 over days (time and days both repeated) creates the impression that there are more independent pieces of information than actually provided. Would it be clearer if we used a shorter form such as time: method1 within and method2 over days?

JimBiardCics commented 4 years ago

Getting back to the question posed by @larsbarring and his example of cell_method = "time: mean within hours (5 minute interval) time: maximum over hours" — I may be wrong, but I don't see any sense in which this example would be interpreted as a climatology. My working definition for a climatology is "a mean of a measure over a set of intervals that represent roughly equivalent parts of multiple diurnal or annual cycles." (There can be other meanings, but I think this covers the great majority of cases.)

Using this rubric:

The connection to diurnal and seasonal cycles is why days and years are currently the only valid objects for within and over. The within and over constructs are evocative, and I have felt the urge to use them to describe something other than climatology, but I'm concerned that doing so will make the intention of the terms in a cell_methods attribute even harder to understand than it already is.

martinjuckes commented 4 years ago

Hi Jim, I would be happy with that definition of what constitutes "a climatology" .. but it looks to me as though this would exclude Example 7.14 from the convention, which is monthly maximum daily precipitation totals (the same as your last example except that it is over contiguous days rather than hours, and deals with precipitation rather than temperature). Do you also see Example 7.14 as being outside your definition of a climatology?

Example 7.14 is very close to the CMIP example that Lars is asking about, which shows, I think, that the latter falls within the intended scope of section 7.4.

There are also broader interpretations of "Climatological Statistics". All the formal definitions of climatology that I could find simply define it as a synonym for "climate science". Some institutions even include variables such as monthly mean temperature under the heading of climatological data. I think this latter usage is more common in the UK than in the US, so there may be a difference in usage between the two countries.

larsbarring commented 4 years ago

Hi Martin, Jim, Yes, example 7.14 is indeed similar to (in principle the same as) the CMIP example I used in the initial post, implying that it falls under section 7.4. I can (easily) accept this, but then I do think that it is confusing that monthly mean temperature is not treated in the same way (i.e. having a climatology attribute instead of time_bounds). To me this seems like an inconsistency that reduces the climatology attribute to a purely technical term indicating that a sequence of aggregations was used to create the data. But Martin also points out that in the UK (and elsewhere) monthly mean temperature is sometimes also considered a climatology, which implies that CMIP monthly mean temperature (and similar) ought to have the same attribute.

Maybe the can of worms that I seem to have opened is difficult to recan using [only] the existing CF mechanisms. Either we end up with certain CMIP5/6 datasets being not fully consistent with CF, or the CF attribute climatology either largely losing its intended connection to the everyday meaning of the word, or being made superfluous (and deprecated). To me none of these are particularly appealing, and I think that it might be difficult to find a solution by teasing out a definition of what is meant by a "climatology dataset".

JimBiardCics commented 4 years ago

@martinjuckes Example 7.14 pushes my definition pretty hard. It is based on diurnal cycles, but it takes a maximum rather than a mean over the longer time interval. I didn't, for simplicity's sake, try to stretch my definition to include different operations for the longer time interval, so that last bit doesn't bother me too much.

Having said that, I think a time series like Example 7.14 probably isn't really a climatology. If what we mean by climatology is a baseline profile that we can use to study long-term change over time, then the example fails the test, doesn't it? (If I understand correctly, a climatology could be spatial rather than temporal, but the CF convention and this discussion are about temporal climatologies.)

So you are right, we already have a conventions example of the "climatology mechanism" being used for what I believe to be non-climatological data. And if the CMIP6 dataset had used a climatology attribute rather than a bounds attribute, it would have been CF-compliant.

JimBiardCics commented 4 years ago

@larsbarring At the moment CMIP5/6 datasets that look like your original example are not CF-compliant. In fact, even if we change the convention, they won't be compliant with the version of CF they declare themselves to be following via the Conventions attribute unless we make a retroactive change to the convention.

martinjuckes commented 4 years ago

@larsbarring : I think there may be different views on compliance here. I don't understand the basis for the assertion of non-compliance from @JimBiardCics . The files in question are considered as error free by the CF Checker.

JimBiardCics commented 4 years ago

@martinjuckes This is part of the on-going imprecision problem in CF. There is no mention of the 'within x over y' formalism in any area other than the Climatological Statistics section. The intent was that this formalism is only valid in conjunction with a climatology bounds attribute. But CF doesn't say that explicitly. The 'within x over y' formalism has no meaning in a cell_methods attribute outside of the definition in the Climatological Statistics section. What cf-checker does or doesn't do is not particularly relevant.
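The reading described above can be written down as a small check. This is a minimal sketch of that interpretation only, not of the CF text and not of what cf-checker does; the function name `uses_within_over_without_climatology` and the plain-dict representation of the time coordinate variable's attributes are assumptions made for illustration.

```python
import re

# Matches the "within days|years" and "over days|years" keywords of Section 7.4.
_WITHIN_OVER = re.compile(r"\b(within|over)\s+(days|years)\b")


def uses_within_over_without_climatology(cell_methods, time_attrs):
    """Return True if cell_methods uses within/over but the time
    coordinate variable (attributes given as a dict) has no
    'climatology' attribute."""
    has_within_over = _WITHIN_OVER.search(cell_methods) is not None
    return has_within_over and "climatology" not in time_attrs


# The CMIP6 tasmax example quoted at the start of this issue.
cmip6_cell_methods = "area: mean time: maximum within days time: mean over days"
cmip6_time_attrs = {"units": "days since 1850-01-01 00:00:00", "bounds": "time_bnds"}
flagged = uses_within_over_without_climatology(cmip6_cell_methods, cmip6_time_attrs)
```

Under this reading the CMIP6 file is flagged, while the same cell_methods on a time coordinate carrying a climatology attribute, or a plain "time: maximum" with bounds, is not.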

larsbarring commented 4 years ago

@martin Maybe the following points will illustrate the problem:

  1. If we accept the current monthly mean Tmax files in CMIP5/6 as compliant, then we do not need climatology. The attribute then becomes superfluous (and may be deprecated), as was suggested by Jonathan and Karl. However, I would be reluctant to make this move, because I believe that CF would be losing an important and useful mechanism to clearly identify statistical aggregations over noncontiguous periods (i.e. what is mostly in focus in section 7.4). Admittedly, there is then no CF compliance issue in a formal sense, but on the other hand most of section 7.4 becomes meaningless, as we would no longer really need the climatology attribute.

  2. If we on the other hand agree to keep the climatology attribute then, according to Example 7.14, a fair number of CMIP5/6 monthly mean of Xmax files ought to have the climatology attribute. Right? At least that is what Karl concluded earlier in the discussion. If so, then this was something that the CF Checker did not catch.

  3. If 2. is right, then we end up with the conceptual problem that a monthly mean of Xmax requires the climatology attribute but the monthly mean of Xmean does not. While not strictly a CF compliance issue (as clarified by Example 7.14), it certainly does not help to clarify the concepts and meaning of Section 7.4.

martinjuckes commented 4 years ago

Hi @larsbarring , I noticed the discussion you refer to, but I don't agree with the statement that accepting these files as compliant renders the climatology attribute superfluous. I've argued above that we can accept them as compliant and retain the climatology attribute. I do think that these files should have used the climatology attribute because the intention was to specify full information about the time periods used to calculate the data, but failing to provide full information is not generally a CF compliance error. CF works on the principle of defining how to present information, but does not generally insist that information must be provided except in a limited number of cases.

The CF interpretation of the tasmax files should be, I think, that the time_bnds variable contains information about the boundaries of the time intervals represented by the data, but, according to CF, it does not contain information about the sub-intervals used to create the data. This is an encoding error, because the file is intended to contain this information, but it is not a compliance error. I don't see a need to complicate the convention by saying that the variable indicated by the time.bounds attribute can have different meanings depending on what is in the cell_methods attribute of the data variable.

There is plenty of room for debate here, but it is clearly possible to interpret these files as complying with the convention as it stands and retain the climatology attribute.

larsbarring commented 4 years ago

Hi @martinjuckes , Yes, I find what you wrote constructive and going in the right direction. Some specific comments:

(1) Use for non-climatological variables: yes, if you consider monthly mean Tmax as non-climatological

Yes, I certainly agree with this. But what implications does this have regarding when/where cell method constructions like time: maximum within days time: mean over days can be used without the climatology attribute? Some text (in section 7.4?) to explain this would be helpful.

(4) The status quo is, as Jonathan explained, clearly restricted to the 3 listed forms. I support retaining this restriction to an explicit list, but I can see the case for extending the list.

Trac ticket #82 is all about this, isn't it? So, a positive outcome of this discussion might be that this ticket is revived, and brought to acceptance with any relevant new ideas from this thread taken on board.

A more detailed explanation along what you outline in your comment would go a long way towards clarifying section 7.4.

Finally, several of the recent comments mention Example 7.14. It shows how the convention can be used, rather than how it typically would be used, so it does little to help a novice user understand how to apply the mechanisms. As such it is not a particularly good example; it is more confusing than enlightening.

martinjuckes commented 4 years ago

Hi Lars, thanks, and well done for spotting the link with Trac ticket #82. My initial response was to try to keep that discussion separate but, on reflection, I think it is necessary and beneficial to consider it jointly with this issue. It may provide an alternative (perhaps clearer) means of expressing tasmax and Example 7.14, and make it possible to recommend avoiding the use of climatology for this class of statistic.

Concerning the multi-step operations use-case (Trac ticket #82), I support the idea of adding something to the conventions, but I would formulate it slightly differently. As it stands, time: mean (interval: 1 hour) means that data at intervals of 1 hour is used to calculate the mean, and that mean is taken over the domain specified by the bounds attribute (if present). We can extend this to a multi-step process as follows:

When interval is specified, multiple operations can be chained, with each operation being computed at the intervals specified for input into the following operation. For example, time: mean (interval: 5 minute) time: maximum (interval: 1 hour) would specify the maximum of contiguous hourly means, with each mean being evaluated from data at 5 minute intervals. For methods applied on non-contiguous sections of the axis see windows and climatologies below.

This is slightly different from the suggestion made by Jonathan in the Trac ticket: the approach would not allow the syntax to express the interval of the input data for the first method (mean in the above example).

This construct can be introduced independently of interval:

If the processing is applied over a section of the axis which differs in extent from the width of the axis interval (or the interval of the following operation if operations are chained), the width of the section can be specified using window. For example, the daily maximum of an 8-hour running mean evaluated at 1 hour intervals could be specified as: time: mean (window: 8 hours) time: maximum (interval: 1 hour).

The relevance to this ticket is that tasmax could be expressed as time: maximum time: mean (interval: 1 day) combined with a simple time axis with bounds specifying the duration of the mean. Example 7.14 could, instead of time: sum within days time: maximum over days and climatology_bounds, be expressed using time: sum time: maximum (interval: days) and bounds.
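To make the chained form concrete, here is a minimal sketch of how such a string could be split into ordered steps. It is only an illustration of the syntax proposed in this comment, which is not part of CF; the function name `parse_chain` and the decision to recognise only the `interval` and `window` qualifiers are assumptions.

```python
import re

# One step looks like "time: mean", optionally followed by "(interval: 1 hour)"
# or "(window: 8 hours)". Only these two qualifiers are recognised here.
_STEP = re.compile(
    r"(?P<axis>\w+):\s*(?P<method>\w+)"
    r"(?:\s*\((?P<key>interval|window):\s*(?P<value>[^)]+)\))?"
)


def parse_chain(cell_methods):
    """Split a cell_methods string into ordered (axis, method, qualifier) steps."""
    steps = []
    for m in _STEP.finditer(cell_methods):
        qualifier = None
        if m.group("key"):
            qualifier = (m.group("key"), m.group("value").strip())
        steps.append((m.group("axis"), m.group("method"), qualifier))
    return steps


# Daily maximum of an 8-hour running mean evaluated at 1 hour intervals.
steps = parse_chain("time: mean (window: 8 hours) time: maximum (interval: 1 hour)")
```

Keeping the steps in order matters here, because in the proposal each qualifier describes the input to the operation that follows.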

This approach is more flexible than the within/over syntax we have, and perhaps clearer? It could also be used for climatologies. Instead of time: sum within years time: mean over years for a decadal monthly climatology (the fact that it is a monthly climatology is specified in the details of the climatology_bounds variable) we could recommend time: sum (window: calendar_month) time: mean (interval: calendar_year). Here, it is necessary to be explicit about the fact that we are referring to calendar intervals rather than intervals of fixed elapsed time. The fact that years in within years refers to calendar years is a potential source of confusion, making people think they can use it to mean calendar years in the units string. I would like to deprecate the use of years to mean calendar years.

With this approach, we would use the interval construct to specify the intervals over which methods are calculated, rather than relying on analysis of the climatology_bounds variable. The latter approach, which has been in the convention for some time, is not broken -- so there is a good argument for not changing it, but it does lack clarity (the examples need to use a non-compliant extension to CDL to illustrate how the syntax is meant to work) and flexibility, and it is a frequent source of confusion.

There would still be a use for the climatology attribute, but, I think, only for cases which fit Jim's interpretation of "climatology", for which bounds cannot be used. That is, for cases in which the operation produces results which are not on "cells" (bounds defines the boundaries of "cells"). There may be a case for using a more generic term, such as intervals, so that these features can be used on other dimensions (e.g. it would be useful, within CMIP, to be able to express the fact that some spectral data is averaged over dis-joint spectral bands).

In brief, there is the potential for all the essential characteristics of the averaging periods of a climatology to be expressed within the cell_methods string rather than relying on detailed analysis of a related climatology bounds variable. E.g. time: sum within years time: mean over years could be a monthly climatology (monthly sums averaged over years) or it could be daily, weekly, pentads ... and the only way to find out is by extracting a data variable, converting it to ISO date format and analyzing the intervals.
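The kind of analysis this currently requires can be sketched as follows. This is a rough illustration, assuming the bounds are stored as "days since 1850-01-01" in a Gregorian calendar and that each bounds pair holds the start of the first sub-interval and the end of the last one; the function names and the day-count thresholds are hypothetical.

```python
from datetime import date, timedelta


def decode_days(days, reference=date(1850, 1, 1)):
    """Convert 'days since <reference>' to a date (proleptic Gregorian assumed)."""
    return reference + timedelta(days=days)


def within_year_span(start, end):
    """Length in days of the sub-interval used within each year.

    start is the beginning of the first sub-interval and end is the end of
    the last one, as in a climatology bounds pair. A 29 February end date
    is not handled.
    """
    first_end = date(start.year, end.month, end.day)
    if first_end <= start:  # sub-interval crosses the new year (e.g. DJF)
        first_end = date(start.year + 1, end.month, end.day)
    return (first_end - start).days


def classify(start, end):
    """Very rough label for the sub-interval; thresholds are illustrative."""
    span = within_year_span(start, end)
    if start.day == 1 and end.day == 1 and 28 <= span <= 31:
        return "monthly"
    if start.day == 1 and end.day == 1 and 89 <= span <= 92:
        return "seasonal"
    return f"{span}-day"


# January climatology over 1961-1990, with bounds given as days since 1850-01-01.
start = decode_days((date(1961, 1, 1) - date(1850, 1, 1)).days)
end = decode_days((date(1990, 2, 1) - date(1850, 1, 1)).days)
label = classify(start, end)
```

None of this information is visible in the cell_methods string itself, which is the point being made above.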

If this isn't shot down (it would be nice if there was a simpler solution), I suggest we discuss it at the next CF meeting.

davidhassell commented 4 years ago

Hello,

The timings and order of the breakout groups for the CF meeting next week have now been set (see http://cfconventions.org/Meetings/2020-Workshop.html), and the discussion of this issue will be on Wednesday 10 June from 17:30-19:00 UTC, in parallel with three other topics.

Thanks.

larsbarring commented 4 years ago

Notes from the 2020 CF Workshop Breakout session on Cell Methods are now available.

Highlights:

sethmcg commented 4 years ago

A use case for cell_methods involving climatology:

We have an archive of regional climate model outputs. To improve their usability for impacts users, we generate various aggregations at different frequencies. We provide daily, monthly, and seasonal timeseries data, as well as monthly and seasonal climatology data.

(To be explicit: a ten-year monthly timeseries file would contain 120 timesteps, one value for each month in sequential order; a ten-year monthly climatology file would contain 12 timesteps, each an average of the values for that month over all ten years.)
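The difference in shape can be shown with a few lines of Python. This is a toy sketch with made-up values, not our actual workflow or tooling; the names `timeseries` and `climatology` are chosen only for illustration.

```python
from statistics import mean

N_YEARS = 10

# Hypothetical monthly timeseries: value = month number (1..12) + year offset,
# in sequential order, so 10 years give 120 timesteps.
timeseries = [month + year for year in range(N_YEARS) for month in range(1, 13)]

# Monthly climatology: for each calendar month, average that month over all years.
climatology = [mean(timeseries[m::12]) for m in range(12)]
```

Each climatology value draws on ten non-contiguous months, which is why a single pair of ordinary cell bounds cannot describe it.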

The workflow for generating these files is chained. To generate a monthly climatology, we first calculate a monthly average timeseries from a daily timeseries by averaging together daily values for each month, then calculate a monthly climatology by averaging together monthly values across multiple years. If the data variable is something like tasmin or tasmax (daily minimum or maximum temperature), some models will output that directly, but in other cases we may need to calculate it from hourly values.

The cell_methods attribute for tasmax monthly climatology thus ends up looking like this: tasmax:cell_methods = "time: maximum time: mean time: mean within years time: mean over years" ;

As best I understand the spec, this is completely CF-compliant and correct, although it requires some human interpretation to understand that it means we started with daily maximum values, averaged them to monthly values, then averaged those to monthly climatology.

The optional (interval: ) construct adds some information that could clarify that to a degree (although it would arguably be more useful to have the ending interval than the starting interval, or better yet both), but we don't use it because we are automating this workflow across a very large number of files, the tools we use are generic, and we found that it was too hard to programmatically determine the starting frequency; we ended up with a lot of mangled cell_methods when we tried.