SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Time points for daily climatology are not intuitive #4098

Closed ehogan closed 2 years ago

ehogan commented 3 years ago

📰 Custom Issue

Hello :)

I have a cube containing 4 full years of daily data. Using aggregated_by, I aggregate the cube using a "day" and "month" coordinate. This produces a daily climatology with time points that start in July 🤯 (the bounds are as expected).

Mathematically, this makes sense, since the mean of, e.g. (Jan 1, 2000), (Jan 1, 2001), (Jan 1, 2002) and (Jan 1, 2003) is in July 2001.

However, scientifically, this doesn't make sense, since a scientist would expect the first time point of a daily climatology created as above to be Jan 1.

The CF conventions state (at the end of paragraph 2 in Section 7.4): "The time coordinates should be values that are representative of the climatological time intervals, such that an application which does not recognise climatological time will nonetheless be able to make a reasonable interpretation." I'm not sure this is happening in this case.

Would it be better to set the time points in the climatology file such that they are more representative of the climatological time intervals?
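The arithmetic behind the surprising points can be reproduced with plain datetime objects (a minimal sketch independent of Iris, assuming the aggregated point is the mean of the contributing points):

```python
from datetime import datetime, timedelta

# Mean of the Jan 1 points from four consecutive years, mirroring how
# the point of each aggregated cell is computed.
points = [datetime(year, 1, 1) for year in (2000, 2001, 2002, 2003)]
epoch = datetime(2000, 1, 1)
mean_seconds = sum((p - epoch).total_seconds() for p in points) / len(points)
mean_point = epoch + timedelta(seconds=mean_seconds)
print(mean_point)  # 2001-07-02 06:00:00 - mid-2001, not Jan 1
```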

rcomer commented 3 years ago

Related: #1422

ehogan commented 3 years ago

Thanks @rcomer; I did search for previous issues but apparently not very well! 😜

rcomer commented 3 years ago

Hi @ehogan, would the ability to tell it to use the max or min point help? #4295

ehogan commented 3 years ago

Thanks @rcomer; our current workaround resets the time points in the climatology cube to the time points from the first year of the input cube, so using the min point would help! :)
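For illustration, the min-point workaround can be sketched with plain numpy arrays standing in for the time coordinate's points and bounds (the values here are hypothetical, in days since 2000-01-01):

```python
import numpy as np

# Hypothetical bounds for the first two cells of a four-year daily
# climatology, in days since 2000-01-01.
bounds = np.array([[0.0, 1096.0],   # Jan 1 cell: 2000-01-01 .. 2003-01-01
                   [1.0, 1097.0]])  # Jan 2 cell: 2000-01-02 .. 2003-01-02

default_points = bounds.mean(axis=1)  # current behaviour: mid-point, lands mid-2001
min_points = bounds[:, 0]             # workaround: lower bound, i.e. the first year

print(default_points)  # [548. 549.]
print(min_points)      # [0. 1.]
```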

trexfeathers commented 3 years ago

I was thinking about this yesterday. Given Coord has the climatological attribute, this seems like a reasonably 'soluble' problem. Not sure if I have enough time though, we'll see.

trexfeathers commented 3 years ago

I've taken a brief look this morning, and here is the offending line:

https://github.com/SciTools/iris/blob/0af0c8d447bb90a4f693270ea20ffbfd1ee97f79/lib/iris/analysis/__init__.py#L2275

I had originally imagined adding a climatological parameter to Cube.aggregated_by(), but this would need to be passed down through several call 'layers' before it reached the above line. So it might instead be better to add code determining that the precursor coordinate in question looks climatological (covers multiple years but only one point within each year?), and to make sure the user expects the different behaviour via a clear docstring.

rcomer commented 3 years ago

covers multiple years but only one point within each year?

I think just "covers multiple years" would do it. I might, for example, start off with monthly means and then aggregate by season to make spring/summer/autumn/winter climatologies, each containing 3 points from each year.

IMO, what is done for aggregated_by points should also be done for collapsed points, so there is another offending line here:

https://github.com/SciTools/iris/blob/0af0c8d447bb90a4f693270ea20ffbfd1ee97f79/lib/iris/coords.py#L2009

rcomer commented 3 years ago

Or if you want to make it opt-in but don't want to pass the parameter down the "plumbing", could use a context manager instead?

rcomer commented 2 years ago

If the plan is to use the upper or lower bound as the point, it may also be worth considering #3456.

Recent discussion on Met Office Yammer indicates that climatologies loaded from pp-files have the points matching the upper bound, which does cause problems in coord-categorisation. I understand that @lbdreyer may be looking into that issue, but it would be good to make the outcome of this consistent with the outcome of that I think.

rcomer commented 2 years ago

See also #4665, which raises the same problem for monthly climatologies.

mokhodge commented 2 years ago

I also have the same bug when calculating a monthly standard deviation: monthly_std = cube.aggregated_by('month_number', iris.analysis.STD_DEV)

wjbenfold commented 2 years ago

In discussion earlier, @rcomer, @jamesp and I concluded that the main blocker to fixing this is a decision as to what the behaviour should be. Broadly, we shouldn't be guessing what the user wants unless they asked us to.

An option would be to ask the user to choose / make them choose. I'm aware that this is similar to forcing them to fix it afterwards, but it should make that process quicker (and we could leave the current default in place as it's probably not a bad common sense choice when we're not working with a climatology).

Would that work for you @ehogan @mokhodge? Or is there better logic that you're aware of that we could use to derive a default that's reasonable in more circumstances?

n.b. #4295 showed an implementation of the above suggestion.

ehogan commented 2 years ago

Thanks @wjbenfold, @rcomer, @jamesp! I would be happy to specify the required behaviour when creating the climatology. Would it be possible to include a climatological example in the documentation that demonstrates how to request this behaviour, please? :)

tkcollier commented 2 years ago

@wjbenfold, @rcomer, I have found this behaviour to be quite inconvenient, but I do deal with climatologies a lot. It was worse when I was new to Iris, because it was only further down the line, when trying to complete another process with my data, that I realised the coordinate had been effectively reversed (Dec was June, etc.). It sounds good to put in an option to fix this. Would it also be possible to put in a warning message that lets users know that when they use this function this can happen if they don't choose otherwise?

wjbenfold commented 2 years ago

Would it also be possible to put in a warning message that lets users know that when they use this function this can happen if they don't choose otherwise?

Hi @tkcollier, I'm not sure this is a good route to take (though open to arguments to the contrary). When we add more warnings, people become less sensitised to warnings, and for a lot of people (everyone using aggregated_by on something that isn't a climatology) there will be a warning here that doesn't matter to them.

I'd hope that adding an argument to the function that lets you choose how to overcome this (with associated information in the docstring) and an example in the docs (as @ehogan requested above) would allow someone hitting this issue to follow a chain of discovery something along the lines of

"that's odd" -> checks docs for function -> learns how to fix it -> makes fix

without causing warning fatigue in those for whom it isn't relevant?

rcomer commented 2 years ago

I think @tkcollier has a good point that you don't want to find out that things are odd after you've done a lot of processing. You could conceivably have developed your code against a data set with an odd number of years - in that case the current behaviour would give you something that looked sensible in your time coordinate. Then you add a year to your data set and only now realise something is wrong.

So I think I'm leaning back towards @trexfeathers' idea of changing the default behaviour when the operation "looks climatological". It could be configurable as well, of course.

We could add in a warning that is only thrown when the data "looks climatological", which would avoid the problem of clogging up the stderr when the warning is irrelevant. But that's basically admitting that the default behaviour isn't helpful...

mokhodge commented 2 years ago

@wjbenfold @rcomer @jamesp Specifying an option would work for me. If it is a monthly climatology then it would be fine in most use cases to just return the month time points, e.g. 01, 02, 03..., and the same for daily, rather than providing the whole DD-MM-YYYY string.

wjbenfold commented 2 years ago

if the operation "looks climatological"

@rcomer I'm a little bit worried that guessing this would lead to hard to diagnose errors, but maybe just looking for non-contiguous ranges would do it.

just return the month time points e.g. 01, 02, 03...

@mokhodge I'm not sure what you mean here? Do you mean having a time coordinate that just gives a month but no year (and if so, how would that interact with the bounds, which have day, month and year)?

I went digging in the CF conventions some more (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#climatological-statistics) and found the examples. Based on example 7.9, it looks like the correct choice for @ehogan's original question is 16/1/2000?

ehogan commented 2 years ago

@wjbenfold example 7.9 is a seasonal climatology example; for the daily climatology example, I would expect there to be as many time points as there are days in a year (as defined by the input data), starting from Jan 1.

mokhodge commented 2 years ago

@wjbenfold I think what I meant is what @ehogan is describing: if I give it a list of time points covering, say, 11 different months, then I would want the climatology to be 11 points long, with a time point and corresponding .data for each month. If doing a monthly climatology then I think the day and year returned are pretty meaningless. Although if the CF conventions say it needs to return a day and year then fine, but some kind of warning describing why it has returned time points as DD-MM-YYYY (and how they are calculated) instead of just MM would be useful. All I want to know, for example, is what the mean is for each month in the time series, and that the first month of the returned climatology corresponds to the first month in the original time series I provided, and so on.

Bonus functionality would be if, given timeseries data for, say, Jan, March and October, it returned a monthly climatology with time points for Jan, March and October (months 01, 03 and 10), rather than time points for months 1, 2, 3 (or 0, 1, 2 if using Python indexing conventions) where the data for month 2 is actually for March. Because I think the latter is what might happen in Iris at the moment. Not sure if that makes sense - let me know.

rcomer commented 2 years ago

if the operation "looks climatological"

@wjbenfold I'm a little bit worried that guessing this would lead to hard to diagnose errors, but maybe just looking for non-contiguous ranges would do it.

I wondered about checking contiguity too, but any coordinate that doesn't have bounds is non-contiguous. Also you might average over a long contiguous time series to get an annual climatology. The simplest thing would be to just check whether the upper and lower output bounds are more than a year apart, and declare a climatology in that case. I'm racking my brains to think of an example where that would give a problematic false positive...
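That "more than a year apart" check could be sketched as follows (a hypothetical helper, assuming time bounds expressed in days):

```python
import numpy as np

def looks_climatological(bounds, days_per_year=365.25):
    """Hypothetical check: does any output cell span more than a year?"""
    bounds = np.asarray(bounds, dtype=float)
    spans = bounds[:, 1] - bounds[:, 0]
    return bool((spans > days_per_year).any())

# A four-year daily climatology cell spans ~3 years...
print(looks_climatological([[0.0, 1096.0]]))  # True
# ...whereas an ordinary monthly-mean cell spans ~31 days.
print(looks_climatological([[0.0, 31.0]]))    # False
```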

@mokhodge If doing monthly climatology then I think the day and year returned are pretty meaningless.

For your use case, could you just ignore the output time coordinate and use the month number coordinate instead?

wjbenfold commented 2 years ago

I've been thinking a bit more about this, and I'm drawn to a climatological argument to Cube.aggregated_by that causes it to set the points based on the point in the first cell that's contributing. It can then set the climatological boolean on any coordinates that have been treated as such whilst it's doing the aggregation.
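A minimal sketch of that proposed behaviour, using plain numpy arrays rather than the real aggregated_by plumbing (the points and group indices here are hypothetical):

```python
import numpy as np

# Time points for Jan 1 of 2000..2003, in days since 2000-01-01, all
# falling into one "Jan 1" aggregation group.
points = np.array([0.0, 366.0, 731.0, 1096.0])
groups = [[0, 1, 2, 3]]

mean_points = [points[g].mean() for g in groups]  # current: [548.25] (mid-2001)
first_points = [points[g[0]] for g in groups]     # proposed: [0.0] (2000-01-01)
```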

should also be done for collapsed points

@rcomer would you see this as meaning something needed to change for collapsed points too? I've just about got my head around climatological aggregation but can't tell if you can also do a climatological collapse.

rcomer commented 2 years ago

@wjbenfold if I only need a climatology for DJF, then I would just load all the data for December, January and February and collapse across the lot. I would expect the point of the resulting scalar time coordinate to be consistent with having done aggregated_by("season", iris.analysis.MEAN) on a complete data set.

Having said that, if we're adding an optional keyword rather than changing the default behaviour then I don't think the inconsistency would matter as much.