Open larsbarring opened 3 years ago
Dear @larsbarring
Thank you for these detailed and thoughtful proposals.
I agree with introducing the distinction between strictly_above|below_threshold and at_or_above|below_threshold. Which choice would be made for aliasing the affected existing standard names?
I appreciate the need to support numbers of occurrences and spell lengths in units other than days, but I'm not quite convinced about the generalisation for various reasons. The main one is that "days" and "hours" in this connection might not mean just durations of time, but might refer to date-times. That is, "days" when X is true might mean calendar days, not any period of 24 hours, and similarly "hours" when X is true might mean hours on the clock. If X is true between 1245 and 1315, is that one occurrence (because 30 min is less than 1 hour) or two occurrences (because it was true at some time in both 1200-1300 and 1300-1400)? Please could you clarify what is meant in common use cases? Apart from that, I'm a bit uneasy about relying on the comment "(interval)" in cell methods for making a critical distinction in metadata, and I don't like total (in general) because it's vague. I wonder whether we can use a sum method over time in cell methods for this purpose - but that depends on the answer to my first question.
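The two readings of the 1245-1315 case can be made concrete with a small sketch. It is only an illustration, not part of any proposal: the function names clock_intervals_touched and duration_based_count are hypothetical, and treating the event as the half-open range [start, end) with intervals aligned to midnight is an assumption.

```python
import math
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)


def clock_intervals_touched(start, end, interval=HOUR):
    """Count clock-aligned intervals that overlap the half-open event [start, end).

    Assumption: intervals are aligned to midnight of the start date.
    """
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    # Index of the clock interval that contains `start`.
    first_index = (start - midnight) // interval
    t = midnight + first_index * interval
    count = 0
    while t < end:
        count += 1
        t += interval
    return count


def duration_based_count(start, end, interval=HOUR):
    """Count intervals purely by elapsed duration, rounding up."""
    return math.ceil((end - start) / interval)


# The example from the comment: X is true between 12:45 and 13:15.
# The date itself is arbitrary.
start = datetime(2021, 5, 25, 12, 45)
end = datetime(2021, 5, 25, 13, 15)

print(clock_intervals_touched(start, end))  # hours on the clock: 2
print(duration_based_count(start, end))     # duration of time: 1
```

The same 30-minute event therefore yields two occurrences under the "hours on the clock" reading and one under the "duration" reading, which is the ambiguity raised above.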
For the first and last occurrences, and beginning and end of spells, it seems to me that we could generalise this nicely using cell methods. Could we have a standard name of time_with_condition, whose units are like those of a time coordinate i.e. "time_unit since reference"? Then the first/beginning would have "time: minimum" in cell methods, and last/end "time: maximum". What do you think?
Best wishes
Jonathan
Hello Lars,
Sorry that it has taken me so long to look at this. I fully agree with the need for these changes, and with the essence of your proposed solutions. I would like to suggest some alternative encodings for the same artefacts, if I may:
1. number_of_days_with_X_above|below_threshold (deprecation)
Based on this I would like to suggest five currently existing standard names (v.77) should be deprecated in favour of standard names following the pattern total_duration_of_X_strictly_above|below_threshold, canonical unit second, and total_duration_of_X_at_or_above|below_threshold, canonical unit second. Alternatively total_duration_of_intervals_with_X_strictly_above|below_threshold, canonical unit second, and total_duration_of_intervals_with_X_at_or_above|below_threshold, canonical unit second.
These new names seem to lose the essence of the physical quantity, which is a number of periods, rather than a duration of time.
I think that there could be conflicts or ambiguities with using the cell method interval: in this way, because it would not be clear if the interval related to the interval between the original data values, or had a special meaning relating to this standard name. For example, for the name "total_duration_of_intervals_air_temperature_above_threshold" we might want to record that the original data were hourly observations (interval: 1 hour), which could conflict with our need to specify the scientific period in question (i.e. 1 day).
I'm not a fan of encoding numbers in strings (via interval: in this case). We have to live with what we already have in the conventions, of course, but personally I would not like to see it propagating.
As an alternative, I wonder if we could get round this with a name like number_of_periods_with_X_above_threshold, canonical units 1, which must be used in conjunction with an auxiliary coordinate variable that stores the period duration. The auxiliary coordinate variable would need a new standard name, such as period_duration, canonical units s. Here, "period" and "duration" have the same usage patterns as their use in other standard names, e.g. "forecast_period", "flood_water_duration_above_threshold", etc.
2. first|last_occurrence_of_X_...
How about, e.g. time_of_first_occurrence_of_period_with_X_above_threshold, canonical units s, which must be used in conjunction with an auxiliary coordinate variable that stores the period duration (see 1.). Given that we already have a mechanism for storing elapsed time since a reference date, can we not just use that by setting variable units of (e.g.) days since 2021-03-01? The description would make it clear that the variable stores dates (e.g. "reference_epoch"). This is similar to what Jonathan suggested.
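As a rough sketch of this idea (and of the time: minimum / time: maximum suggestion earlier in the thread), the first and last occurrence can be found as the minimum and maximum of the times at which the condition holds, and then stored as elapsed time since a reference date. The function names, the sample values and the threshold of 10.0 are made up for illustration; only the unit string "days since 2021-03-01" comes from the comment above.

```python
from datetime import datetime, timedelta


def encode_since(t, reference, unit=timedelta(days=1)):
    """Encode a date-time as elapsed time since `reference`, in multiples of `unit`."""
    return (t - reference) / unit


def first_and_last_time_with_condition(times, values, condition):
    """Return (first, last) time at which `condition(value)` holds, or None if never."""
    hits = [t for t, v in zip(times, values) if condition(v)]
    if not hits:
        return None
    # "time: minimum" gives the first occurrence, "time: maximum" the last.
    return min(hits), max(hits)


reference = datetime(2021, 3, 1)  # as in "days since 2021-03-01"
times = [reference + timedelta(days=d) for d in range(5)]
values = [1.0, 12.0, 3.0, 15.0, 2.0]  # hypothetical daily values of X

first, last = first_and_last_time_with_condition(times, values, lambda v: v > 10.0)
print(encode_since(first, reference), "days since 2021-03-01")  # 1.0
print(encode_since(last, reference), "days since 2021-03-01")   # 3.0
```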
3. spell_length_of_days_with_X_above|below_threshold (deprecation)
Based on this I would like to suggest that the currently existing four standard names (v.77) following the pattern spell_length_of_days_with_X_... should be deprecated in favour of standard names following the pattern spell_length_of_X_strictly_above|below_threshold, canonical unit second, or spell_length_of_X_at_or_above|below_threshold, canonical unit second.
In a similar vein to 1. and 2., how about, e.g. spell_length_of_periods_with_X_at_threshold, canonical units s, which must be used in conjunction with an auxiliary coordinate variable that stores the period duration.
4. beginning|end_of_spell_with_X_... (new)
Analogous to the second point there are use cases for analysing when during a period the spell begins/ends. The technical details given under point 2 apply here, thus I move directly to suggest these new standard name patterns: beginning|end_of_spell_with_X_strictly_above|below_threshold, canonical unit second, and beginning|end_of_spell_with_X_at_or_above|below_threshold, canonical unit second.
In a similar vein to 1., 2., and 3., how about, e.g. time_of_beginning_of_spell_length_of_periods_with_X_at_threshold, canonical units s, which must be used in conjunction with an auxiliary coordinate variable that stores the period duration. The description would make it clear that the variable stores dates. This is similar to what Jonathan suggested.
All the best, David
Dear Jonathan and David, Many thanks for your thoughtful and detailed (and critical :-) comments. They are all much appreciated.
Let me in this comment start with Jonathan's first question that directly relates back to the conversation in cf-convention/vocabularies#31:
I agree with introducing the distinction between strictly_above|below_threshold and at_or_above|below_threshold. Which choice would be made for aliasing the affected existing standard names?
Can we in the old standard name definitions add something like
"This standard name is deprecated in favor of ..._strictly_above|below_... or ..._at_or_above|below_... depending on what is most suitable for the dataset at hand."
or
"This standard name is deprecated in favor of ..._strictly_above|below_... or ..._at_or_above|below_... depending on what is most suitable for the dataset at hand. If it is not possible or relevant to make a distinction between these alternatives it is suggested to use ..._strictly_above|below_..., which is generally relevant for high-precision (floating point) data."
This can, I am sure, be written more elegantly.
I will return to the other points in the following comments.
And now over to Jonathan's second point (first question as he writes):
I'm not quite convinced about the generalisation for various reasons. The main one is that "days" and "hours" in this connection might not mean just durations of time, but might refer to date-times. That is, "days" when X is true might mean calendar days, not any period of 24 hours, and similarly "hours" when X is true might mean hours on the clock. If X is true between 1245 and 1315, is that one occurrence (because 30 min is less than 1 hour) or two occurrences (because it was true at some time in both 1200-1300 and 1300-1400)?
Indeed, I agree. In principle this problem is also present in the current standard names based on days. In the world of model data, reanalyses and similar I imagine a day is the same as a calendar day (12-12 -- 12+12). But when it comes to daily data based on manual observations it is/was common to use the morning reading of the raingauge to define daily total precipitation of the day before, e.g. the day is 12-06 -- 12+18 instead of the calendar day. Similarly, the daily maximum and minimum temperature might have been read at the afternoon reading (day is 12-18 -- 12+06). While this is rooted in the practices during the era of manual observations, it still is common practice to use the same definitions for automatic stations. This is something that I have recently been dealing with when working with model data and surface reanalyses.
As you point out, this becomes more complicated when going to higher resolution because there is no natural cycle to use as a baseline. But the reference to the natural diurnal cycle is mainly valid for maximum and minimum temperature. In the case of (e.g.) precipitation, one shower around midnight (or 6 o'clock in the morning) might cause two days to exceed the threshold (e.g. 10 mm/day), whereas the same shower occurring at another time of the day would result in just one day exceeding the threshold. So, I believe this is something that depends on the definition of the climate index as such.
One concrete practical use case for hourly precipitation comes from a presentation at a workshop a couple of years ago, pdf (1.5Mb) available here, where the aim of having hourly counterparts to many common indices was clearly expressed (slides 8, 13-17). In addition to this use case there are high-resolution radar data, e.g. at 5-minute and 15-minute resolution, with which I do not myself have much practical experience (maybe someone from the radar community could share some insights?).
I think that all this points towards the temporal resolution as a central property to describe, rather than the timing of the intervals. I realise that this may be a complication for the first|last|beginning|end indices -- I might not have thought deep enough about this when writing the proposal. But let's cross that bridge when we come there, first let's see where we get on this part.
Edit: All this relates to my first point, on number_of_days_... in the initial proposal. Both Jonathan and David have other more technical comments, suggestions and questions that I think are better dealt with once we have made some progress on this more principal level.
Dear @larsbarring
Thanks for your responses. On the first one, I agree that it's fine to keep the original names in their own right and not make them aliases. I would suggest recommending the old names in the case when the distinction is undefined or irrelevant (rather than recommending one of the new precise names, as you propose).
On the second one, I see that we all agree on the problem. In that case, I would prefer @davidhassell 's solution of number_of_periods_with_X and period_duration as the standard name for the auxiliary coordinate variable. A possible alternative that occurs to me is number_of_time_intervals_with_X with time_interval for the auxiliary coordinate variable, which has the advantage of using the same phrase for both standard names and echoing the interval keyword of cell methods.
The bounds of the time coordinate variable will imply the boundaries of the periods or intervals. For example, for number of occurrences of maximum temperature in days starting at 0600, you'd expect the time bounds to be for 0600 on the first and last days considered.
Best wishes
Jonathan
On the second one, and as we agree on the general problem, let's focus on the more technical details, where I think we agree on several points:
- sum.
- days vs. 1, I here have to argue against myself by considering the example of having five intervals, each one 15 minutes, where the total duration would be 75 minutes. This would not be particularly helpful in an automatically generated plot.
- number_of_time_periods_with_X for the data variable and time_period for the auxiliary coordinate work (the finer details of semantics might escape me here)?
Now, going back to the first point: I think that it might be more confusing than helpful to keep the old standard name for two reasons. Firstly, someone is producing new datasets (an analyst manually, or more automatically in a workflow). Somewhere at this stage the decision has to be made whether to use a strict or non-strict comparison. For new datasets I can see no reason for not being precise about this decision. True, for some datasets it does not make much of a difference, in which case the recommendation should be to use one of the precise ones (I suggested the strict alternative), not to use an imprecise one. Secondly, if we now introduce a set of more general standard names (wrt intervals) it would be more confusing to keep the old ones (that additionally are less precise as per previous point). As far as I understand CF is always trying to avoid overlaps and duplication of different elements. Is there a strong use case for keeping the old ones?
Dear @larsbarring
I agree that "period" is an attractive word, but I preferred "interval" because "period" also refers to the physical idea of a recurrent phenomenon in existing standard names (waves especially), which doesn't seem like quite the same thing to me. For myself, using the same word for the same concept as in cell methods comments is an advantage, rather than a possible confusion.
I accept your argument for recommending use of the precise threshold-comparing names in future, and corresponding to deprecate the existing vague ones (although they will remain in their own right, and not as aliases).
Best wishes
Jonathan
Dear @JonathanGregory
In fact I was thinking about the same interpretation of "period" as you point at; a recurring phenomenon. But checking some online English dictionaries led me to think that time_periods was distinct enough from period. Initially I thought that time_interval is a good phrase, then I came to think about how interval is used in cell methods comments, where it is used to specify a completely different type of interval, even though they at some level of abstraction are conceptually the same (as are all "intervals"). And in datasets using the standard names we now are discussing I would expect that both types of intervals would be present. E.g. standard name element time_interval plus auxiliary coordinate variable named time_interval (with unit = "hours" and data = 1) on the one hand, and a cell method comment stating time: maximum (interval: 1 minute) on the other hand -- is the distinction clear enough to avoid confusion?
Dear @larsbarring
OK, I accept your argument about possible confusion with "interval". I admit to another discomfort with time_period, which is that I think "period" always refers to time, so it's tautological. Maybe we could use plain "period" in number_of_periods_with_X for the data variable and period for the auxiliary coordinate, unless there is some suitable adjective which could describe this sort of "period" that we could usefully insert to be more informative. I can't think of one just now!
Best wishes
Jonathan
Hi @JonathanGregory and @larsbarring ,
Being involved in metadata standards for climate indices and indicators for quite a while, I am following your discussions but did not interact yet.
I agree with the decisions taken so far :) A short comment about time period, it is true it is identified generally as a redundancy! https://brians.wsu.edu/2016/05/25/time-period/
All the best Christian
Hi @pagecp , Thanks for the input/support and for the link, it was fun (and for me educational :-) reading some of his examples.
@JonathanGregory as time period is out I honestly do not know which is better:
- period reserved for recurring phenomena, and accept possible confusion by using time_interval for the standard names we now are discussing
- period for the purpose we are now discussing and ignore that this phrase is also used elsewhere in relation to recurring phenomena, thus avoiding possible confusion with the cell method comment. I guess that the auxiliary coordinate should then have the standard name period_duration as per @davidhassell's suggestion.
- period for recurring phenomena?
After concluding which one to use, I think that we might have covered all aspects related to the first group, (number_of_...) indices, from the initial proposal.
Looking up synonyms for "period", I see time_span as a possibility. To me it sounds rather "neutral" - it's a time-interval or period without any other implication, and it's different from period or interval, thus avoiding those confusions. It could be used in both the standard name and the auxiliary coordinate, I think.
I think episode could be misunderstood, as it more likely would mean the time during which X was actually true, rather than the period during which it was tested.
Hello, I've just caught up with the discussion on word choice.
I think including the word "time" in the name is good, making it explicit that we are talking about a temporal phenomenon, rather than it being implicit in another stand-alone word (like "interval" or "period"). However "number_of_time_spans_with_X" doesn't sound right to me, possibly because part of my mind wants to read spans as a conjugated verb, rather than a noun.
I currently prefer number_of_time_intervals_with ... and the auxiliary coordinate having standard name time_interval. The fact that "interval" may also be used in a cell method is OK by me, as the cell method interval is the same type of thing as the interval in the new standard name - i.e. a range of a quantity along an axis.
Thanks, David
As I wrote before, I initially thought that number_of_time_intervals_with_... is a good phrase, but then grew more reluctant as the time interval in the standard name is rather different than the one in the cell methods. But if both you and Jonathan think this is the best alternative I can certainly go along; I definitely do not want this to stall further progress.
As a next step I created a mockup description/definition of one of the new standard names. Basically I used the existing descriptions and just changed phrases and elements according to what we have discussed so far. The intention is not to be precise and produce a polished text, but rather to see if there are any major issues to address before moving on.
number_of_time_intervals_with_air_temperature_strictly_below_threshold
Air temperature is the bulk temperature of the air, not the surface (skin) temperature. A variable whose standard name begins with "number_of_time_intervals_with_..." is the count of how many of the time intervals meet the condition during the time span specified by the time coordinate bounds. It must have a scalar coordinate with the standard name "time_interval" to supply the duration of the time interval, and another coordinate variable or scalar coordinate variable with the standard name "air_temperature" to supply the threshold(s). It must have a climatological time variable, and a cell_methods entry for within time_interval which describes the processing of the air_temperature within each time interval, i.e. before the threshold is applied. A "number_of_time_intervals" is an extensive quantity in time, and the cell_methods entry for over days should be "sum".
The original sentence specifically uses "days", which I changed to "time_interval". As I understand it, this would require an extension of which time intervals are allowed together with within in cell methods; something that has been raised over and over again in different email threads (even Trac tickets if memory serves me).
I do not want this issue, which focusses on new standard names, to be totally diverted into a conversation about climatological time axis, within/over and allowable time intervals in cell methods. This is better handled in a separate issue. Nevertheless it would be good to here get an indication of whether all this might stall progress on the current issue.
Hi Lars,
I am worried that I've missed a big point somewhere down the line, but I thought that the introduction of the "time_interval" auxiliary coordinate variable meant that we didn't need to overload the climatological cell methods, rather that the new formulation would work well with the existing climatological cell methods.
Given that (?) I suggest a definition of:
number_of_time_intervals_with_air_temperature_strictly_below_threshold
Air temperature is the bulk temperature of the air, not the surface (skin) temperature. A variable whose standard name begins with "number_of_time_intervals_with_..." is the count of how many of the time intervals meet the condition during the time span specified by the time coordinate bounds. It must have a scalar coordinate or auxiliary coordinate variable with the standard name "air_temperature" to supply the threshold(s). It must have a scalar coordinate variable with the standard name "time_interval" to supply the duration of the time interval. The time span specified by the time coordinate bounds must be the sum of a whole number of the time intervals. A "number_of_time_intervals" is an extensive quantity in time, and the cell_methods entry should be "sum".
Here's an example of the number of 6 hour time intervals for which air temperature is below a threshold during April 1960:
dimensions:
lat = 73 ;
lon = 96 ;
nv = 2 ;
variables:
float n1(lat, lon) ;
n1:standard_name = "number_of_time_intervals_with_air_temperature_strictly_below_threshold" ;
n1:coordinates = "threshold time_interval time" ;
n1:cell_methods = "time: sum" ;
float threshold ;
threshold:standard_name = "air_temperature" ;
threshold:units = "degC" ;
float time_interval ;
time_interval:standard_name = "time_interval" ;
time_interval:units = "hour" ;
float time ;
time:bounds = "time_bounds" ;
time:units = "days since 1960-01-01" ;
float time_bounds(nv) ;
data: // time coordinates translated to date/time format
threshold = 0.0 ;
time_interval = 6.0 ;
time = "1960-04-16 00:00" ;
time_bounds = "1960-04-01 00:00", "1960-05-01 00:00" ;
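The requirement in the suggested definition that the time span be the sum of a whole number of time intervals can be checked mechanically. The following is a minimal sketch using the bounds and the 6 hour interval from the example above; the function name intervals_in_span is hypothetical and not part of CF.

```python
from datetime import datetime, timedelta


def intervals_in_span(lower, upper, interval):
    """Return how many whole intervals fit between the time bounds.

    Raises ValueError if the span is not an exact multiple of the interval.
    """
    span = upper - lower
    if span % interval != timedelta(0):
        raise ValueError("time span is not a whole number of time intervals")
    return span // interval


# Time bounds and time_interval taken from the April 1960 example.
lower = datetime(1960, 4, 1)
upper = datetime(1960, 5, 1)
print(intervals_in_span(lower, upper, timedelta(hours=6)))  # 120
```

The result is also the upper limit for any value of n1 in that example, since the count cannot exceed the number of intervals tested.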
Here's an example of the number of occurrences of air temperature being below 0 degreesC for each of the four 6 hour time intervals of the diurnal cycle during April 1960:
dimensions:
lat = 73 ;
lon = 96 ;
time = 4 ;
nv = 2 ;
variables:
float n1(time, lat, lon) ;
n1:standard_name = "number_of_time_intervals_with_air_temperature_strictly_below_threshold" ;
n1:coordinates = "threshold time_interval time" ;
n1:cell_methods = "time: minimum within days time: sum over days" ;
float threshold ;
threshold:standard_name = "air_temperature" ;
threshold:units = "degC" ;
float time_interval ;
time_interval:standard_name = "time_interval" ;
time_interval:units = "hour" ;
float time(time) ;
time:bounds = "time_bounds" ;
time:units = "days since 1960-01-01" ;
float time_bounds(time, nv) ;
data: // time coordinates translated to date/time format
threshold = 0.0 ;
time_interval = 6.0 ;
time = "1960-04-01 03:00", "1960-04-01 09:00", "1960-04-01 15:00", "1960-04-01 21:00" ;
time_bounds = "1960-04-01 00:00", "1960-04-30 06:00",
"1960-04-01 06:00", "1960-04-30 12:00",
"1960-04-01 12:00", "1960-04-30 18:00",
"1960-04-01 18:00", "1960-05-01 00:00" ;
All the best, David
Hi David,
Both your very illustrative examples (thanks for them, they are most helpful!) are about air temperature during 6 hour time intervals. The crux is what does "air temperature" mean in this context -- is it the average or something else (minimum, maximum...)? And then sampled, or simulated, at what frequency?
Dear @larsbarring and @davidhassell
I agree with David that cell_methods, as shown in his example, can handle this situation, if air temperature is an instantaneous quantity. I would understand it to mean the number of time intervals during which the temperature was continuously above the threshold. If some other statistical processing (mean, max, min etc.) is needed before comparison with the threshold, then we need to extend cell methods to record double time-processing of the same time dimension. This possible need has been discussed before (a long time ago, in Trac I expect). In that case we would use the interval comment to record the original sampling before the first time-processing.
Do we need to require the time span to be a multiple of the time interval? Maybe we could say instead that the minimum time bound is the start of the first time interval, and the last time interval could be incomplete.
I can't think of another synonym to try for "interval" yet!
Best wishes
Jonathan
Hi Lars, Jonathan,
I don't see the problem with non-instantaneous quantities. If the quantity is "the number of 6-hourly intervals for which the mean air temperature is strictly below the threshold for each 6-hour interval in the diurnal cycle of April", and each 6 hour mean was calculated from 1/2 hourly data: couldn't we just have the second example the same apart from a new cell methods of:
n1:cell_methods = "time: mean within days (interval: 30 minutes) time: sum over days" ;
Am I missing something? Thanks, David
Dear @davidhassell
I was too hasty in agreeing with your second example, actually. I don't think we ought to use within days/over days in this case, because it might not have anything to do with days or years i.e. it is not necessarily climatological. Lars might want to count the occurrences of mean temperature in 36-hour intervals, for example. At the moment double time-processing is only allowed for climatological time, but for generality we want cell_methods="time: mean (interval: original spacing) time: sum". The first time-processing calculates the mean, the second counts the occurrences over threshold.
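The double time-processing described here can be sketched in two steps: average the original samples within each time interval, then count the intervals whose mean satisfies the comparison. This is only an illustration under stated assumptions: the function name is hypothetical, the comparison is the strictly-below case, and a trailing incomplete interval is simply ignored.

```python
def count_interval_means_below(samples, samples_per_interval, threshold):
    """Count time intervals whose mean is strictly below `threshold`.

    First processing:  mean of the original samples within each interval.
    Second processing: sum (count) of the intervals meeting the condition.
    Assumption: a trailing incomplete interval is ignored.
    """
    n_complete = len(samples) // samples_per_interval
    count = 0
    for i in range(n_complete):
        window = samples[i * samples_per_interval:(i + 1) * samples_per_interval]
        mean = sum(window) / len(window)
        if mean < threshold:
            count += 1
    return count


# Hypothetical data: 7 samples, 2 samples per interval, threshold 0.0.
# Interval means are -1.0, 1.0 and -1.0; the last sample is an incomplete interval.
samples = [-1.0, -1.0, 1.0, 1.0, -2.0, 0.0, 5.0]
print(count_interval_means_below(samples, 2, 0.0))  # 2
```

For 36-hour intervals built from half-hourly data, samples_per_interval would be 72; nothing in the sketch depends on the interval dividing a day.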
Best wishes
Jonathan
Dear Jonathan,
Perhaps there may have been slightly crossed purposes - I was using the climatology formulation to encode an actual climatology, rather than the original spacing. I agree with your example (time: mean (interval: original spacing) time: sum) for capturing the original spacing in the non-climatology case, but by extension I would have thought that time: mean within days (interval: original spacing) time: sum over days works for the climatology case - it is (e.g.) the number of occurrences in April that the 06:00-to-12:00 mean is below the threshold. This clearly wouldn't work for 36 hour time intervals, but that is a feature of cell method climatologies rather than these standard names.
All the best, David
Dear David
Yes, I see. Thanks. The use of the climatological within days and over days would be possible for reporting the number of exceedances of mean etc. temperature, but only if the daily period divides exactly by the sub-daily interval, since the climatological day depends on the existence of a repeating daily cycle. So it could be used for four-hour or six-hour periods, but not for five-hour ones, for instance, or for any interval longer than a day. Hence I think we may need to introduce the generalisation to double time-processing. In that case, the interval comment of the second processing, with the sum method, would record the time interval for comparison with the threshold.
If we made this generalisation, we would no longer need the climatological day convention of cell methods (although we couldn't remove it, because of backward compatibility). The climatological year would still be needed, however, because years are of varying length, and the sub-annual periods of interest are often based on months, which are not constant time-intervals.
Best wishes
Jonathan
I wrote:
In that case, the interval comment of the second processing, with the sum method, would record the time interval for comparison with the threshold
but perhaps I shouldn't have made that remark! Although this could be done, I think we agreed we thought it was unsatisfactory to record this essential information in a comment. That's why we wanted the auxiliary coordinate to do it.
With apologies for so many postings, I would like to add a suggestion, continuing the above train of thought. We could define a new cell methods syntax "name: method over var", where var is a scalar coordinate variable that supplies the size of the interval over which method is applied to the name dimension. This is an alternative to the interval comment, but a scalar coordinate variable is more prominent, easier to process and better described than a comment in a string attribute - though perhaps less readable by humans.
However, this change would possibly break our principle that we shouldn't add a new way of doing something, even if it's better, when we've already got a way. That principle suggests that, after all, we should depend on the interval comment. The new way would also allow threshold comparisons over several time-intervals to be contained in the same data variable, by using a multivalued coordinate variable instead of a scalar one for the time-interval - but that is a generalisation for which we don't have a use case, as far as I know, so again it's not a valid reason.
During the CF2021 workshop breakout discussion on climate indices we concluded that it would be useful to pull out the second point of the first suggestion in the initial post:
Distinguish between strict comparisons (ie. < and >) and non-strict comparisons (i.e. ≤ and ≥), cf. New standard names for non-strict comparison with threshold cf-convention/vocabularies#31 for details.
and refer it back to cf-convention/vocabularies#31 to have this specific suggestion implemented to the relevant existing standard names independent of the outcome of this more complex issue.
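The practical effect of the strict versus non-strict distinction shows up whenever data values sit exactly on the threshold, which is common for low-precision or rounded data. A minimal sketch follows; the dictionary keys mirror the proposed name elements, while the function name and the sample values are made up for illustration.

```python
import operator

# Map the proposed name elements to the comparison they imply.
COMPARISONS = {
    "strictly_below": operator.lt,
    "at_or_below": operator.le,
    "strictly_above": operator.gt,
    "at_or_above": operator.ge,
}


def count_intervals(values, threshold, condition):
    """Count the values (one per time interval) that satisfy the named condition."""
    compare = COMPARISONS[condition]
    return sum(1 for v in values if compare(v, threshold))


# Hypothetical daily minimum temperatures in degC; two values equal the threshold.
tmin = [-1.5, 0.0, 0.0, 2.3, -0.2]

print(count_intervals(tmin, 0.0, "strictly_below"))  # 2
print(count_intervals(tmin, 0.0, "at_or_below"))     # 4
```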
This issue has had no activity in the last 30 days. This is a reminder to please comment on standard name requests to assist with agreement and acceptance. Standard name moderators are also reminded to review @feggleton @japamment
Proposer's name Lars Bärring
Date 2021-05-25
Background
Over the years the prospects for using the CF Conventions to describe various types of derived statistics (aka climate indices or climate indicators) have been recurrently discussed in CF email list threads after the extensive conversation back in 2006-2007 (cf. relevant starting point). Since back then the concept of climate indices/indicators has evolved substantially. The many CF email list threads are a sign of the recurring wish to express these new concepts using the CF Conventions. However, the conversation often spread out into discussions of many different aspects with few concrete conclusions with respect to general guidance regarding how to apply the CF Conventions. In this issue I will try to collect some of the ideas and suggestions from several of these email threads.
As a result of the initial conversation in 2006-2007 the following two groups of standard names were introduced:
- number_of_days_with_X_above|below_threshold (canonical unit: 1)
- spell_length_of_days_with_X_above|below_threshold (canonical unit: day (sic))
While these two groups may seem rather disparate and connected only in that they employ thresholds, they are in some sense connected. This will become more clear in the following discussion regarding generalizations and extensions.
Suggested generalizations/changes and extensions
1. number_of_days_with_X_above|below_threshold (deprecation)
These two suggestions point towards standard names following the pattern number_of_occurrences_with_X_strictly_above|below_threshold or number_of_occurrences_with_X_at_or_above|below_threshold. However, from a user perspective there is still a problem with these constructs: the canonical unit is 1 (and not day or hour). While the 1 is semantically consistent with the phrase number of..., users are confused when confronted with this unit in automatically labeled graphs or other output, which was previously touched upon in this email list conversation, and recently resurfaced in an off-line conversation. Hence, the following suggestion:
total_duration_. A "duration" is clearly associated with a time unit, and "total" indicates that several separate events may be joined together. A data variable having such a standard name would normally have unit days or hours etc. according to context and resolution of input data. But during further processing this may (accidentally) change to any other unit of duration (e.g. the canonical unit second). The temporal resolution, i.e. the unit used for discretisation of the duration, must therefore be recorded in the cell_method construct (interval: T). This 'discretisation unit' is what basically transforms the counting operation to a summation.
Based on this I would like to suggest five currently existing standard names (v.77) should be deprecated in favour of standard names following the pattern total_duration_of_X_strictly_above|below_threshold, canonical unit second, and total_duration_of_X_at_or_above|below_threshold, canonical unit second. Alternatively total_duration_of_intervals_with_X_strictly_above|below_threshold, canonical unit second, and total_duration_of_intervals_with_X_at_or_above|below_threshold, canonical unit second.
.first|last_occurrence_of_X_....
or `first|last_interval_with_X_....` (new)

Related to summing the duration above/below some threshold, there is a range of use cases for recording the first or last date/time (in the year, season, month, day, ...) when the threshold was exceeded. Referring to the original standard names `number_of_days_with_X_...`, the date/time would typically be recorded as day_of_year or similar, cf. this conversation, which as far as I can judge did not arrive at a conclusion or recommendation with respect to the CF Conventions. A related earlier thread focuses more on the reference time, which is an important aspect of what is discussed here. The climate index/indicator data is calculated per period (year, season or month), where this period is defined by the bounds of the time coordinate of the data variable. Framed this way, the date/time of the first/last occurrence is a duration since the time specified by the lower bound of the corresponding time coordinate. As such the canonical unit is `second` (in practice it might be `day` or `hour`). In the context of climate indices/indicators the lower bound of the time coordinate is a natural 'reference time', which should be stated in the explanation of the standard name. As suggested in the previous point, the temporal resolution must be recorded in the `cell_methods` construct `(interval: T)`.

Based on this I would like to suggest the following new standard name patterns: `first|last_occurrence_of_X_strictly_above|below_threshold`, canonical unit `second`, and `first|last_occurrence_of_X_at_or_above|below_threshold`, canonical unit `second`; alternatively `first|last_interval_of_X_strictly_above|below_threshold`, canonical unit `second`, and `first|last_interval_of_X_at_or_above|below_threshold`, canonical unit `second`.

3. `spell_length_of_days_with_X_above|below_threshold`
(deprecation)

A spell is a contiguous period of X above|below threshold (such as a wet/dry spell or a heat/cold wave), which in the case of climate indices is typically the longest spell during a period (year, season, month), even though one could of course think of other methods like minimum or mean, where the method is specified in the `cell_methods` attribute. `second`. A spell length is by definition a duration, and irrespective of whether the standard name is changed as suggested or not, the canonical unit for a duration is seconds. As in the previous two points, the temporal resolution must be recorded in the `cell_methods` construct `(interval: T)`.

Based on this I would like to suggest that the four currently existing standard names (v.77) following the pattern `spell_length_of_days_with_X...` be deprecated in favour of standard names following the pattern `spell_length_of_X_strictly_above|below_threshold`, canonical unit `second`, or `spell_length_of_X_at_or_above|below_threshold`, canonical unit `second`.

4. `beginning|end_of_spell_with_X_....`
(new)

Analogous to the second point, there are use cases for analysing when during a period the spell begins/ends. The technical details given under point 2 apply here, so I move directly to suggesting these new standard name patterns: `beginning|end_of_spell_with_X_strictly_above|below_threshold`, canonical unit `second`, and `beginning|end_of_spell_with_X_at_or_above|below_threshold`, canonical unit `second`.

Once we have discussed the standard name patterns suggested here and (hopefully) reached consensus, I will look into the existing standard names and use cases to suggest specific standard names and explanations/definitions. These explanations will contain technical details regarding cell_methods, how to specify the temporal resolution, and the relationship between the unit used for the duration and the reference time. In all four points above I suggest distinguishing between strict and non-strict comparisons, as well as including both "above" and "below". However, we should not add specific standard names until there is a concrete use case.
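As an illustration of how the four patterns relate to each other, the following is a minimal sketch in plain Python of how the corresponding values could be derived for a single period from a regularly sampled series. It is not part of the proposal: the function `threshold_indices`, its parameter names and the sample data are hypothetical. It assumes a constant sampling interval T (here `step_seconds`), uses the strict comparison (`>`), expresses all results in the canonical unit `second` relative to the lower bound of the period, and takes the start of an interval as the time of an occurrence and the end of the last interval as the end of a spell. Those last two choices are assumptions that the definitions would have to settle.

```python
from itertools import groupby

SECONDS_PER_DAY = 86400


def threshold_indices(values, threshold, step_seconds=SECONDS_PER_DAY):
    """Derive threshold-based indices for one period (hypothetical helper).

    values: samples at a constant interval, the first one starting at the
    lower bound of the period. Uses the strict comparison "above".
    All durations and offsets are returned in seconds.
    """
    above = [v > threshold for v in values]
    hits = [i for i, flag in enumerate(above) if flag]

    # Point 1: counting intervals and multiplying by the interval length
    # turns the count into a summed duration.
    total_duration = len(hits) * step_seconds

    # Point 2: first/last occurrence as an offset from the lower time bound
    # (assumption: the start of the interval marks the occurrence).
    first = hits[0] * step_seconds if hits else None
    last = hits[-1] * step_seconds if hits else None

    # Points 3 and 4: longest contiguous run; the earliest run wins ties.
    best_start, best_len, pos = None, 0, 0
    for flag, run in groupby(above):
        n = len(list(run))
        if flag and n > best_len:
            best_start, best_len = pos, n
        pos += n

    has_spell = best_start is not None
    return {
        "total_duration": total_duration,
        "first_occurrence": first,
        "last_occurrence": last,
        "spell_length": best_len * step_seconds,
        "beginning_of_spell": best_start * step_seconds if has_spell else None,
        "end_of_spell": (best_start + best_len) * step_seconds if has_spell else None,
    }


# Hypothetical daily maximum temperatures (degC) for a 10-day period.
tmax = [24.0, 26.1, 27.3, 25.0, 24.5, 25.2, 26.8, 28.0, 29.1, 23.9]
result = threshold_indices(tmax, threshold=25.0)
print(result)
```

With this sample the fourth value (exactly 25.0) does not count under the strict comparison; replacing `>` with `>=` would give the `at_or_above` variant and merge the first spell with that day. Converting the results to `day` or `hour` is a pure unit change, which is why the discretisation interval needs to be recorded separately.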
Finally, I should mention that there are two other groups of climate indices/indicators that share some aspects of those presented here. But they are sufficiently different (and more complex) in their technical details that I do not include them here. Instead they will be covered in separate issues (later), but I mention them here for reference. The first group is in some sense similar to those in point 1, with two important differences: the `unit` is "fraction_of_year", and the threshold is a spatially varying threshold calculated as a percentile value based on a reference period. The second group is the count of all days belonging to spells of at least a certain duration, where the spell is based on a percentile threshold calculated in the same way as for the previous group.

Ping (previous conversations) @huard, @aulemahal, @zklaus, @pagecp, @japamment, @martinjuckes, @davidhassell