cf-convention / vocabularies

Issues and source files for CF controlled vocabularies

New standard names for non-strict comparison with threshold #31

Open larsbarring opened 3 years ago

larsbarring commented 3 years ago

Proposer's name Lars Bärring

Date 2021-05-05

Background There is a group of standard names based on counting the number of days, or the length of a spell, on which a quantity is above or below some threshold: number_of_days_with_X_above|below_threshold and spell_length_of_days_with_X_above|below_threshold.

For certain applications originally based on observational data there are, however, use cases for having a limited number of standard names specifying a non-strict comparison operator (≤ or ≥). This was discussed before in an email thread, and before that in an email thread referenced therein. There was no clear conclusion as far as I can tell (except possibly that the strict comparisons may be used because, judging from a hands-on example involving temperature, the choice might not make much difference in practice).

But I would like to come back to this for several reasons:

Currently we do have use-cases for the following standard names:
number_of_days_with_air_temperature_at_or_above_threshold
number_of_days_with_air_temperature_at_or_below_threshold
number_of_days_with_lwe_thickness_of_precipitation_amount_at_or_above_threshold
number_of_days_with_surface_snow_thickness_at_or_above_threshold
number_of_days_with_wind_speed_at_or_above_threshold
spell_length_with_air_temperature_at_or_above_threshold
spell_length_with_air_temperature_at_or_below_threshold
spell_length_with_lwe_thickness_of_precipitation_amount_at_or_above_threshold

These new standard names closely follow existing standard names with the distinctive addition of "_at_or". This means that the description of the suggested standard names can be closely patterned after the existing ones.

For the number_of_days_... names this means that the minimal change is adjusting one sentence by adding "_at_or": A variable whose standard name has the form number_of_days_with_X_at_or_below|above_threshold is a count of the number of days on which the condition X_at_or_below|above_threshold is satisfied.
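To make the difference concrete, here is a minimal sketch of the two counts. The helper `count_days` and the temperature series are hypothetical and only illustrate the comparison; they are not taken from any CF tool. The values mimic observations recorded to whole degrees, where exact equality with the threshold is common.

```python
from typing import Sequence


def count_days(values: Sequence[float], threshold: float, *, above: bool, strict: bool) -> int:
    """Count the days on which the value satisfies the threshold condition.

    strict=True  corresponds to X_above|below_threshold read as > or <.
    strict=False corresponds to X_at_or_above|below_threshold (>= or <=).
    """
    def satisfied(v: float) -> bool:
        if above:
            return v > threshold if strict else v >= threshold
        return v < threshold if strict else v <= threshold

    return sum(1 for v in values if satisfied(v))


# Hypothetical daily maximum temperatures (degC), recorded to whole degrees.
tasmax = [24.0, 25.0, 26.0, 25.0, 23.0, 25.0, 27.0]
threshold = 25.0

strict_count = count_days(tasmax, threshold, above=True, strict=True)
non_strict_count = count_days(tasmax, threshold, above=True, strict=False)
print(strict_count, non_strict_count)
```

With coarsely recorded data like this, the two counts differ by the number of days that sit exactly on the threshold.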

Similarly, for the spell_length_... names the minimal change required is: A spell is the number of consecutive days on which the condition X_at_or_below|above_threshold is satisfied. A variable whose standard name has the form spell_length_of_days_with_X_at_or_below|above_threshold must have a coordinate variable or scalar coordinate variable with the standard name of X to supply the threshold(s).
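The spell definition can be sketched in the same way. The function `spell_lengths` below is a hypothetical helper that returns the length of every run of consecutive days satisfying a condition; how a dataset reduces these runs to a single value (for example, the longest spell) is an assumption left to the data producer.

```python
from typing import Callable, List, Sequence


def spell_lengths(values: Sequence[float], condition: Callable[[float], bool]) -> List[int]:
    """Return the lengths of all runs of consecutive days satisfying `condition`."""
    lengths: List[int] = []
    run = 0
    for v in values:
        if condition(v):
            run += 1
        else:
            if run:
                lengths.append(run)
            run = 0
    if run:  # close a spell that lasts until the end of the series
        lengths.append(run)
    return lengths


# Hypothetical daily minimum temperatures (degC), recorded to whole degrees.
tasmin = [1.0, 0.0, -1.0, 0.0, 2.0, -2.0, -3.0]
threshold = 0.0

strict_spells = spell_lengths(tasmin, lambda v: v < threshold)       # below_threshold
non_strict_spells = spell_lengths(tasmin, lambda v: v <= threshold)  # at_or_below_threshold
print(strict_spells, non_strict_spells)
```

In this series the days at exactly 0 degC join two separate strict spells into one longer non-strict spell, so the choice of operator changes both the number and the length of spells.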

Cf. cf-convention/vocabularies#137 for correction of a small typo, which otherwise might be copied over to the descriptions of these suggested standard names.

JonathanGregory commented 3 years ago

Dear @larsbarring

Thanks for the proposal. I think the argument and the proposed names are clear, so I support their introduction, since you have use-cases for them. Do you propose that the definitions of existing standard names which don't contain at_or should be clarified to say that they exclude equality? One might argue that this was not backward-compatible, since it might change the meaning of data already archived, if "above" or "below" had been understood to include equality. Probably it's OK, since the writers of the data must have thought of that, and didn't object or didn't think it mattered. Alternatively the existing names could all be deprecated, and replaced with aliases which indicate explicitly that equality is excluded.

Best wishes

Jonathan

larsbarring commented 3 years ago

Dear @JonathanGregory

Thank you for your comment and for pointing out possible implications for the existing standard names without at_or. I had thought about this a bit, but not enough, so your putting the focus on it is appreciated. Looking back at the email thread, it seems very likely that there exist datasets using the strict standard name even where non-strict comparisons were applied.

I have no clear preference between deprecating the existing standard names based on X_above|below_ in favor of X_strictly_above|below_ (or equivalent), and adding text to the descriptions to clarify and cross-reference the alternative standard names. The former, deprecation, makes the difference clear and prevents future mistakes and confusion, but is a more far-reaching change.
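As a rough illustration of the deprecation option, the sketch below derives both precise variants from an existing name. The function `precise_variants` is hypothetical, and the `_strictly_` infix is only one possible spelling ("or equivalent" above), so treat the output names as assumptions rather than agreed standard names.

```python
import re

# Matches existing names of the form <stem>_above_threshold or <stem>_below_threshold.
_PATTERN = re.compile(r"^(?P<stem>.+)_(?P<direction>above|below)_threshold$")


def precise_variants(name: str) -> dict:
    """Return hypothetical strict and non-strict replacements for an existing name."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a threshold-comparison name: {name!r}")
    stem, direction = match["stem"], match["direction"]
    if stem.endswith(("_at_or", "_strictly")):
        raise ValueError(f"name is already precise: {name!r}")
    return {
        "strict": f"{stem}_strictly_{direction}_threshold",
        "non_strict": f"{stem}_at_or_{direction}_threshold",
    }


variants = precise_variants("number_of_days_with_air_temperature_above_threshold")
print(variants["strict"])
print(variants["non_strict"])
```

A mapping like this could also be used to add cross-references to the descriptions if the existing names are kept rather than deprecated.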

JonathanGregory commented 3 years ago

Thanks, Lars. Let's see what others think. There are only 10-20 of these quantities, aren't there, so it would be quite a minor change compared to some which have been made before.

larsbarring commented 3 years ago

Put this on hold awaiting outcome of cf-convention/vocabularies#14

larsbarring commented 3 years ago

The discussion in the CF2021 workshop breakout group on climate indices (ping @japamment, @zklaus, @bzah, @jesusff) concluded that it would be better to pull out the "strict/non-strict" suggestion from issue cf-convention/vocabularies#14 and back to this issue. The following is a summary of the comments in cf-convention/vocabularies#14:

Initial statement:

Distinguish between strict comparisons (i.e. < and >) and non-strict comparisons (i.e. ≤ and ≥), cf. "New standard names for non-strict comparison with threshold" (cf-convention/vocabularies#31) for details.

@JonathanGregory comments

I agree with introducing the distinction between strictly_above|below_threshold and at_or_above|below_threshold. Which choice would be made for aliasing the affected existing standard names?

and I answer:

Can we in the old standard name definitions add something like "This standard name is deprecated in favor of ..._strictly_above|below_... or ..._at_or_above|below_... depending on what is most suitable for the dataset at hand." or "This standard name is deprecated in favor of ..._strictly_above|below_... or ..._at_or_above|below_... depending on what is most suitable for the dataset at hand. If it is not possible or relevant to make a distinction between these alternatives it is suggested to use ..._strictly_above|below_..., which is generally relevant for high-precision (floating point) data." This can, I am sure, be written more elegantly.

to which Jonathan responds:

Thanks for your responses. On the first one, I agree that it's fine to keep the original names in their own right and not make them aliases. I would suggest recommending the old names in the case when the distinction is undefined or irrelevant (rather than recommending one of the new precise names, as you propose).

and I comment:

I think that it might be more confusing than helpful to keep the old standard name for two reasons. Firstly, someone is producing new datasets (an analyst manually, or more automatically in a workflow). Somewhere at this stage the decision has to be made whether to use a strict or non-strict comparison. For new datasets I can see no reason not to be precise about this decision. True, for some datasets it does not make much of a difference, in which case the recommendation should be to use one of the precise ones (I suggested the strict alternative), not to use an imprecise one. <...> As far as I understand, CF always tries to avoid overlaps and duplication of different elements. Is there a strong use case for keeping the old ones?

and Jonathan answers

I accept your argument for recommending use of the precise threshold-comparing names in future, and correspondingly for deprecating the existing vague ones (although they will remain in their own right, and not as aliases).

With this conversation I think we are approaching agreement and may move on to suggest concrete wordings.

larsbarring commented 3 years ago

The detailed definition of each standard name should be coordinated with issue cf-convention/vocabularies#19 (Standard names: *_threshold, allow for percentile based thresholds).

github-actions[bot] commented 1 year ago

This issue has had no activity in the last 30 days. This is a reminder to please comment on standard name requests to assist with agreement and acceptance. Standard name moderators are also reminded to review @feggleton @japamment

github-actions[bot] commented 1 month ago

This issue has had no activity in the last 30 days. Accordingly:

Standard name moderators are also reminded to review @feggleton @japamment @efisher008