erget commented 1 month ago

Original impetus for this was https://github.com/cf-convention/discuss/issues/258 - this issue is now for the implementation!

The Conventions document could be made clearer by removing ambiguities around certain words. BCP 14 handles this in a way that is simple and clear. It's straightforward to adopt BCP 14 or to be inspired by it in such a way that users profit, similarly to how we've been inspired by Semantic Versioning without adopting it wholesale.

We believe that this can be implemented by mid-2025. As soon as we've implemented it, all future pull requests will profit. We expect it will first be merged in CF-1.13.

If you want to work on this, please self-assign or ping me - that'll help us keep track. These people have participated in the discussions up till now (I may be forgetting someone, sorry!): @mraspaud @davidhassell @JonathanGregory @larsbarring @cofinoa @feggleton @DocOtak

I will keep this issue up to date as multiple PRs will likely be required in order to implement this.

Steps to complete

[x] @larsbarring will post a version of the Conventions with annotations on the BCP 14 controlled vocab as well as "extended vocab" that we should consider rewording to match BCP 14. In the hackathon we also discussed augmenting the extended vocab with "Suggest, allow, permit, forbid, prohibit".
[ ] We decide whether we want to adopt BCP 14 or simply get inspired by it. The main question at the moment is whether we want to use all caps on controlled vocab, as is REQUIRED by BCP 14. Some people like that and others aren't so sure; we should try it and see how we like it.
[ ] We pen a text stating how we are using BCP 14. Are we using it wholesale? Do we extend it to additional words? Do we use it without uppercasing? @feggleton has expressed interest in contributing to this. Then potentially in parallel:
[ ] We divide up the Conventions document and check the occurrences of the controlled vocab, rewording if necessary. Probably it makes sense to gather a coalition of the willing and work in parallel, merging into a single branch. Currently there are ~1k occurrences so this is a tractable problem as long as we don't allow CF to be rewritten several times by AIs.
[ ] We develop a pre-merge action to check for use of controlled vocab and highlight that, asking the user to confirm that we're using any introduced controlled terms consistently.

The pre-merge action could be something like (very draft):

#!/bin/bash

# Are we on a pull request?
if [ -z "$GITHUB_HEAD_REF" ]; then
  echo "This script is meant to run on a pull request."
  exit 1
fi

TARGET_BRANCH=${GITHUB_BASE_REF:-main}

# Controlled vocab as an extended regex (placeholder list, to be agreed on)
vocab='MUST( NOT)?|SHALL( NOT)?|SHOULD( NOT)?|REQUIRED|RECOMMENDED|MAY|OPTIONAL'
vocab_found=""

# Files changed relative to the target branch, one per line
diff_output=$(git diff origin/"$TARGET_BRANCH"... --name-only)
while IFS= read -r file; do
  # Skip the empty line produced when nothing changed
  [ -n "$file" ] || continue

  # Get added lines in each file (drop the "+++ b/file" header line)
  added_lines=$(git diff origin/"$TARGET_BRANCH"... --unified=0 -- "$file" | grep -E '^\+' | grep -vE '^\+\+\+')

  # Search for controlled vocab as whole words, case-insensitively
  if echo "$added_lines" | grep -qiwE "$vocab"; then
    vocab_found=1
    echo "Controlled vocabulary found in $file:"
    echo "$added_lines" | grep -iwE "$vocab"
    echo
  fi
done <<< "$diff_output"

if [ -n "$vocab_found" ]; then
  echo "Controlled vocabulary was found in your changes."
  echo "Please verify that these words are used in line with the guidelines set forth in:"
  # Would need to make this link point to the right section which doesn't exist yet!
  echo "https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#_overview"
  exit 1
else
  echo "No controlled vocabulary found in added lines."
fi
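
For anyone who would rather test the matching logic without a git checkout, the same idea can be sketched in Python. This is only an illustration: the function name `find_vocab_in_added_lines` is made up, and the keyword list is simply the BCP14 list posted further down in this thread, not an agreed one.

```python
import re

# Assumed keyword list (the BCP14 list from this thread). Negated phrases
# come first so that "MUST NOT" is reported rather than a bare "MUST".
BCP14 = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT",
    "MUST", "REQUIRED", "SHALL", "SHOULD",
    "RECOMMENDED", "MAY", "OPTIONAL",
]

VOCAB_RE = re.compile(
    r"\b(" + "|".join(p.replace(" ", r"\s+") for p in BCP14) + r")\b",
    re.IGNORECASE,
)


def find_vocab_in_added_lines(diff_text):
    """Return (added line, keyword) pairs found in a unified diff."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++ b/file" is the file header.
        if line.startswith("+") and not line.startswith("+++"):
            content = line[1:]
            for match in VOCAB_RE.finditer(content):
                keyword = " ".join(match.group(0).upper().split())
                hits.append((content, keyword))
    return hits


example_diff = """--- a/ch04.adoc
+++ b/ch04.adoc
@@ -1 +1,2 @@
-The axis attribute may be used.
+The axis attribute must not be attached twice.
+Plain sentence.
"""

for content, keyword in find_vocab_in_added_lines(example_diff):
    print(f"{keyword}: {content}")
```

Working on diff text rather than calling git directly keeps the check easy to unit test; a CI job would still need to feed it the output of `git diff`.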

erget commented 1 month ago

@sadielbartholomew welcome aboard 😊

DocOtak commented 1 month ago

I am of the somewhat strong opinion that if we want to adopt BCP 14, it SHOULD be done without extensions or as something inspired by it. I feel the point is to add the same rigor (or feeling of) that I've come to view IETF's RFCs with, and some rare OGC standards e.g. Coverage JSON. When viewing a document and I'm seeing those keywords in all caps, I know I'm dealing with BCP14 and don't need to go look at their own custom definitions. What would I say to someone who made their own data standard that was simply "inspired by" CF? Using the extended keyword list should be used only in identifying phrasing to consider modifying to use the BCP14 keywords.

Perhaps we could consider RFC 6919 next April 1.

larsbarring commented 1 month ago

@DocOtak, you do have a good point regarding "cherry-picking", and I agree. At the same time I think that the whole endeavour is somewhat broader in a first phase:

Clarify [for ourselves] what we actually mean when using the words from the BCP14 and "extended BCP14" lists. Is a "should" really a should or do we in fact mean shall or must (etc.)? And then update the text so that it is clear and consistent. At the same time I agree that we should (SHOULD or MUST?) try to replace "extended BCP14" words with those from BCP14, but that might not always be possible.

The question is then what to do with any remaining "extended BCP14" words, e.g. deprecated comes to mind. Do we want to leave them as is (without a specified meaning), or do we want to somehow clarify what we mean? I think that is a next step.

And, then, a last step would be to explore ways to render the relevant words in the documents.

larsbarring commented 1 month ago

Regarding the first step ("@larsbarring will post a version of the Conventions with annotations..."), there is now an HTML version available here:

https://github.com/larsbarring/cf-conventions/tree/my_bcp-14/BCP-14/conventions_build

Download the cf-conventions.html and open it in your browser. To get a quick first impression you can go to Chapter 4. Coordinate types. Clearly, the rendering is not intended to be used in the final document, only to highlight the words (hopefully in a color-neutral way) without changing typeface or capitalization.

The specific phrase list is as follows:

BCP14 = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT",
    "MUST", "REQUIRED", "SHALL", "SHOULD",
    "RECOMMENDED", "MAY", "OPTIONAL"
]
EXTENDED_BCP14 = [
    "NOT RECOMMENDED", "RECOMMENDS* NOT", "RECOMMENDS*",
    "NOT PERMITTED", "PERMITTED", "PERMITS*",
    "NOT REQUIRED", "NOT REQUIRES*", "REQUIRES*",
    "CAN NOT", "COULD NOT", "CAN", "COULD", "MIGHT",
    "NOT SUGGESTED", "SUGGESTED", "SUGGESTS*",
    "NOT ALLOWED", "ALLOWED", "ALLOWS*",
    "FORBIDDEN", "FORBIDS*", "PROHIBITED", "PROHIBITS*",
    "DEPRECATED", "HAVE TO"
]

Comments on these lists are most welcome.
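
As a rough illustration of how lists like these could drive the highlighting, here is a minimal sketch. It assumes the entries are regular-expression fragments (so `S*` means an optional trailing "S"), that matching is case-insensitive, and that longer phrases win over shorter ones. The helper names `compile_phrases` and `classify` are hypothetical and are not taken from the actual script.

```python
import re

BCP14 = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT",
    "MUST", "REQUIRED", "SHALL", "SHOULD",
    "RECOMMENDED", "MAY", "OPTIONAL",
]
EXTENDED_BCP14 = [
    "NOT RECOMMENDED", "RECOMMENDS* NOT", "RECOMMENDS*",
    "NOT PERMITTED", "PERMITTED", "PERMITS*",
    "NOT REQUIRED", "NOT REQUIRES*", "REQUIRES*",
    "CAN NOT", "COULD NOT", "CAN", "COULD", "MIGHT",
    "NOT SUGGESTED", "SUGGESTED", "SUGGESTS*",
    "NOT ALLOWED", "ALLOWED", "ALLOWS*",
    "FORBIDDEN", "FORBIDS*", "PROHIBITED", "PROHIBITS*",
    "DEPRECATED", "HAVE TO",
]


def compile_phrases(phrases):
    """Build one case-insensitive pattern, trying longer phrases first."""
    ordered = sorted(phrases, key=len, reverse=True)
    alternatives = [p.replace(" ", r"\s+") for p in ordered]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


BCP14_RE = compile_phrases(BCP14)
ALL_RE = compile_phrases(BCP14 + EXTENDED_BCP14)


def classify(text):
    """Return (phrase, 'BCP' or 'EBCP') for every phrase found in text."""
    results = []
    for match in ALL_RE.finditer(text):
        phrase = " ".join(match.group(0).split())
        kind = "BCP" if BCP14_RE.fullmatch(phrase) else "EBCP"
        results.append((phrase.lower(), kind))
    return results


sample = ("The attribute is not required, but it should not be empty; "
          "tools can ignore it and cannot fail.")
for phrase, kind in classify(sample):
    print(kind, phrase)
```

Note that "cannot" is not matched by this sketch, because the word boundary after `CAN` fails and `CAN NOT` needs whitespace between the two words.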

larsbarring commented 1 month ago

Hm, just returning to this issue after having ticked off the first item in the to do list in the initial post, I realise that in my earlier comment I inadvertently outlined a different order for the individual steps, and in particular moved up the fourth step

We divide up the Conventions document and check the occurrences of the controlled vocab, rewording if necessary ...

to be done after the first step that I just finished.

The reason for this is that I believe it is useful to review how the (EXTENDED_)BCP14 phrases are used in the text, and to look for opportunities to firm up the language so that the intended meaning is more explicit. And this could [preferably I would say] be done irrespective of whether we go all in on BCP14, or settle for "inspired by" (keeping a close eye to avoid being included in an updated RFC 6919).

What do you think @erget @mraspaud @davidhassell @JonathanGregory @cofinoa @feggleton @DocOtak @sadielbartholomew ?

erget commented 1 month ago

@larsbarring for me that's fine. Otherwise it's a bit overwhelming. Good idea.

I'm a bit late on chiming in but believe that we should, if we indeed use BCP14 as it seems likely we'll decide, use it unmodified. @DocOtak the reason that we wanted to look at the extended list of words is that by finding those words we might also identify sections of the Conventions that should be phrased in line with BCP14. As I understand it we are not considering some modified version of BCP14; however, @larsbarring kindly identified some other keywords that could help us bring the document into line with BCP14 when we set about it.

I propose that we set up a meeting soon in order to discuss the approach and divide up the review. I know time zones are hard and some people may need to chime in offline, which is fine, but I would attempt to find a slot that's as amenable for all as possible. To that end I'll send a Doodle around, hopefully tomorrow; I just need to find the time to set it up :)

JonathanGregory commented 3 weeks ago

Dear Daniel @erget, @larsbarring, Andrew @DocOtak et al.

Thanks a lot for producing the marked-up text, Lars. In case anyone wants to look without making their own, here's my copy of what you have produced.

As further information, I have counted the number of occurrences of each of the phrases, using your HTML classes to find them. I've put the results at the bottom of this post.

Here are some comments on the approach we might take:

I agree that we shouldn't extend the BCP vocabulary, as Daniel says, but the extended vocabulary is needed to define the task.
I believe that the definition of the BCP vocabulary is in RFC2119. Please correct me if that's not right. I notice that section 6 of that document, "Guidance in the use of these imperatives", says "[they] must be used with care and sparingly [my italics]. In particular, they MUST only be used where it is actually required [for various reasons given]." I think that is sensible. There are 1168 occurrences of the words or phrases of interest in the present text. If we converted all of them to BCP vocabulary, there would be a lot of them, as you can see from Lars's marked-up version. The CF convention would be shouting at us quite uncomfortably and rudely with SO MANY CAPITALS.
As an indication of how many times we should use the controlled vocabulary, I suggest that it shouldn't be much more than the total number of requirements and recommendations in the conformance document, which is about 200 (more than five times less than the number of [E]BCP phrases). I suggest that on a first pass we could implement BCP14 in the text to indicate what the conformance document says, and by default not use BCP14 for anything else in the text. I would argue that indicating anything in the text as a requirement or a recommendation which is not currently in the conformance document would be a material change in what the convention actually means, and would require a conventions enhancement issue. My example below suggests we might have quite a few of them.
Having said that, I would immediately like to point out why we might want to add requirements and recommendations. In the conformance document, we presently only list those things which can in practice be checked (either by CF-checker software, or by a human). Should uncheckable recommendations and requirements be marked up in the conventions text? If so, should they also be stated in the conformance document?
At the workshop, we discussed whether we could tone down BCP14 by using small caps. It wasn't clear how to do that in asciidoc. Is it really essential to use capitals? I expect the answer is Yes, for strict compliance with BCP14, but I would argue that our objective is not to adhere to BCP14, but to improve the clarity of the CF document. I acknowledge the argument that some readers will recognise the capitals as BCP14. However, not all will, so in any case we will have to explain it at the start. If we're doing that, why not choose our own method of highlighting the BCP14 vocabulary?
BCP14 depends on typesetting to indicate when a special word is being used as controlled vocabulary. Actually, RFC2119 remarks, "These words are often capitalized." However, RFC8174, 20 years later, says they're only special in upper case. For example, in the text from RFC2119 which I quoted, there is one "must" and one "MUST". I think this is a rather fragile convention! Also, it strikes me as inconsistent with accessibility requirements. What if you are listening to the CF convention (for example, to help you to fall asleep at night) instead of reading it? I suggest we should not use the controlled words at all except with their special meanings.
I don't think "should" is a good choice of word for a recommendation, because that word can be used, in ordinary language, for a requirement (especially if SHOUTED). Because of this ambiguity, we have sometimes needed to clarify "should" in the CF text by inserting "recommended". I find "shall" unclear as well. It can simply be a statement about the likely future. The fact that these words have BCP14 purposes does not magically prevent the ambiguity in the mind of a reader/listener.
To keep things simple, therefore, I suggest that for requirements we use "must [not]" and "required [not to]" only, and for recommendations only "[not] recommended", that we don't use these words in any other ways, and that we don't use "should" or "shall" at all in the text.
Furthermore, I suggest that we don't use BCP14 for anything which isn't a recommendation or a requirement. I don't see the need to indicate what else is possible or optional. Most of CF is optional.
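
To make the upper-case point concrete, here is a small sketch of the rule as described above: a keyword counts as controlled vocabulary only when it is written in capitals. The function name `split_by_case` is hypothetical, and the sample is the RFC2119 excerpt as quoted in this comment, not the full section.

```python
import re

KEYWORDS = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT",
    "MUST", "REQUIRED", "SHALL", "SHOULD",
    "RECOMMENDED", "MAY", "OPTIONAL",
]
PATTERN = re.compile(
    r"\b(" + "|".join(k.replace(" ", r"\s+") for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def split_by_case(text):
    """Separate keyword hits into upper-case (normative) and other (plain)."""
    normative, plain = [], []
    for match in PATTERN.finditer(text):
        word = match.group(0)
        # str.isupper() ignores spaces, so "MUST NOT" also counts as upper case.
        (normative if word.isupper() else plain).append(word)
    return normative, plain


quote = ("[they] must be used with care and sparingly. In particular, they "
         "MUST only be used where it is actually required")
normative, plain = split_by_case(quote)
print("normative:", normative)
print("plain:", plain)
```

The two lists differ only by capitalisation, which is exactly the information that is lost when the text is read aloud.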

Finally, as an example, here's the first four paragraphs of Sect 4, using bold for BCP14 words and italic for the extended BCP words detected by Lars. Also, I've abridged it a bit.

The commonest use of coordinate variables is to locate the data in space and time, but coordinates may be provided for any other continuous geophysical quantity (e.g. density, temperature, radiation wavelength, zenith angle of radiance, sea surface wave frequency) [...].

Four types of coordinates receive special treatment by these conventions: latitude, longitude, vertical, and time. ... We strongly recommend that a parametric (usually dimensionless) vertical coordinate variable should be associated, via standard_name and formula_terms attributes, with its explicit definition, which provides a mapping between its values and dimensional vertical coordinate values that can be uniquely located with respect to a point on the earth's surface.

Because identification of a coordinate type by its units is complicated ..., we provide two optional methods that yield a direct identification. The attribute axis may be attached to a coordinate variable and given one of the values X, Y, Z or T [...]. Alternatively the standard_name attribute may be used for direct identification. But note that these optional attributes are in addition to the required COARDS metadata.

To identify generic spatial coordinates we recommend that the axis attribute be attached to these coordinates and given one of the values X, Y or Z. The values X and Y for the axis attribute should be used to identify horizontal coordinate variables. If both X- and Y-axis are identified, X-Y-up should define a right-handed coordinate system [...]. We strongly recommend that coordinate variables be used for all coordinate types whenever they are applicable.

Following my own suggestions, this text might become as follows. Here, I've used italic to indicate text which I have reworded in order to adopt the BCP14 words in bold for recommendations and requirements, but optional things don't have any markup.

The commonest use of coordinate variables is to locate the data in space and time, but coordinates may be provided for any other continuous geophysical quantity (e.g. density, temperature, radiation wavelength, zenith angle of radiance, sea surface wave frequency) [...].

Four types of coordinates receive special treatment by these conventions: latitude, longitude, vertical, and time. ... It is strongly recommended to use standard_name and formula_terms attributes to associate any parametric (usually dimensionless) vertical coordinate variable with its explicit definition. The definition provides a mapping between the coordinate values and dimensional vertical coordinate values that can be uniquely located with respect to a point on the earth's surface.

Because identification of a coordinate type by its units is complicated ..., we provide two optional methods that yield a direct identification. A coordinate variable can have an axis attribute, which must have one of the values X, Y, Z or T [...]. Alternatively the standard_name attribute may be used for direct identification. But note that these optional attributes are in addition to the mandatory COARDS metadata.

To identify generic spatial coordinates it is recommended to attach the axis attribute to these coordinates, with one of the values X, Y or Z. The values X and Y for the axis attribute identify horizontal coordinate variables. If both X- and Y-axis are identified, it is recommended to construct X-Y-up as a right-handed coordinate system [...]. It is strongly recommended that coordinate variables be used for all coordinate types whenever they are applicable.

It's instructive to compare this with the conformance document, which has the following requirements for this section, and no recommendations:

1. The axis attribute may only be attached to coordinate variables and geometry node coordinate variables (Chapter 7).
2. The only legal values of axis are X, Y, Z, and T (case insensitive).
3. The axis attribute must be consistent with the coordinate type deduced from units and positive.
4. The axis attribute is not allowed for auxiliary coordinate variables.
5. A data variable must not have more than one coordinate variable with a particular value of the axis attribute.

The only one which matches is 2! I think 3 and 5 should be stated explicitly in the text. 1 and 4 aren't necessary as requirements here, I would say, because the rules about what attributes are allowed with each kind of variable are contained in Appendix A. We should instead have a general statement that Appendix A must be followed.

There are four recommendations in the text which aren't in the conformance document: use standard_name and formula_terms for parametric vertical coordinates, use axis for spatial coordinates, make X-Y-up a right-handed system, use coordinate variables where applicable. I think the first one is omitted because formula_terms is the only way to identify a parametric vertical coordinate, so this one can't be checked, although we could check whether standard_name is there if formula_terms is. The second one could be checked if we knew it was a spatial coordinate variable by other means; I think this should be included in the conformance document. The third one could be checked, but it's omitted from the conformance document because it requires "science" in the checker, which we don't expect of it. The fourth one can't be checked, because you can't tell whether a coordinate variable ought to be there but isn't; that would require the checker to read the data-writer's mind.

I'm sure there are plenty of issues like this which will need to be tackled in adopting BCP14. If this example is typical, it will entail a spring-cleaning. Based on how long I've spent writing this example, it would take a week or two of solid work, I guess, to draft the whole thing, and further time to debate and agree on it. It's certainly not trivial!

Best wishes

Jonathan

count	type	word or phrase
254	BCP	may
196	BCP	must
126	BCP	should
71	BCP	required
70	BCP	optional
43	BCP	recommended
14	BCP	shall
12	BCP	must not
11	BCP	should not
3	BCP	shall not
147	EBCP	can
49	EBCP	could
25	EBCP	might
23	EBCP	allowed
23	EBCP	allows
15	EBCP	allow
15	EBCP	permitted
15	EBCP	recommend
11	EBCP	require
11	EBCP	requires
7	EBCP	not allowed
6	EBCP	have to
4	EBCP	not require
3	EBCP	can not
3	EBCP	not permitted
3	EBCP	recommends
2	EBCP	could not
2	EBCP	prohibited
1	EBCP	forbidden
1	EBCP	permit
1	EBCP	permits
1	EBCP	suggest
30	J	cannot
18	J	mandatory
11	J	desirable/desired/desires
9	J	acceptable/accepted

The four "J" lines are some other words we might be concerned with.
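
For reference, a count like this can be reproduced with the Python standard library alone. The sketch below is not the script actually used: it assumes the marked-up phrases are wrapped in `<span>` elements, and the class names `bcp14` and `ebcp14` are placeholders for whatever classes the marked-up HTML really uses.

```python
from collections import Counter
from html.parser import HTMLParser


class PhraseCounter(HTMLParser):
    """Count the text of <span> elements that carry one of the given classes."""

    def __init__(self, classes):
        super().__init__()
        self.classes = set(classes)
        self.counts = Counter()
        self._stack = []  # one entry per open <span>: matched class or None

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            css = dict(attrs).get("class") or ""
            hit = next((c for c in css.split() if c in self.classes), None)
            self._stack.append(hit)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and self._stack[-1]:
            phrase = " ".join(data.lower().split())
            if phrase:
                self.counts[(self._stack[-1], phrase)] += 1


sample_html = (
    '<p>It <span class="bcp14">must</span> be set and '
    '<span class="ebcp14">can</span> be empty; it '
    '<span class="bcp14">MUST</span> exist.</p>'
)
counter = PhraseCounter(["bcp14", "ebcp14"])
counter.feed(sample_html)
counter.close()
for (css_class, phrase), count in counter.counts.most_common():
    print(count, css_class, phrase)
```

Phrases are lower-cased before counting, so "must" and "MUST" land in the same row, as in the table above.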

larsbarring commented 3 weeks ago

Dear @JonathanGregory

Thanks for this thorough analysis of the work ahead. I agree with all your points, which inspired a couple of further thoughts:

I agree that we should limit the use of SHOUTING. A good thing (perhaps the main one) is that the text will become clearer and more consistent, so that readers can better distinguish what is a requirement/recommendation and what is just plain language.
You suggest that we should have a look at the conformance document. Currently, my python code also looks through and marks the [E]BCP words therein, but no html version is produced. It would be easy to do using the same style as in the conventions document. I am not suggesting that we should implement BCP-14 in this document, but might it be helpful as a working document?
I think it would be useful to also highlight recommendations/requirements that are not possible to check. In fact, even more so because we are making these recommendations/requirements for good reasons, and if they are not automatically checkable it is even more important that users are aware of them.
I am all for trying to simplify the text to use as few alternatives as possible and reasonable.
Do you want the new words that you added, marked "J", to be added to the EXTENDED word list (this would be easy)?

Kind regards, Lars

JonathanGregory commented 3 weeks ago

Dear Lars

Thanks for your further points. I'm glad that we have similar views.

Yes, let's look at the use of vocabulary in the conformance document. We can easily make it conform better to BCP14 if we change the headings from "Requirements" and "Recommendations" to "Required" and "Recommended". Since these words are so clearly related, however (noun and adjective), maybe that's obvious.
I agree with you that we can make the conformance document more useful if we include all requirements and recommendations, even those which can't automatically be checked. That will also make the conventions and conformance documents consistent, which is good. In addition, we could add comments in the conformance document alone about how checks could be done, in the cases where it's not obvious.
Yes, I think the extra words I thought of would be useful to add to the list to be highlighted. Thanks.

Best wishes

Jonathan

DocOtak commented 3 weeks ago

@JonathanGregory @larsbarring Shortly before the CF workshop, I attended a rather large family event with a few generations of folks at it. At some point the topic of differences in text communication came up, prompted by actual misunderstandings in a big family group text message. The focus was mostly on misunderstandings caused by ellipsis (...). At my age, I was right around the cut-off between folks who interpreted ellipses as a pause in thought and those who read them as disinterest or even dismissiveness. We only briefly talked about all caps. But the general conclusion was that, for many folks younger than me, it was only interpreted as shouting if an entire sentence was in all caps; a single word in the middle of a sentence indicated importance or emphasis.

Basically, I wouldn't worry at this point in the process about too many all cap words shouting at you. I would wait until when we are further along and there are too many keywords that are causing confusion.

JonathanGregory commented 3 weeks ago

Dear Barna

Whether or not we call it "shouting", I find text uncomfortable to read if there are significant numbers of words with capitals in it. I don't know why this is, but I would rather avoid it by using a gentler kind of markup for significant words.

Best wishes

Jonathan

erget commented 2 weeks ago

I just wanted to chime in to say that I absolutely love @JonathanGregory 's idea of the CF Conventions as an audiobook. That would solve my insomnia in an instant! 😆

It sounds like we're indeed converging toward a good approach moving forward - we've got the rendered texts and some analysis performed on them, and my impression is that a spring cleaning is plausible and beneficial, as well as desired on the part of several people here. Status from where I'm standing:

We've decided that we want to use BCP 14 unmodified, and that the "EBCP" words are to help us ensure that we're actually saying what we mean. The rendering is still an open issue, although I believe we can address that a bit further down the road.
We're approaching a point where it would make sense to schedule a meeting to coordinate the work amongst those who are interested. Do you share this view?

If we think we're at the point where we can start dividing up the work, I'd propose that we meet and divide up chapters (it's clear that some people will need to contribute offline as it'll be hard to find an appointment that works for everybody, but I would plan it as accommodatingly as possible). We discuss how we want to go about the mechanics of the work - I would propose we agree who works on what chapters and all make pull requests associated with this issue, mark them WIP, find the keywords and make case-by-case decisions about them, and then reconvene in order to go through the changed material together until we're happy with what we write. Possibly repeat a few times if we can't do it all in one round. Thoughts?

If we go down that road I'd furthermore propose putting the draft pre-merge job in place so that already we're scrutinising our use of keywords before merging. That means that we hopefully won't introduce inconsistencies as we work on the next version of CF, and that we have to finish the BCP-ification before the next release.

Perhaps it makes sense to start as soon as 1.12 is released so we start from a fresh release without having to deal with complicated branching and merging in that context.

larsbarring commented 2 weeks ago

I agree with the overall strategy, which seems very sensible. But I think we agreed not to work on the current 1.12(draft) version, which is what is now rendered, and to wait until we can sink our teeth into a fresh 1.13(draft). This will not happen until sometime early December.

We can already now schedule a meeting for after 1.12 has been released, or are there reasons to start preparing already before the release? I think that some, in particular Jonathan, are quite busy working to get stuff into 1.12, as well as being involved in handling the current flood of CMIP7 standard name requests.

erget commented 2 weeks ago

@larsbarring fully agree, you've stated what I was trying to express - namely that we begin work after 1.12 has been released, although we can already schedule a start that will take place in the future well in advance of that! I don't want to get in the way of the release preparations.

larsbarring commented 2 weeks ago

@erget Please go ahead and set up a call for possible dates/times (I think that your schedule is more limiting than mine :-)

davidhassell commented 2 weeks ago

Count me in to that meeting!

JonathanGregory commented 2 weeks ago

I'm interested as well

taylor13 commented 2 weeks ago

I'm supportive but with no time for real contribution, so will not attend.

erget commented 2 weeks ago

Ok all, please chime in by next Monday here and I'll set up the meeting!

erget commented 1 week ago

Hahaha... Only @larsbarring indicated availability so maybe it'll just be the 2 of us. @JonathanGregory , @davidhassell , I'll also put you on the invite - if anybody else wants to @DocOtak @sadielbartholomew I'll add you as optional in case you'd like to join too :)

erget commented 1 week ago

Although! I've been on autopilot. Apologies. Of course it only makes sense to have the meeting after the 1.12 release - so I'll set a reminder to setup a new survey in mid-December. I think my thought process was

propose work meeting for mid-December, go away and forget everything
come back to find that others think it's a good idea, so set a reminder to setup a survey, forget everything
setup a survey for the near-term since I forgot everything
here we are now

My mistake. Those who were tagged in the previous comment - I'll tag you again next month when it makes sense to start.

cf-convention / cf-conventions

Use BCP 14 or inspiration of that in Conventions document #546
