JonathanGregory commented 4 years ago

Title

State the principles for design of the CF conventions

Moderator

@davidhassell

Moderator Status Review [last updated: YY/MM/DD]

Brief comment on current status, update periodically

Requirement Summary

To state the principles which should be borne in mind when designing proposed changes to the CF convention. A brief statement of them was published by Hassell et al. (2017, 10.5194/gmd-10-4619-2017).

Technical Proposal Summary

Add a statement of the principles near the start of the conventions document.

Benefits

Proposers of enhancements to the convention will be made better aware of the principles that their proposals are expected to follow.

Status Quo

These principles have been applied throughout the history of CF and often mentioned in discussions, but have not been written down in the CF standard or elsewhere on the website up to now.

Detailed Proposal

Insert a new section 1.2 in the conventions document, entitled "Principles for design of the CF conventions" (following section 1.1 on "Goals"), and renumber the following sections. The text of the proposed new section is:

The following principles are followed in the design of these conventions:

In order to make CF-netCDF files self-describing, no external resources are needed to interpret CF-netCDF metadata.
The conventions are changed only as actually required by common use-cases, and not for needs which cannot be anticipated with certainty.
The conventions should be practicable for both producers and users of data.
The metadata should be both easily readable by humans and easily parsable by programs.
To avoid potential inconsistency, the conventions should minimise redundancy in the metadata.
The conventions should minimise the possibility for mistakes by data-writers and data-readers.
Conventions are provided to allow data-producers to describe the data they wish to produce, rather than attempting to prescribe what data they should produce.
Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions.

davidhassell commented 4 years ago

@JonathanGregory Thanks very much for suggesting this.

In the first point, which I think is correct, I wonder if we should explain "no external resources are needed to interpret CF-netCDF metadata." with a bit more detail, as resources such as the standard name table could possibly be seen as such an external resource (even though it is wholly a part of CF). The same question arises for resources such as PROJ, WKT, WoRMS, etc. (which are certainly not part of CF).

Is the answer to such questions that such resources just provide values from vocabularies, and identification of such vocabularies is clearly defined by the CF conventions?

Please correct me if I've misinterpreted what is meant by "external resources" here.

Thanks, David

JonathanGregory commented 4 years ago

Dear @davidhassell

Yes, that's a good point. I suppose the important word is "interpret". It should be possible to get some understanding of what the CF metadata means just by reading it, without having to look up anything. In particular, that means we don't use numerical codes. Instead, we use controlled vocabularies. As you say, there are external resources which define the terms allowed by these vocabularies. However, the terms themselves should be self-explanatory, so that the file alone can be interpreted by a reader, although they may have to look elsewhere to find out all the detail of its meaning. Standard names, PROJ projection names, grid mapping names, cell methods etc. are all capable of interpretation by themselves, and I think that's important. Do you agree? If so, perhaps we can find a way to explain this succinctly.

I feel that WKT is not so clearly self-explanatory. It's a bit more like code. That is one reason why I have reservations about allowing WKT to be supplied without equivalent CF-defined metadata. Taxa IDs also might not be self-explanatory, but this may not prevent the file from being interpreted, if you are clear that it is some biological taxon which is meant.
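As a rough illustration of the contrast between controlled-vocabulary terms and numerical codes, here is a minimal Python sketch. It is not taken from any CF document or real dataset: the `coded_style` scheme and the `opaque_attributes` helper are invented for this example, and the "bare integer" test is only a crude stand-in for "needs a lookup table to interpret".

```python
# CF-style attributes: the values are vocabulary terms that can be read directly.
cf_style = {
    "standard_name": "air_temperature",
    "units": "K",
    "cell_methods": "time: mean",
}

# Hypothetical code-based scheme (invented for contrast, not a real convention):
# every value is a number that means nothing without an external table.
coded_style = {"param": 11, "unit": 3, "stat": 0}


def opaque_attributes(attrs):
    """Return the sorted names of attributes whose values are bare integer codes.

    This is a crude heuristic for illustration only; genuine CF metadata also
    has legitimately numeric attributes.
    """
    return sorted(
        name
        for name, value in attrs.items()
        if isinstance(value, int) and not isinstance(value, bool)
    )


print(opaque_attributes(cf_style))
print(opaque_attributes(coded_style))
```

A reader of `cf_style` can get a fair idea of the variable from the file alone, whereas every entry of `coded_style` has to be looked up elsewhere.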

Jonathan

JonathanGregory commented 4 years ago

To address @davidhassell's concern about "external resources", for the first point I now propose instead:

In order to make CF-netCDF files self-describing, CF-netCDF metadata uses phrases from controlled vocabularies which are chosen as far as practically possible to be self-explanatory without reference to resources external to the netCDF file (although precise definitions and descriptions are also provided in CF documents). CF-netCDF metadata does not use opaque codes which could not be interpreted without external resources, and the meaning of terms is generally indicated by keyword rather than depending on order.

That's longer than I'd like, but is that better?

Jonathan

JonathanGregory commented 4 years ago

Slightly better, avoiding multiple negatives:

In order to make CF-netCDF files self-describing, CF-netCDF metadata uses phrases from controlled vocabularies which are chosen as far as practically possible to be self-explanatory without reference to resources external to the netCDF file (although precise definitions and descriptions are also provided in CF documents). CF-netCDF metadata does not use opaque codes whose interpretation would require external resources, and the meaning of terms is generally indicated by keyword rather than depending on order.
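The point about keywords rather than order can be sketched in Python. The attribute names below are CF grid mapping attributes, but the numerical values, the `positional` tuple and the `describe` helper are assumptions made up for illustration.

```python
# Keyword-based description: each parameter is named, so order carries no meaning.
grid_mapping_attrs = {
    "grid_mapping_name": "rotated_latitude_longitude",
    "grid_north_pole_latitude": 37.5,
    "grid_north_pole_longitude": 177.5,
}

# Hypothetical positional encoding of the same information (not a CF construct):
# a reader must already know that the second item is the latitude and the third
# the longitude.
positional = ("rotated_latitude_longitude", 37.5, 177.5)


def describe(attrs):
    """Render attributes as 'name=value' pairs in a stable, sorted order."""
    return "; ".join(f"{name}={value}" for name, value in sorted(attrs.items()))


# Writing the same attributes in a different order changes nothing.
reordered = dict(reversed(list(grid_mapping_attrs.items())))
print(describe(grid_mapping_attrs) == describe(reordered))
```

Swapping the last two items of `positional` would silently change its meaning, while the keyword form is unaffected by reordering.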

taylor13 commented 4 years ago

I might reorder the first sentence (so that the parenthetical statement at the end directly follows the phrase it modifies): "In defining some of the descriptors that make CF-netCDF files self-describing, CF-netCDF relies on controlled vocabularies containing terms that are chosen as far as practically possible to be self-explanatory (and with precise definitions provided in CF documents). It should be possible to interpret CF-netCDF metadata without consulting resources external to CF."

In general, I think you have captured the principles we have been following in constructing the convention. Do you think the following additional principles should be included?

(Perhaps in conjunction with the 2nd to last point in your list): "Most of the metadata defined by the conventions is optional because the full descriptive richness of CF may not be needed (or relevant) in interpreting a particular dataset."
Perhaps in conjunction with the last point in your list, we should specifically say something about the data model?

JonathanGregory commented 4 years ago

Dear Karl

Thanks. I have reformulated principle (1), combining yours and mine, and stating the purpose at the start. I think "self-describing" means not using anything outside the file itself, which is stronger than what you suggested. Is this OK?

In response to your first additional point, I've appended a bit to principle (8). Thanks for your second additional point, which is important. I have inserted principle (3) about this. Finally, I have added principle (10), which is partly a corollary of (9), and partly something we've done for its own sake, often advocated by Steve Hankin.

Thus, here is the current proposal:

(1) CF-netCDF metadata is designed to make each dataset self-describing, meaning that it should be interpretable without reference to resources outside itself. To achieve this purpose, CF-netCDF does not use codes, but instead relies on controlled vocabularies containing terms that are chosen as far as practically possible to be self-explanatory (and whose precise definitions are provided in CF documents).

(2) The conventions are changed only as actually required by common use-cases, and not for needs which cannot be anticipated with certainty.

(3) [New] In order to keep them logical, consistent in approach and as simple as possible, the netCDF conventions are devised with and within the conceptual framework of the CF data model.

(4) The conventions should be practicable for both producers and users of data.

(5) The metadata should be both easily readable by humans and easily parsable by programs.

(6) [Slightly reordered] To avoid potential inconsistency within the metadata, the conventions should minimise redundancy.

(7) The conventions should minimise the possibility for mistakes by data-writers and data-readers.

(8) Conventions are provided to allow data-producers to describe the data they wish to produce, rather than attempting to prescribe what data they should produce; [new] consequently most CF conventions are optional.

(9) Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions.

(10) [New] Because all previous versions must generally continue to be supported in software for the sake of archived datasets, and in order to limit the complexity of the conventions, there is a strong preference against introducing any new capability to the conventions when there is already some method that can adequately serve the same purpose (even if a different method would arguably be better than the existing one).

Cheers

Jonathan

cameronsmith1 commented 4 years ago

Should there be a bullet about grammar of standard_names? I know we never established a formal grammar system, but a de facto grammar system mostly evolved anyway. Perhaps a bullet such as the following:

(+) The construction of new standard_names should follow the grammar conventions established by previous standard names as much as possible.

DanHollis commented 4 years ago

Just thinking about this long-standing principle that files be self-describing. Section 2.6 of the convention states:

“…a file may also contain non-standard attributes. Such attributes do not represent a violation of this standard. Application programs should ignore attributes that they do not recognise or which are irrelevant for their purposes.”

This suggests that there is nothing stopping me from adding opaque metadata to my files (e.g. an EPSG code – this is in fact something that we do for our own internal use). However, someone using generic tools to examine the file (ncview, ncdump) won’t know which attributes are part of the CF standard and which are not. The fact that a subset of the attributes (all the CF attributes plus, possibly, some of the non-CF attributes) are self-describing becomes irrelevant if some of the (non-CF) attributes are opaque.

It appears that the only way for a user to distinguish between CF and non-CF attributes (to work out which are a sufficient subset to interpret the file) is to refer to the CF convention and/or any documentation supplied by the data provider, or to use software that is aware of the CF standard. I would argue that this means that CF-compliant files are not really self-describing given the need to reference external resources (standards/documentation or software). Even if the file did not contain any non-CF attributes, a user unfamiliar with CF would not know this without reference to external resources.
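To make the point concrete, here is a minimal Python sketch of what "software that is aware of the CF standard" would have to do. The `KNOWN_CF_ATTRIBUTES` set is only a small subset of the attribute names listed in the conventions, chosen for illustration, and `split_attributes` and the `epsg_code` attribute name are hypothetical.

```python
# Partial, illustrative subset of CF attribute names; a real tool would need
# the full list from the conventions document, which is itself an external resource.
KNOWN_CF_ATTRIBUTES = frozenset({
    "standard_name", "long_name", "units", "cell_methods", "coordinates",
    "grid_mapping", "bounds", "axis", "calendar", "positive",
    "_FillValue", "missing_value", "valid_min", "valid_max", "valid_range",
    "flag_values", "flag_meanings", "ancillary_variables", "cell_measures",
})


def split_attributes(attrs):
    """Split a variable's attributes into (recognised CF, everything else)."""
    cf = {k: v for k, v in attrs.items() if k in KNOWN_CF_ATTRIBUTES}
    other = {k: v for k, v in attrs.items() if k not in KNOWN_CF_ATTRIBUTES}
    return cf, other


# Hypothetical variable carrying one opaque, non-CF attribute.
example = {"standard_name": "air_temperature", "units": "K", "epsg_code": 27700}

cf_attrs, other_attrs = split_attributes(example)
print(sorted(cf_attrs))
print(other_attrs)
```

The split works only because the list of CF attribute names is supplied from outside the file, which is exactly the dependence on external resources described above.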

If opaque non-CF metadata are permitted then I’m not sure of the benefit of CF requiring the rest of the attributes to be self-describing (however good that might be in principle). This implies that either non-CF attributes should be prohibited or the principle of self-describing files should be dropped.

Am I missing something? What do others think?

Dan
