cf-convention / cf-conventions

AsciiDoc Source
http://cfconventions.org/cf-conventions/cf-conventions
Creative Commons Zero v1.0 Universal

State the principles for design of the CF conventions #273

Closed JonathanGregory closed 3 years ago

JonathanGregory commented 4 years ago

Title

State the principles for design of the CF conventions

Moderator

@davidhassell

Moderator Status Review [last updated: YY/MM/DD]

Brief comment on current status, update periodically

Requirement Summary

To state the principles which should be borne in mind when designing proposed changes to the CF convention. A brief statement of them was published by Hassell et al. (2017, doi:10.5194/gmd-10-4619-2017).

Technical Proposal Summary

Add a statement of the principles near the start of the conventions document.

Benefits

Proposers of enhancements to the convention will be made better aware of these principles.

Status Quo

These principles have been applied throughout the history of CF and often mentioned in discussions, but have not been written down in the CF standard or elsewhere on the website up to now.

Detailed Proposal

Insert a new section 1.2 in the conventions document, entitled "Principles for design of the CF conventions" (following section 1.1 on "Goals"), and renumber the following sections. The text of the proposed new section is:

The following principles are followed in the design of these conventions:

davidhassell commented 4 years ago

@JonathanGregory Thank you very much for suggesting this.

In the first point, which I think is correct, I wonder if we should explain "no external resources are needed to interpret CF-netCDF metadata." with a bit more detail, as resources such as the standard name table could possibly be seen as such an external resource (even though it is wholly a part of CF). The same goes for resources such as PROJ, WKT, WoRMS, etc. (which are certainly not part of CF).

Is the answer to such questions that such resources just provide values from vocabularies, and identification of such vocabularies is clearly defined by the CF conventions?

Please correct me if I've misinterpreted what is meant by "external resources" here.

Thanks, David

JonathanGregory commented 4 years ago

Dear @davidhassell

Yes, that's a good point. I suppose the important word is "interpret". It should be possible to get some understanding of what the CF metadata means just by reading it, without having to look up anything. In particular, that means we don't use numerical codes. Instead, we use controlled vocabularies. As you say, there are external resources which define the terms allowed by these vocabularies. However, the terms themselves should be self-explanatory, so that the file alone can be interpreted by a reader, although they may have to look elsewhere to find out all the detail of its meaning. Standard names, PROJ projection names, grid mapping names, cell methods etc. are all capable of interpretation by themselves, and I think that's important. Do you agree? If so, perhaps we can find a way to explain this succinctly.

I feel that WKT is not so clearly self-explanatory. It's a bit more like code. That is one reason why I have reservations about allowing WKT to be supplied without equivalent CF-defined metadata. Taxa IDs also might not be self-explanatory, but this may not prevent the file from being interpreted, if you are clear that it is some biological taxon which is meant.

Jonathan

JonathanGregory commented 4 years ago

To address @davidhassell's concern, about "external resources", for the first point I now propose instead

In order to make CF-netCDF files self-describing, CF-netCDF metadata uses phrases from controlled vocabularies which are chosen as far as practically possible to be self-explanatory without reference to resources external to the netCDF file (although precise definitions and descriptions are also provided in CF documents). CF-netCDF metadata does not use opaque codes which could not be interpreted without external resources, and the meaning of terms is generally indicated by keyword rather than depending on order.

That's longer than I'd like, but is that better?

Jonathan

JonathanGregory commented 4 years ago

Slightly better, avoiding multiple negatives:

In order to make CF-netCDF files self-describing, CF-netCDF metadata uses phrases from controlled vocabularies which are chosen as far as practically possible to be self-explanatory without reference to resources external to the netCDF file (although precise definitions and descriptions are also provided in CF documents). CF-netCDF metadata does not use opaque codes whose interpretation would require external resources, and the meaning of terms is generally indicated by keyword rather than depending on order.
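As a rough illustration of the contrast drawn in this wording, here is a minimal Python sketch. The attribute values, the `opaque_style` encoding and the `is_self_explanatory` heuristic are all assumptions made up for this example; none of them is defined by CF.

```python
# Attributes in the CF style: meaning is carried by keywords and by
# phrases from controlled vocabularies (values here are illustrative).
cf_style = {
    "standard_name": "air_temperature",
    "units": "K",
    "cell_methods": "time: mean",
}

# Hypothetical opaque encoding: a numeric code plus positional values
# that could only be decoded with an external lookup table.
opaque_style = {"param": 167, "info": [1, 0, 3]}


def is_self_explanatory(value):
    """Crude heuristic (an assumption of this sketch): a value counts as
    self-explanatory if it is text containing at least one letter."""
    return isinstance(value, str) and any(ch.isalpha() for ch in value)


def summarise(attrs):
    """Map each attribute name to whether its value passes the heuristic."""
    return {name: is_self_explanatory(value) for name, value in attrs.items()}


if __name__ == "__main__":
    print(summarise(cf_style))
    print(summarise(opaque_style))
```

The heuristic is deliberately simplistic; it only serves to show that the first dictionary can be read on its own while the second cannot.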

taylor13 commented 4 years ago

I might reorder the first sentence (so that the parenthetical statement at the end directly follows the phrase it modifies): "In defining some of the descriptors that make CF-netCDF files self-describing, CF-netCDF relies on controlled vocabularies containing terms that are chosen as far as practically possible to be self-explanatory (and with precise definitions provided in CF documents). It should be possible to interpret CF-netCDF metadata without consulting resources external to CF."

In general, I think you have captured the principles we have been following in constructing the convention. Do you think the following additional principles should be included?

  1. (Perhaps in conjunction with the 2nd to last point in your list): "Most of the metadata defined by the conventions is optional because the full descriptive richness of CF may not be needed (or relevant) in interpreting a particular dataset."

  2. Perhaps in conjunction with the last point in your list, we should specifically say something about the data model?

JonathanGregory commented 4 years ago

Dear Karl

Thanks. I have reformulated principle (1), combining yours and mine, and stating the purpose at the start. I think "self-describing" means not using anything outside the file itself, which is stronger than what you suggested. Is this OK?

In response to your first additional point, I've appended a bit to principle (8). Thanks for your second additional point, which is important. I have inserted principle (3) about this. Finally, I have added principle (10), which is partly a corollary of (9), and partly something we've done for its own sake, often advocated by Steve Hankin.

Thus, here is the current proposal:

(1) CF-netCDF metadata is designed to make each dataset self-describing, meaning that it should be interpretable without reference to resources outside itself. To achieve this purpose, CF-netCDF does not use codes, but instead relies on controlled vocabularies containing terms that are chosen as far as practically possible to be self-explanatory (and whose precise definitions are provided in CF documents).

(2) The conventions are changed only as actually required by common use-cases, and not for needs which cannot be anticipated with certainty.

(3) [New] In order to keep them logical, consistent in approach and as simple as possible, the netCDF conventions are devised with and within the conceptual framework of the CF data model.

(4) The conventions should be practicable for both producers and users of data.

(5) The metadata should be both easily readable by humans and easily parsable by programs.

(6) [Slightly reordered] To avoid potential inconsistency within the metadata, the conventions should minimise redundancy.

(7) The conventions should minimise the possibility for mistakes by data-writers and data-readers.

(8) Conventions are provided to allow data-producers to describe the data they wish to produce, rather than attempting to prescribe what data they should produce; [new] consequently most CF conventions are optional.

(9) Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions.

(10) [New] Because all previous versions must generally continue to be supported in software for the sake of archived datasets, and in order to limit the complexity of the conventions, there is a strong preference against introducing any new capability to the conventions when there is already some method that can adequately serve the same purpose (even if a different method would arguably be better than the existing one).

Cheers

Jonathan

cameronsmith1 commented 4 years ago

Should there be a bullet about grammar of standard_names? I know we never established a formal grammar system, but a de facto grammar system mostly evolved anyway. Perhaps a bullet such as the following:

(+) The construction of new standard_names should follow the grammar conventions established by previous standard names as much as possible.

DanHollis commented 4 years ago

Just thinking about this long-standing principle that files be self-describing. Section 2.6 of the convention states:

“…a file may also contain non-standard attributes. Such attributes do not represent a violation of this standard. Application programs should ignore attributes that they do not recognise or which are irrelevant for their purposes.”

This suggests that there is nothing stopping me from adding opaque metadata to my files (e.g. an EPSG code – this is in fact something that we do for our own internal use). However, someone using generic tools to examine the file (ncview, ncdump) won’t know which attributes are part of the CF standard and which are not. The fact that a subset of the attributes (all the CF attributes plus, possibly, some of the non-CF attributes) are self-describing becomes irrelevant if some of the (non-CF) attributes are opaque.

It appears that the only way for a user to distinguish between CF and non-CF attributes (to work out which are a sufficient subset to interpret the file) is to refer to the CF convention and/or any documentation supplied by the data provider, or to use software that is aware of the CF standard. I would argue that this means that CF-compliant files are not really self-describing given the need to reference external resources (standards/documentation or software). Even if the file did not contain any non-CF attributes, a user unfamiliar with CF would not know this without reference to external resources.
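To make the point concrete, here is a minimal Python sketch of what "software that is aware of the CF standard" would have to do. The set `KNOWN_CF_ATTRIBUTES` is a small, deliberately incomplete subset of CF attribute names chosen for this example, and `epsg_code` is a hypothetical non-CF attribute name; the complete list would have to come from the CF document itself, which is exactly the external resource in question.

```python
# Deliberately incomplete subset of attribute names defined by CF,
# hard-coded here as an assumption of this sketch.
KNOWN_CF_ATTRIBUTES = frozenset({
    "Conventions", "standard_name", "long_name", "units", "calendar",
    "coordinates", "bounds", "cell_methods", "grid_mapping",
})


def split_attributes(attrs, cf_names=KNOWN_CF_ATTRIBUTES):
    """Partition an attribute dictionary into (CF, non-CF) parts,
    using only the names listed in cf_names."""
    cf = {k: v for k, v in attrs.items() if k in cf_names}
    other = {k: v for k, v in attrs.items() if k not in cf_names}
    return cf, other


if __name__ == "__main__":
    variable_attrs = {
        "standard_name": "air_temperature",
        "units": "K",
        "epsg_code": 27700,  # hypothetical opaque, non-CF attribute
    }
    cf_part, other_part = split_attributes(variable_attrs)
    print("CF:", cf_part)
    print("non-CF:", other_part)
```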

If opaque non-CF metadata are permitted then I’m not sure of the benefit of CF requiring the rest of the attributes to be self-describing (however good that might be in principle). This implies that either non-CF attributes should be prohibited or the principle of self-describing files should be dropped.

Am I missing something? What do others think?

Dan

MaartenSneepKNMI commented 4 years ago

Dan Hollis wrote:

If opaque non-CF metadata are permitted then I’m not sure of the benefit of CF requiring the rest of the attributes to be self-describing (however good that might be in principle). This implies that either non-CF attributes should be prohibited or the principle of self-describing files should be dropped.

Am I missing something? What do others think?

I've always taken the self-describing principle to apply to the data, not to the metadata. In fact, the data becomes self-describing because of the metadata. Prohibiting additional metadata will cause severe trouble in various fields of application, for instance when there is a requirement to support multiple metadata conventions (ACDD comes to mind) that are supplemental to CF.

Maarten

DanHollis commented 4 years ago

Hi Maarten,

I agree, prohibiting additional metadata would be a problem and is not something I am advocating. However, self-describing files seems to be a key principle of CF (which, I think, has prevented the adoption of opaque codes such as EPSG), so I was interested in how this stacked up, logically, with allowing non-CF metadata.

I take your point that the metadata describes the data. However, if you don’t know which metadata are sufficient to describe the data without reference to external resources then I would argue that is not self-describing. The example of an EPSG code is particularly interesting. As I’m sure you are aware, there are already long discussions about grid mapping parameters vs WKT. If I release a file with an EPSG attribute as well then users may be even more confused regarding which takes precedence and which should be ignored! Of course, you may rightly say that is bad practice on my part. However, with the convention as it stands, I can legitimately claim that file to be CF-compliant.

Regards,

Dan

balaji-gfdl commented 4 years ago

Dear @JonathanGregory I would not be too insistent on the distinction between "file" and "dataset" in (1). The concepts of file and filesystem may be obsolescent in the evolution of storage technology (in particular, zarr is emerging as a netCDF backend, in the imminent release actually, where attributes will be stored in a distinct "object" which may look like a "file" on some storage technologies).

Whereas I think "dataset" is not tied to a specific technology. Thanks.

MaartenSneepKNMI commented 4 years ago

Hi Dan,

If an attribute name is not part of CF, then the CF conventions have no opinion on it. That external reference to the CF conventions (and the netCDF users guide) is implicit, and unavoidable. Luckily most of the CF conventions are rather obvious: a "units" attribute is human-readable without context. coordinates or bounds attributes are less trivial, but easy enough to figure out. In that sense: self-describing is a goal, but it can never be fully reached.

On the other hand: if you manage to introduce contradictory metadata into a file, then you get exactly what you deserve, so the question of precedence should not matter: use what you can read and understand. If that is part of CF: great. If you have to be creative, or if you supply data that helps some code to better handle the data: excellent.

Kind regards,

Maarten

rabernat commented 4 years ago

One solution to the question of extra non-cf metadata or multiple conventions in the same dataset is to use namespace prefixes, e.g. all cf metadata could be prefixed by cf:. That would give us things like

cf:standard_name 
cf:units
acdd:geospatial_bounds

etc.

Obviously it's too late for such practices to be applied retroactively to existing files, but perhaps it would be useful going forward?
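A minimal Python sketch of how a reader could exploit such prefixes, assuming the `prefix:name` form shown above. The function `group_by_namespace` and the `"unprefixed"` bucket are hypothetical names introduced for this example; this naming scheme is a suggestion in this thread, not part of CF.

```python
from collections import defaultdict


def group_by_namespace(attr_names, default="unprefixed"):
    """Group attribute names by the text before the first ':'.

    Names without a usable 'prefix:name' form go into the default bucket.
    """
    groups = defaultdict(list)
    for name in attr_names:
        prefix, sep, rest = name.partition(":")
        if sep and prefix and rest:
            groups[prefix].append(rest)
        else:
            groups[default].append(name)
    return dict(groups)


if __name__ == "__main__":
    names = ["cf:standard_name", "cf:units", "acdd:geospatial_bounds", "history"]
    for namespace, members in group_by_namespace(names).items():
        print(namespace, members)
```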

taylor13 commented 4 years ago

Regarding the recent revisions to the "principles", I like the revisions you've made in https://github.com/cf-convention/cf-conventions/issues/273#issuecomment-649396973 , and I think there are no references to "files" now (only to "datasets"), unless I've missed something. The suggestion by Philip in https://github.com/cf-convention/cf-conventions/issues/273#issuecomment-649414779 sounds like a good one to me.

And I think we should consider whether we could do something to identify the CF attributes in a file, maybe simply listing the ones appearing in the file in the "Conventions" attribute, enclosed in parentheses (e.g., Conventions = "CF1.8 (standard_name, units, coordinates, bounds, calendar)"). Of course this would not be particularly elegant, but would work. If someone wants to pursue this or related suggestions, I guess that discussion should be moved elsewhere.

As for a need to make it easy for folks to distinguish the CF attributes from others, consider the following analogy: Suppose authors submitting a paper to a British journal are required to write it in English, but are allowed to include short passages in a foreign language (without translation). Many readers literate in only English would be able to understand most of the article (and would likely skip over the sections in a foreign language). We would not expect the author to point out that the foreign language bit was not English. We expect those reading netCDF files and interpreting the metadata to do the same thing: recognize what is CF and what is not, and be able to interpret the CF attributes.
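For what it is worth, a value in that parenthesised form would be straightforward to parse. The following Python sketch assumes exactly the layout of the example above (a convention name optionally followed by one comma-separated list in parentheses); `parse_conventions` is a hypothetical helper, and the format itself is only a suggestion made in this comment, not something CF defines.

```python
import re

# Convention name, optionally followed by "(attr1, attr2, ...)".
_PATTERN = re.compile(r"\s*([^()]+?)\s*(?:\(([^()]*)\))?\s*")


def parse_conventions(value):
    """Split a Conventions-style string into (name, listed_attributes)."""
    match = _PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognised Conventions value: {value!r}")
    name, listed = match.group(1), match.group(2)
    attrs = [item.strip() for item in listed.split(",") if item.strip()] if listed else []
    return name, attrs


if __name__ == "__main__":
    example = "CF1.8 (standard_name, units, coordinates, bounds, calendar)"
    print(parse_conventions(example))
```

A value without a parenthesised list simply yields an empty attribute list, so existing Conventions strings would still parse under this assumed format.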

sethmcg commented 4 years ago

For me, Ryan's suggestion of a cf-prefix on attributes would be too large a break in backwards-compatibility; I don't think I could support it. Karl's suggestion of listing cf attributes somewhere seems more reasonable. Though maybe it would be better to put it in a separate global attribute rather than in Conventions, which could be a list of multiple conventions?

I also worry about a list of every CF attribute getting really long and being fragile to changes. What if instead of specific attributes, there was a controlled vocabulary of concepts or elements to list, so you'd have something like cf-elements = "bounds, climatology, flags, mapping, stdname"?
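A minimal Python sketch of how such an attribute might be checked against a controlled vocabulary. Both the `cf-elements` attribute and the vocabulary below are hypothetical: the element names are taken only from the example in this comment, and `check_elements` is a helper invented for illustration.

```python
# Hypothetical controlled vocabulary, copied from the example above.
ALLOWED_ELEMENTS = frozenset({"bounds", "climatology", "flags", "mapping", "stdname"})


def check_elements(value, allowed=ALLOWED_ELEMENTS):
    """Parse a comma-separated element list and report unknown terms.

    Returns (elements, unknown), where unknown is sorted alphabetically.
    """
    elements = [item.strip() for item in value.split(",") if item.strip()]
    unknown = sorted(item for item in elements if item not in allowed)
    return elements, unknown


if __name__ == "__main__":
    elements, unknown = check_elements("bounds, climatology, flags, mapping, stdname")
    print(elements, unknown)
```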

sethmcg commented 4 years ago

There's a subtle point here that I just got tripped up by (I wrote and then deleted a long but erroneous comment), which is that the first principle of self-description is talking about the metadata being self-describing, not the entire file contents. Is that correct?

Which is to say, in this sentence: "CF-netCDF metadata is designed to make each dataset self-describing, meaning that it should be interpretable without reference to resources outside itself," the word 'it' refers to 'metadata', not to 'dataset'. True?

If so, it probably wouldn't hurt to call that out very explicitly and bluntly; I think it's likely to confuse more than just me.

JonathanGregory commented 4 years ago

Dear @sethmcg Thanks for pointing out this subtlety, which I had not thought about properly. I meant "it" to refer to the dataset. The purpose of the metadata is to describe the contents of the dataset. With the metadata included, the dataset is self-describing, provided also that the metadata can be sufficiently understood without looking up anything. That is, there are two requirements for a self-describing dataset: (1) It includes metadata, which explains the data, (2) The metadata is self-explanatory (so there's no need for metametadata!). Does that make sense? Cheers, Jonathan

sethmcg commented 4 years ago

@JonathanGregory It does, but in that case, I think the current wording is too strong. I don't think zero external references is achievable. Much as we try to make them intuitively obvious and self-explanatory, it's not uncommon to have to look up the precise definitions of controlled vocabulary terms on the website, and I call that an external reference. (In my book, anything not in the netcdf header itself is external.)

Plus, we now have some issues in development (such as provenance tracking and I think cell_measures for CMIP6 ocean data, IIRC?) where there are strong practical reasons for wanting to reference information that is stored separately, and I think CF is going to have to bend to accommodate them. (There's a potential dodge in the form of saying "dataset" rather than "file", where it could be argued that those are part of the same dataset, but that opens up a hole, because then where do you stop?)

So I think it needs to be loosened a little. This is a guiding principle, not a definition, so what about something along the lines of: "CF-netCDF metadata is designed to make each dataset as self-describing as possible, meaning that the need to reference external resources to interpret the dataset should be minimal."? And then perhaps we might want some discussion about the cases where this principle has been bent a little for reasons of feasibility and practicality.

JonathanGregory commented 4 years ago

Dear all

Thanks for the discussions. I think non-CF metadata in a CF file can't be expected to conform to CF principles, but that doesn't mean we shouldn't stick to our principles. I agree with Karl @taylor13 that discussion of how to identify the convention followed by particular attributes belongs in a different issue from this one. It has been raised before, in fact.

Taking up the suggestions from @sethmcg about self-describing being a guiding principle (not an absolute requirement) in (1) and Philip @cameronsmith1 about standard names in (3), here's the present proposal:

(1) CF-netCDF metadata is designed to make datasets self-describing as far as practically possible. A self-describing dataset is one which can be interpreted without need for reference to resources outside itself, and the CF principle is to minimise that need. Therefore CF-netCDF does not use codes, but instead relies on controlled vocabularies containing terms that are chosen to be self-explanatory (but more detailed definitions of them are provided in CF documents).

(2) The conventions are changed only as actually required by common use-cases, and not for needs which cannot be anticipated with certainty.

(3) In order to keep them logical, consistent in approach and as simple as possible, the netCDF conventions are devised with and within the conceptual framework of the CF data model, and new standard names are constructed as far as possible to follow the syntax and vocabulary of existing standard names.

(4) The conventions should be practicable for both producers and users of data.

(5) The metadata should be both easily readable by humans and easily parsable by programs.

(6) To avoid potential inconsistency within the metadata, the conventions should minimise redundancy.

(7) The conventions should minimise the possibility for mistakes by data-writers and data-readers.

(8) Conventions are provided to allow data-producers to describe the data they wish to produce, rather than attempting to prescribe what data they should produce; consequently most CF conventions are optional.

(9) Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions.

(10) Because all previous versions must generally continue to be supported in software for the sake of archived datasets, and in order to limit the complexity of the conventions, there is a strong preference against introducing any new capability to the conventions when there is already some method that can adequately serve the same purpose (even if a different method would arguably be better than the existing one).

Would you support it in this form (Seth, Karl, Philip)? Are there further concerns or additions?

Best wishes

Jonathan

taylor13 commented 4 years ago

I didn't examine every sentence too carefully this time, but my reading is that this covers everything and is clearly written. I support it. Thanks to all contributors and to David for moderating. Karl

sethmcg commented 4 years ago

Looks good to me! I approve.

cameronsmith1 commented 4 years ago

Looks good to me too.

JonathanGregory commented 4 years ago

OK, thanks all. I suppose that means the three-week period to see if there are any further concerns started three days ago on Friday 10th. Jonathan

davidhassell commented 4 years ago

Hello,

I have come back to this, re-read the whole thread, and also agree that the last stated set of principles (https://github.com/cf-convention/cf-conventions/issues/273#issuecomment-656724527) looks good.

@JonathanGregory Would you like to create a pull request for some new text for the conventions document?

Thanks, all, for an interesting discussion.

JonathanGregory commented 4 years ago

I have created https://github.com/cf-convention/cf-conventions/pull/303 to implement this change

davidhassell commented 4 years ago

Thanks, @JonathanGregory. If there are no further substantive comments on the text (i.e. excluding spelling mistakes, etc.), I'll merge #303 in three weeks, on Monday 19th October.

davidhassell commented 3 years ago

Three weeks have elapsed with no further comment, so the changes have been merged and this issue is closed.

Thanks again, David