cf-convention / cf-conventions

AsciiDoc Source
http://cfconventions.org/cf-conventions/cf-conventions
Creative Commons Zero v1.0 Universal

Updating definition of coordinate variable to account for NUG changes #174

Closed martinjuckes closed 1 month ago

martinjuckes commented 5 years ago

In NetCDF4, coordinate variables can be string valued or character arrays. This is a change from NetCDF3 and, because of this change, the section of the CF Convention which refers to the NetCDF definition of coordinate variables contains a contradiction.

Section 1.2 on terminology states that a Coordinate Variable is defined "precisely as it is defined in the NUG section on coordinate variables": this now implies string and character values are allowed. However, the following sentence in the definition of a Coordinate Variable states that it should be "numeric data type with values that are ordered monotonically".

We could resolve this contradiction by either (1) retaining the restriction to numeric data types and dropping precise equivalence with NUG or (2) retaining precise equivalence with NUG and allowing string and char coordinate variables. Initial discussion on the CF Discussions email list has two votes in favour of option 1. This would require minor changes to the text. In principle there would be no change to the conformance requirements, but the requirement for numeric data types does not appear to be represented in the conformance document and should be added.

If option (2) is taken, there is some ambiguity about the meaning of the monotonicity requirement which we would need to resolve.
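
One way to see the ambiguity under option (2): whether a list of string labels counts as "monotonic" depends on the ordering rule that is chosen. The sketch below is illustrative only. The helper name `is_strictly_increasing` and the basin labels are assumptions made for this example and do not come from the NUG or from CF.

```python
from typing import Callable, Sequence


def is_strictly_increasing(
    values: Sequence[str], key: Callable[[str], str] = lambda s: s
) -> bool:
    """Return True if each value sorts strictly after the previous one under `key`."""
    keyed = [key(v) for v in values]
    return all(a < b for a, b in zip(keyed, keyed[1:]))


# Hypothetical labels for a string-valued basin(basin) variable.
labels = ["Arctic", "Indian", "atlantic"]

# Ordering by Unicode code point: uppercase letters sort before lowercase ones.
by_code_point = is_strictly_increasing(labels)

# Case-insensitive ordering: "atlantic" now sorts before "indian".
case_insensitive = is_strictly_increasing(labels, key=str.casefold)

print(by_code_point, case_insensitive)
```

The same labels are ordered under one rule and unordered under the other, so a monotonicity requirement for strings would have to name the comparison it intends.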


PR #531 implements the decisions made by the following discussion.

JonathanGregory commented 5 years ago

Thanks for raising this issue, Martin. I agree with option (1), which means this is a defect, as you have labelled it. Proposals to remedy defects are accepted by default if no-one objects within three weeks. Jonathan

davidhassell commented 5 years ago

I support option 1) as well. Thanks for raising this, Martin.

JimBiardCics commented 5 years ago

I concur. Option 1) is the right choice. Do we need to add any clarifying verbiage regarding "label" coordinates?

In the old character array approach, a label coordinate variable was, by definition, an auxiliary coordinate since it was (almost) never 1D. A 1D string variable can meet the dimensional requirements for a coordinate variable. You can construct a variable with matching name and dimension name, for example, basin(basin). It seems we have three options:

A) State that this form is not allowed. Such variables would always need to have non-matching name and dimension name. This implies a cf-checker test that would fail if a 1D string variable had a matching dimension name.

B) State that this form is allowed, but that it will only be considered as an auxiliary coordinate for a variable if it is included in a coordinates attribute on the variable. This implies a cf-checker change to ignore 1D string variables like basin(basin) when building lists of coordinate variables.

C) State that string variables that would otherwise look like they could be coordinate variables are always auxiliary coordinate variables. This implies that a string variable such as basin(basin) would be understood as an auxiliary coordinate for a variable such as flow(time, basin) without the need to include it in a coordinates attribute. This likely implies a cf-checker change that would accept this form as valid.

In every case, there are implications for the data model, and software packages such as cdms that attempt to build coordinate domains for data variables will need to deal appropriately with 1D string variables that appear to match the requirements for a coordinate variable. (@taylor13, you might want to chime in.)
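
To make the difference between options A), B) and C) concrete, here is a minimal sketch of how a reader might treat a string variable such as basin(basin) under each one. The `Var` class and the function names are hypothetical stand-ins, not part of cf-checker, cdms, or any netCDF library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Var:
    """Minimal stand-in for a netCDF variable (hypothetical, not a library type)."""
    name: str
    dtype: str                      # e.g. "float", "string"
    dims: Tuple[str, ...]
    attrs: Dict[str, str] = field(default_factory=dict)


def looks_like_string_coordinate(var: Var) -> bool:
    """True for 1D string variables named after their dimension, e.g. basin(basin)."""
    return var.dtype == "string" and var.dims == (var.name,)


def aux_coords_for(data: Var, candidates: List[Var], option: str) -> List[str]:
    """Names of basin(basin)-style string variables treated as auxiliary coordinates."""
    listed = data.attrs.get("coordinates", "").split()
    result = []
    for var in candidates:
        if not looks_like_string_coordinate(var):
            continue
        if option == "A":
            # Option A: the form itself is an error.
            raise ValueError(f"string {var.name}({var.name}) is not allowed")
        if option == "B" and var.name in listed:
            # Option B: only counted when named in the coordinates attribute.
            result.append(var.name)
        if option == "C" and var.name in data.dims:
            # Option C: counted implicitly when the data variable uses the dimension.
            result.append(var.name)
    return result


basin = Var("basin", "string", ("basin",))
flow = Var("flow", "float", ("time", "basin"))  # no coordinates attribute

print(aux_coords_for(flow, [basin], "B"))
print(aux_coords_for(flow, [basin], "C"))
```

With no coordinates attribute on flow, option B) ignores basin while option C) attaches it, which is the behavioural difference a checker or data-model library would have to implement.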

martinjuckes commented 5 years ago

Hi Jim,

I hadn't spotted that problem. My preference is for (A).

As I understand it, the CF data model has a single namespace, so there can only be one basin object. In option (B), in the absence of a coordinates declaration, the NetCDF file has a dimension basin and a variable basin which, in the NetCDF data model, live in separate namespaces. I don't think we can accommodate this in CF without a significant change to the data model, which does not appear to be justified here.

Option (C) looks awkward to me. The size of an auxiliary coordinate is usually determined by its own coordinates. If there are no true coordinates, it is not really "auxiliary" to anything.

Perhaps @davidhassell can comment on the data model issues.

martinjuckes commented 5 years ago

See also #139, which is a proposed enhancement to support string variables.

davidhassell commented 5 years ago

I don't think that this is a data model issue, whichever option we choose.

The data model doesn't care how its constructs are encoded: all it needs is to be able to unambiguously identify its constructs from a file. For example, if we were to say in the conventions "when you see a string variable like basin(basin), interpret the variable basin as an auxiliary coordinate variable for the "discrete" axis basin" then that would fit in perfectly with the existing data model.

JimBiardCics commented 5 years ago

I agree with @davidhassell that whatever option we choose, it's not difficult to incorporate into the data model. @martinjuckes, I have to confess that I don't understand your argument against options B) or C). I prefer options B) and C) myself. I don't see a good reason to make the entirely natural choice of creating a 1D string coordinate variable named basin(basin) illegal just because it is not numeric. I would love to adopt option C), but I can see the argument for option B).

martinjuckes commented 5 years ago

OK, it is good to see that there are no obstacles from the data model side of things. My mistake there.

Adopting (B) or (C) would require a change to the definition of an Auxiliary Coordinate, which currently includes the statement that "Unlike coordinate variables, there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s)" -- it would be good to know what alternative is being proposed.

@JonathanGregory : you took an interest in this topic during the email discussions. Do you have a preference for any of the options Jim has outlined above (on the 16th)?

ethanrd commented 5 years ago

I like option one, i.e., sticking with the restriction that coordinate variables are 1-D numeric, monotonic.

To me this is more of a correction than a change because the main NUG section on coordinate variables is not actually particularly precise in its definition of coordinate variables. It says

It is legal for a variable to have the same name as a dimension. Such variables have no special meaning to the netCDF library. However there is a convention that such variables should be treated in a special way by software using this library. ... If a dimension has a corresponding coordinate variable, then this provides an alternative, and often more convenient, means of specifying position along it. Current application packages that make use of coordinate variables commonly assume they are numeric vectors and strictly monotonic (all values are different and either increasing or decreasing).

The NUG Best Practices section on coordinate variables is a bit more precise. These should be made more consistent. There's discussion at Unidata on getting the NUG in its own repo so changes like this can be a bit more transparent.

martinjuckes commented 5 years ago

Hello All,

There appears to be consensus on point 1: treating the wording of the definition of a coordinate variable as a defect and modifying it to state clearly that CF requires coordinate variables to be of numeric data type.

I don't think we have identified a clear preference regarding Jim's suggestions about auxiliary coordinate variables. In going through the changes needed in the Conventions document I noticed that, near the start of Chapter 5, we have the sentences "We recommend that the name of a multidimensional [auxiliary] coordinate variable should not match the name of any of its dimensions because that precludes supplying a coordinate variable for the dimension. This practice also avoids potential bugs in applications that determine coordinate variables by only checking for a name match between a dimension and a variable and not checking that the variable is one dimensional". It appears to me that the logic of this recommendation applies equally well to rank 1 auxiliary coordinates of non-numeric data. Do you agree with this, @JimBiardCics, or is there a reason to treat uni-dimensional auxiliary coordinates differently here?

That is,

dimensions:
   n = 1;
   basin = 4;
variables:
   float data(basin,n);
      data:coordinates = "basin";
   string basin(basin,n);

is recommended against. This is phrased as a restriction on multidimensional coordinate variables, but I believe it makes sense to treat it as applying to all auxiliary coordinate variables. This would be a slight variation on Jim's option A above, recommending that basin(basin) be avoided for auxiliary coordinates, rather than saying it is disallowed, and also leaving it open to have a simple data variable of the form basin(basin).

i.e. we have an option (D): If a string or character variable has a single dimension matching its own name, it will be treated as a data variable with an index dimension. It is recommended that such variables should not be used as auxiliary coordinate variables.

I also noticed that the first sentence of Section 1.2 is also out of date, in that it states that the terms defined come from the NUG. I think this has been wrong for some time, as most of the terms appear to be specific to CF.

I've drafted some proposed updates to the document, in the 4 places that I believe need updating:

1. Update 1st sentence of Section 1.2

Current:

The terms in this document that refer to components of a netCDF file are based on terms
 defined in the NetCDF User’s Guide (NUG) [NUG] NUG. Some of those definitions are
 repeated below for convenience.

Proposed:

The terms in this document that refer to components of a NetCDF file are defined below.
Some of these are as defined in the NetCDF User’s Guide (NUG) [NUG] NUG and are
repeated below for convenience. Terms which are introduced by NUG are
marked *[NUG]*, and terms which are introduced in NUG and modified here are marked
*[NUG->CF]*.

2. Update terms in Section 1.2 to indicate those based on NUG

I think these 4 are the only terms that have a specific meaning in NUG.

3. Update Definition of Coordinate Variable in Section 1.2

Current:

We use this term precisely as it is defined in the NUG section on coordinate variables.
It is a one-dimensional variable with the same name as its dimension [e.g., time(time) ],
and it is defined as a numeric data type with values that are ordered monotonically.
Missing values are not allowed in coordinate variables.

Proposed:

A one-dimensional variable with the same name as its dimension [e.g., time(time) ]
and numeric data type, with values that are ordered monotonically. Missing values
are not allowed in coordinate variables. This matches the definition of this term in
the NUG section on coordinate variables, except that CF does not allow
non-numeric data types.

4. Recommendation on Auxiliary Coordinates (Chapter 5)

Fourth paragraph of chapter 5:

Current

We recommend that the name of a multidimensional coordinate variable should not
match the name of any of its dimensions because that precludes supplying a coordinate
variable for the dimension.

Proposed (multidimensional --> auxiliary):

We recommend that the name of an auxiliary coordinate variable should not match
the name of any of its dimensions because that precludes supplying a coordinate
variable for the dimension.

JimBiardCics commented 5 years ago

@martinjuckes As I read CF, there is no such thing as a "multi-dimensional coordinate variable" that is anything but an auxiliary coordinate variable. There is no provision in CF for connecting an auxiliary coordinate variable with a data variable apart from including the name in a coordinates attribute on the data variable.

Relevant parts of Section 1.2 declare

auxiliary coordinate variable Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s).

coordinate variable We use this term precisely as it is defined in the NUG section on coordinate variables. It is a one-dimensional variable with the same name as its dimension [e.g., time(time) ], and it is defined as a numeric data type with values that are ordered monotonically. Missing values are not allowed in coordinate variables

multidimensional coordinate variable An auxiliary coordinate variable that is multidimensional.

scalar coordinate variable A scalar variable (i.e. one with no dimensions) that contains coordinate data. Depending on context, it may be functionally equivalent either to a size-one coordinate variable (Section 5.7, "Scalar Coordinate Variables") or to a size-one auxiliary coordinate variable (Section 6.1, "Labels" and Section 9.2, "Collections, instances, and elements").

recommendation Recommendations in this convention are meant to provide advice that may be helpful for reducing common mistakes. In some cases we have recommended rather than required particular attributes in order to maintain backwards compatibility with COARDS. An application must not depend on a dataset’s adherence to recommendations.

These definitions make it quite clear that a non-numeric variable cannot ever be a coordinate variable. I've got no problem with clearing up the wording in Section 1.2, but everything that follows is dependent on these definitions of terms.

Section 5 paragraph 4 states

We recommend that the name of a multidimensional [auxiliary] coordinate variable should not match the name of any of its dimensions because that precludes supplying a coordinate variable for the dimension. This practice also avoids potential bugs in applications that determine coordinate variables by only checking for a name match between a dimension and a variable and not checking that the variable is one dimensional.

I included the definition of "recommendation" earlier because this paragraph is a recommendation, not a requirement. Notice that the definition of a recommendation states "An application must not depend on a dataset’s adherence to recommendations." I can see a valid argument against human confusion for recommending against allowing a multidimensional auxiliary coordinate variable to have a dimension that has the same name as the variable name, but I think the assertion that such a construction precludes providing a coordinate variable for the dimension is incorrect, and deciding that a variable is a coordinate variable on the basis of a match between one dimension and the variable name is a particularly bad practice. An auxiliary coordinate variable can be fully compliant with CF and not follow this recommendation.

I think we should consider this recommendation to be defective. I'm going to break here on account of the length of this comment and continue in another.
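
The programming practice that the Section 5 recommendation refers to can be sketched as follows. This is an illustration under stated assumptions: the file layout, the `NUMERIC_TYPES` set (an illustrative subset of netCDF type names), and both function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Variable name -> (data type, dimension names); a hypothetical file layout.
VarTable = Dict[str, Tuple[str, Tuple[str, ...]]]

variables: VarTable = {
    "time": ("double", ("time",)),
    "lat": ("float", ("lat", "lon")),        # 2D auxiliary coordinate named after a dimension
    "temp": ("float", ("time", "lat", "lon")),
}

# Illustrative subset of numeric netCDF type names (an assumption for this sketch).
NUMERIC_TYPES = {"byte", "short", "int", "int64", "float", "double"}


def naive_coordinate_names(table: VarTable) -> List[str]:
    """Name match only: any variable sharing a name with one of its dimensions."""
    return sorted(name for name, (_, dims) in table.items() if name in dims)


def strict_coordinate_names(table: VarTable) -> List[str]:
    """Name match plus the one-dimensional and numeric-type checks."""
    return sorted(
        name
        for name, (dtype, dims) in table.items()
        if dims == (name,) and dtype in NUMERIC_TYPES
    )


print(naive_coordinate_names(variables))
print(strict_coordinate_names(variables))
```

The name-only check reports lat as a coordinate variable, while the stricter check reports only time. A reader that applies the full Section 1.2 definition does not need the dataset to follow the recommendation.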

JimBiardCics commented 5 years ago

I realized that I left a bit out above. The definitions in Section 1.2 also make it clear that a multidimensional numeric variable cannot ever be a coordinate variable. Both multidimensional numeric variables and string variables can be auxiliary coordinate variables.

JimBiardCics commented 5 years ago

@martinjuckes Now to the question of 1D string auxiliary coordinate variables. A 1D string variable with matching dimension and variable names is, per the Section 1.2 definitions (see my previous comment), a fully-compliant auxiliary coordinate variable. I believe that the construction

dimensions:
   basin = 4;
variables:
   float data(basin);
      data:coordinates = "basin";
   string basin(basin);

is valid according to the current version of CF. It satisfies all the requirements. It is also compliant with the current (non-binding) recommendation from Section 5 paragraph 4, which doesn't mention 1D string auxiliary coordinate variables.

I personally think it is fine for a 1D type string auxiliary coordinate variable to have matching variable and dimension names. It is evocative of its use as a 1D coordinate for a data variable (though it is not a "true" coordinate).

JimBiardCics commented 5 years ago

@martinjuckes After all that is said and done, I like your change to the Section 1.2 definition of coordinate variable. I disagree with your change to the Section 5 paragraph 4 recommendation that I believe to be defective. Here's an alternative suggestion.

We could change Section 5 paragraph 4 from

We recommend that the name of a multidimensional [auxiliary] coordinate variable should not match the name of any of its dimensions because that precludes supplying a coordinate variable for the dimension. This practice also avoids potential bugs in applications that determine coordinate variables by only checking for a name match between a dimension and a variable and not checking that the variable is one dimensional.

to

We recommend that the name of a multidimensional [auxiliary] coordinate variable should not match the name of any of its dimensions because of the potential for such a construction to confuse users.

martinjuckes commented 5 years ago

Hi Jim,

(1) yes, it is clear that a multidimensional coordinate variable is always an auxiliary coordinate, and that the converse is not true;

(2) yes, it is clear that Section 5, para 4 is a recommendation, not a requirement;

(3) A construction of the form string basin(basin) is clearly not supported at the moment because string variables are not allowed. This will change with your proposed extension (pull request 140), but the usage you suggest above, in which string basin(basin) could be an auxiliary coordinate is clearly something that has never been allowed in the past and so surely must be considered as an extension. Would you like to propose it as an extension?

(4) Removing the sentence from Section 5 para 4 that states "This practice also avoids potential bugs in applications that determine coordinate variables by only checking for a name match between a dimension and a variable and not checking that the variable is one dimensional" is an interesting suggestion, but I don't see the relevance to this discussion. Why do you think our view on this point needs to change (it has been in there since version 1.0)? I'm in favour of clearing up text which is not needed, but I don't see the grounds for considering this sentence as a defect.

JimBiardCics commented 5 years ago

@martinjuckes Regarding your point (3): What convention prevents the construction string basin(basin) from being an auxiliary coordinate variable? I may well have missed it, but I haven't found one, myself. Character array variables have long been valid auxiliary coordinate variables. See Section 4.5 and Section 6.1. I see no need for an extension once the type string is accepted for use with variables.

Regarding your point (4): The sentence in the recommendation in Section 5 paragraph 4 that I am saying is defective is based on an appeal to a bad programming practice. The definition of recommendation contains the statement, "An application must not depend on a dataset’s adherence to recommendations." Applied to the sentence I am suggesting we remove, this definitional statement reads, "An application must not depend on a multidimensional [auxiliary] coordinate variable avoiding the use of a dimension that has the same name as the variable it is applied to." This directly contradicts the recommendation.

martinjuckes commented 5 years ago

Hello Jim,

Under the existing convention basin(basin) is a coordinate variable and is not allowed to be a string. That is the whole point of this discussion. Regarding point 4, I'm afraid I don't see the contradiction you allude to.

Perhaps it would help to have some other views on these points. @JonathanGregory , @ethanrd : do you have any views on Jim's suggestion that a variable of the form string basin(basin) should be allowed as an auxiliary coordinate variable (see this and preceding posts)? This is an alternative to point 4 of my proposed changes, in which I suggest extending the existing recommendation against using auxiliary coordinates of the form string basin(basin,n) to the single-dimension case.

JonathanGregory commented 5 years ago

Dear Martin and Jim,

I'm inclined to agree with Martin that string basin(basin) should not be allowed as an auxiliary coordinate variable because it looks like a dimension (NUG) coordinate variable, in being 1D and having the name of its dimension. At present, such variables are 2D char variables and they have a name which differs from their dimension e.g. char basinname(basin,stringlength). The string version of such an aux coord var would be string basinname(basin). That means that string basin(basin) would not be allowed at all in CF - it can't be a dimension coord var either because it's not numeric. Not being able to use this construction might seem regrettable, since it looks convenient, but on the whole I feel that it would confuse the convention if we allowed it as an aux coord var.

Best wishes, Jonathan

taylor13 commented 5 years ago

Dear all,

I also think that if a string aux. coord. var. name and its dimension's name are identical, this could unnecessarily mislead some into thinking it is a coordinate variable (because of the NUG convention), so CF should not allow it.

best regards, Karl

JimBiardCics commented 5 years ago

@martinjuckes @JonathanGregory I hear where you are coming from. I may have been somewhat unclear before. What I am trying to point out is that the Conventions don't currently proscribe such a form. There is no prohibition in the text against having a 1D variable of non-numeric type that has matching variable and dimension name. Such a variable cannot be a coordinate variable, by definition, because it is non-numeric. It meets all the requirements for a valid auxiliary coordinate variable. There is also no prohibition in the text against having a multidimensional variable with a dimension name that matches the variable name. Such a variable meets all the requirements for a valid auxiliary coordinate variable.

The only basis I have found for any assertion regarding such variables is the defective recommendation in Section 5 paragraph 4, which can't actually be regarded as proscriptive because it is a recommendation.

If we wish to proscribe a variable of the form string basin(basin), we can certainly do so, but we will need to change the language of the definition of auxiliary coordinate variable in Section 1.2 to directly prohibit such a variable having a name that is the same as the name of any of its dimensions. If that is the community consensus I'm OK with that. I can see valid arguments for all three of the options I laid out in my earlier comment.

davidhassell commented 5 years ago

Is there a backward compatibility issue here? If we allow string basin(basin) we are likely to break software. Given that there is not a scientific use-case here (right? or have I missed it?) I think it would be best to disallow string basin(basin)

JimBiardCics commented 5 years ago

@davidhassell It's possible. Software that didn't check on the type might end up doing something unexpected. An appeal to potential software problems is problematic, as software that properly implements CF as written should check the type of a variable like string basin(basin) and decide it is not a coordinate variable. But that doesn't mean that people haven't made naïve assumptions when writing code. A variable such as char basin(basin), which is currently the only "non-numeric" option available and which CF declares to be a scalar "string" variable, is certainly problematic for a few reasons.

There is not a scientific use case. You could say the same for a number of aspects of CF.

The assumption appears to have been that auxiliary coordinate variables wouldn't ever look like coordinate variables. That's probably why we have the recommendation in Section 5 paragraph 4. We just didn't write the conventions to expressly prohibit such a case.

JimBiardCics commented 5 years ago

If we don't want to allow 1D non-numeric auxiliary coordinate variables to have the form <type> <name>(<name>) I suggest that we change the definition of auxiliary coordinate variable in Section 1.2 from

auxiliary coordinate variable Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s).

to

auxiliary coordinate variable Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, an auxiliary coordinate variable must not have the same name as the name of any of its dimensions.

martinjuckes commented 5 years ago

Hello All,

thanks for those comments. I realise now that there was an error in my proposed new definition of "coordinate variable" (item 3 here): it implied an unintended change in the interpretation of a variable int x(x) with non-monotonic values. This should be interpreted, I believe, as a coordinate variable with non-compliant values. My suggested text would have implied, rather unhelpfully, that it was merely a "data variable".

As Jim has pointed out, there is a choice about how we deal with string variables here. We could say string basin(basin) is a coordinate variable with an invalid data type, or we could say that it is not a coordinate variable, but still a valid data variable with an index dimension. Jonathan and David have argued for the 1st, and I think Karl's comments also point in that direction. I think Jim has been following the 2nd interpretation. I've created two revised definitions that set out these options below.

3(a) Revised proposal for Coordinate Variable:

Any one-dimensional variable with the same name as its dimension [e.g., time(time)] is interpreted as a coordinate variable. Coordinate variables must have a numeric data type and data values that are ordered monotonically without any missing values. This matches the definition of this term in the NUG section on coordinate variables, except that CF does not allow non-numeric data types.

3(b) Alternative revised proposal for Coordinate Variable:

Any one-dimensional variable with the same name as its dimension [e.g., time(time)] and numeric data type is interpreted as a coordinate variable. Coordinate variables must have data values that are ordered monotonically without any missing values. This matches the definition of this term in the NUG section on coordinate variables, except that CF does not interpret non-numeric variables as coordinate variables.

I also prefer 3(a), as it reduces the room for confusion which might arise if string basin(basin) is considered as a coordinate variable in NUG and a data variable in CF.
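
The practical difference between 3(a) and 3(b) can be sketched as a small classifier. This is a hedged illustration: the function `interpret`, its return strings, and the `NUMERIC_TYPES` set are assumptions made for the example, not text from either proposal.

```python
from typing import Sequence

# Illustrative subset of numeric netCDF type names (an assumption for this sketch).
NUMERIC_TYPES = {"byte", "short", "int", "int64", "float", "double"}


def interpret(name: str, dtype: str, dims: Sequence[str], proposal: str) -> str:
    """Classify a variable under proposal "3a" or "3b"."""
    if proposal not in ("3a", "3b"):
        raise ValueError(f"unknown proposal: {proposal!r}")
    if tuple(dims) != (name,):
        return "not a coordinate variable"
    if dtype in NUMERIC_TYPES:
        # Both proposals agree here; monotonicity is then a requirement on the
        # values, so int x(x) with unordered values is still a coordinate
        # variable, just a non-compliant one.
        return "coordinate variable"
    if proposal == "3a":
        # 3(a): the name match alone triggers the interpretation.
        return "coordinate variable (invalid data type)"
    # 3(b): the numeric type is part of the trigger.
    return "data variable"


print(interpret("basin", "string", ("basin",), "3a"))
print(interpret("basin", "string", ("basin",), "3b"))
```

Under 3(a) a file containing string basin(basin) is non-compliant, whereas under 3(b) the same file is compliant and the variable is simply not a coordinate.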

If we accept 3(a), I'm not sure it is necessary to change the auxiliary coordinate variable definition. In the current convention the form basin(basin,n) is allowed as a data variable or an auxiliary coordinate, but generates a warning if used as an auxiliary coordinate variable. This applies to all data types. The relevant recommendation is in section 5. It might help the clarity of the document to repeat this in the definition text, but I think we should keep it as a recommendation rather than strengthen it to a firm rule.

On the other hand, if we are going to be precise there is a problem with the phrase "there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s)". It is trying to say that there is no semantic relationship, or no formal relationship within the CF data model. There could be other types of relationships: the point is that CF doesn't care about any relationships.

@JimBiardCics : what do you think of the following alternative for auxiliary coordinate variable:

Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, relationships between the name of the auxiliary coordinate and the names of its dimensions have no significance. It is recommended, however, that an auxiliary coordinate variable does not have the same name as the name of any of its dimensions.

JimBiardCics commented 5 years ago

@martinjuckes If we go with the majority of responders on this thread regarding the acceptability of string basin(basin), then I agree that something close to your option 3(a) is best. And I'm OK with going that way since I appear to be the only one that feels differently.

Regarding your statement:

I also prefer 3(a), as it reduces the room for confusion which might arise if string basin(basin) is considered as a coordinate variable in NUG and a data variable in CF.

NUG does consider string basin(basin) as a coordinate variable, so I'm not quite sure what you are getting at. I guess you see 3(a) as precluding any variable with the form <type> basin(basin) if the type is not numeric. I think that is overly restrictive in general. I much prefer being quite clear about what is an allowed coordinate variable and what is an allowed auxiliary coordinate variable. Anything else would be considered a data variable. We can then make a recommendation that variables of the form name(name), name(name, ...), or name(..., name, ...) are to be avoided since they could be misinterpreted by inattentive readers as coordinate variables.

My next comment will include suggested new text.

I disagree with your comments regarding auxiliary coordinate variables. First and foremost, a recommendation does not, per its own definition, define anything that a person writing software to read a netCDF file should depend on.

Regarding your statement:

In the current convention the form basin(basin,n) is allowed as a data variable or an auxiliary coordinate, but generates a warning if used as an auxiliary coordinate variable.

I'm guessing there's a typo in this sentence, because there's not an 'auxiliary coordinate' as opposed to an 'auxiliary coordinate variable'.

Looking at the rest of the paragraph that begins "If we accept 3(a) ...", I believe that if we are going to disallow string basin(basin), then we should disallow char basin(basin,len). They are functionally equivalent. The recommendation in Section 5 paragraph 4 was an attempt to discourage any such constructions, but it is a defective recommendation in a number of ways (as I have described in previous comments). Whatever we do, let's address the defective recommendation.

I agree that the phrase about relationship between variable and dimension name in the definition of an auxiliary coordinate variable is unclear. We should change it.

JimBiardCics commented 5 years ago

@martinjuckes As I mentioned before, I am confused by your sentence

In the current convention the form basin(basin,n) is allowed as a data variable or an auxiliary coordinate, but generates a warning if used as an auxiliary coordinate variable.

As I read it again this morning (my time), I think I may see what you are getting at. Are you saying that the Conventions define a variable with the form <type> basin(basin) to be a valid data variable or auxiliary coordinate variable, but the cfchecker application generates a warning if the variable is used as an auxiliary coordinate variable?

JimBiardCics commented 5 years ago

How about this approach? Define a coordinate variable to be a 1D numeric monotonic variable with matching variable and dimension name that does not contain any fill or missing values. Define an auxiliary coordinate variable to be an N-D variable with a name that does not match any dimension name that contains data that is intended to be interpreted as coordinate information. Remove the recommendation from Section 5. This would allow someone to make a variable of the form <type> name(name) that isn't any sort of coordinate variable, but would actively prohibit constructions such as string basin(basin) or int basin(basin, len). In specific terms, make the changes below.

In Section 1.2 (and in the order below)

Coordinate Variable

A coordinate variable is a one-dimensional variable with a numeric type that has the same name as the name of its dimension (e.g., int time(time)). The contents of a coordinate variable shall be monotonic — that is, consistently either increasing or decreasing in value, and shall not contain fill or missing values. Such a variable functions as a domain axis for any variable that has the corresponding dimension name as one of its dimensions. This definition differs from the definition in the NUG which does not require numeric type, monotonicity, or lack of fill or missing values.

Auxiliary Coordinate Variable

An auxiliary coordinate variable is a variable containing coordinate information which does not meet all the requirements of a coordinate variable. An auxiliary coordinate variable shall not have a name matching any of the names of its dimensions. An auxiliary coordinate variable may have a non-numeric type (allowing it to represent a category or label axis), may be non-monotonic, and may contain fill and missing values.

In Section 5

Delete paragraph 4.

davidhassell commented 5 years ago

@JimBiardCics I like this approach.

I'd like to pepper in a few "strictly"s, and I'd rather shy away from your use of "domain axis" in the text, simply because a "domain axis construct" is a CF data model construct that does not map to a CF-netCDF coordinate variable.

How about (new text in italics):

A coordinate variable is a one-dimensional variable with a numeric type that has the same name as the name of its dimension (e.g., int time(time)). The contents of a coordinate variable shall be strictly monotonic — that is, consistently either strictly increasing or strictly decreasing in value, and shall not contain fill or missing values. Such a coordinate variable is able to unambiguously provide cell locations for any variable that has the corresponding dimension name as one of its dimensions. This definition differs from the definition in the NUG which does not require numeric type, monotonicity, or lack of fill or missing values.

and in the auxiliary coordinate paragraph: "non-strictly-monotonic", (if that makes grammatical sense!).

JimBiardCics commented 5 years ago

@davidhassell That's a good point. Simple monotonicity can have repeated values. I'm good with the first strictly. I think the second and third are perhaps a bit much. How about we either drop them or drop the word 'consistently'?
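To make the distinction concrete, here is a minimal sketch in plain Python. The function names `is_monotonic` and `is_strictly_monotonic` are hypothetical and exist only for illustration; they are not part of any CF or netCDF tooling.

```python
def is_monotonic(values):
    """True if values never change direction; repeated values are allowed."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def is_strictly_monotonic(values):
    """True if values always increase or always decrease; no repeats allowed."""
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


# A sequence with a repeated value passes the simple test but not the strict one.
repeated = [0, 1, 1, 2]
print(is_monotonic(repeated), is_strictly_monotonic(repeated))
```

With a repeated value, two cells would share the same coordinate, which is why the simple test is not enough to locate cells unambiguously.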

Auxiliary coordinates have no monotonicity requirements whatsoever, so I don't think the use of strictly in that instance is warranted.

When I mentioned 'domain axis', I was meaning it in more of a mathematical sense, as in the domain vs the range (or co-domain) of a function. How about the phrase 'mathematical coordinate axis'?

Or how about this? A coordinate variable functions as an independent coordinate axis for values in a variable that has the corresponding dimension name as one of its dimensions.

martinjuckes commented 5 years ago

Dear Jim, David,

I still see no grounds for modifying the definition of an auxiliary coordinate variable.

I don't see any scientific use case for accepting a variable of the form int x(x) which is non-monotonic as an auxiliary coordinate variable. Such a variable would, in the current convention, be disallowed. Hence, I oppose Jim's suggested wording above.

I'm puzzled by the suggestion that we change the requirement on monotonicity to be a requirement on strict-monotonicity. What is the basis for requiring this change? What about backward compatibility? If we were starting from scratch I can see that it might make sense, but the requirement has been clearly stated as monotonicity since CF-1.0.

davidhassell commented 5 years ago

Hi Martin, Jim,

Re. "strict": Is it not the case that strict monotonicity is what is meant, and what has been assumed by everyone? I would suggest that omitting the word "strict" from CF has been a defect. Outside of the DSG section, the word "monotonic" is used just once, in the definition of a coordinate variable (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html#terminology). The NUG refers to "Current application packages that make use of coordinate variables commonly assume they are numeric vectors and strictly monotonic (all values are different and either increasing or decreasing)." (https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_set_components.html#coordinate_variables, with the caveats of this discussion to date!)

(I have to go now, but will think about Martin's other points later ...)

martinjuckes commented 5 years ago

Hi David,

OK .. you may be right. I can see that the CF Conformance document also says "strictly", so perhaps it is reasonable to regard the omission of this word from the convention text as an error.

JimBiardCics commented 5 years ago

@davidhassell My only question regarding the use of 'strictly' is the 'consistently strictly' bit. I'm fine with either

The contents of a coordinate variable shall be strictly monotonic — that is, consistently either increasing or decreasing in value, and shall not contain fill or missing values.

or

The contents of a coordinate variable shall be strictly monotonic — that is, either strictly increasing or strictly decreasing in value, and shall not contain fill or missing values.

JimBiardCics commented 5 years ago

@martinjuckes As David has pointed out, the assumption has always been strict monotonicity, even though we haven't used the specific term. I'd naïvely assumed in the past that monotonic and strictly monotonic were one and the same. (I learned a new thing yesterday!)

JimBiardCics commented 5 years ago

@martinjuckes Regarding your comment

I don't see any scientific use case for accepting a variable of the form int x(x) which is non-monotonic as an auxiliary coordinate variable. Such a variable would, in the current convention, be disallowed. Hence, I oppose Jim's suggested wording above.

The definitions of coordinate variable and auxiliary coordinate variable that I suggested specifically prohibit treating a not-strictly-monotonic variable of the form int x(x) as either a coordinate variable or an auxiliary coordinate variable. It fails the coordinate variable test because it is not strictly monotonic. It fails the auxiliary coordinate variable test because the variable has the same name as its dimension. It is a data variable of some sort, but not a coordinate of some sort.

The term 'shall' in contract-type documents has the meaning of 'must have'. If I shall eat an egg for breakfast, it means I am not allowed to do otherwise.
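The reasoning above can be sketched as a small classifier. This is only an illustration of the proposed definitions, not an existing checker: the `Var` class, the `NUMERIC_TYPES` set, and the `used_as_coordinate` flag (standing in for "contains coordinate information", e.g. being named in a `coordinates` attribute) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Assumed set of numeric netCDF type names, for illustration only.
NUMERIC_TYPES = frozenset({"byte", "short", "int", "int64", "float", "double"})


@dataclass
class Var:
    name: str
    dtype: str
    dims: Tuple[str, ...]
    values: Sequence
    has_missing: bool = False          # contains fill or missing values
    used_as_coordinate: bool = False   # hypothetical: intended as coordinate info


def strictly_monotonic(values):
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


def classify(var):
    """Classify a variable under the proposed definitions."""
    name_matches_dim = var.name in var.dims
    if (len(var.dims) == 1 and name_matches_dim
            and var.dtype in NUMERIC_TYPES
            and not var.has_missing
            and strictly_monotonic(var.values)):
        return "coordinate variable"
    # An auxiliary coordinate variable shall not share a name with its dimensions.
    if var.used_as_coordinate and not name_matches_dim:
        return "auxiliary coordinate variable"
    return "data variable"


# int x(x) with values that are not strictly monotonic fails both tests.
print(classify(Var("x", "int", ("x",), [0, 2, 1], used_as_coordinate=True)))
```

Under these assumptions the not-strictly-monotonic `int x(x)` falls through to "data variable", which is the outcome described above.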

JimBiardCics commented 5 years ago

@martinjuckes I strongly believe that part of our problem with the different coordinate variable definitions is that they are both imprecise and easy to misinterpret. I think the best way out of this is to make both of them clearer, which includes putting the definition of coordinate variable first.

martinjuckes commented 5 years ago

Hello Jim,

I agree that we need clear definitions. I'm glad that you do not intend a variable such as int x(x) with non-monotonic values to be allowed as an auxiliary coordinate variable. The wording you have proposed is not acceptable, because your wording would allow the usage which you don't want.

JimBiardCics commented 5 years ago

@martinjuckes It specifically doesn't allow it. Please explain how it allows such a variable.

martinjuckes commented 5 years ago

Jim, you list 3 criteria a coordinate variable must satisfy, and then say that a variable may be an auxiliary coordinate variable if it doesn't satisfy all of these.

I believe we should stick to the current interpretation, that int x(x) with non-monotonic values is a coordinate variable but has non-compliant values.

JimBiardCics commented 5 years ago

@martinjuckes Read all the requirements for an auxiliary coordinate variable. I think you'll find that your suggested variable is prohibited.

Auxiliary Coordinate Variable

An auxiliary coordinate variable is a variable containing coordinate information which does not meet all the requirements of a coordinate variable. An auxiliary coordinate variable shall not have a name matching any of the names of its dimensions. An auxiliary coordinate variable may have a non-numeric type (allowing it to represent a category or label axis), may be non-monotonic, and may contain fill and missing values.

martinjuckes commented 5 years ago

OK, sorry I mis-read that.

(1) Are you intending to allow int x(x) with non-monotonic data values as a data variable?

(2) Why are you introducing new restrictions on auxiliary coordinate variables?

JimBiardCics commented 5 years ago

@martinjuckes

(1) Yes. Based on the conversation up to now, I thought that was what people thought was best.

(2) The "new restrictions" are the very things that I understand everyone to have said they wanted to prohibit in auxiliary coordinate variables. The point of the defective recommendation in Section 5 is to steer people away from constructions such as string basin(basin), int time(time, other), etc. The change to the definition of auxiliary coordinate variable makes that formal. I was under the impression that you declared support for such a restriction in this comment.

As it stands, without changing the definition of auxiliary coordinate variable, there is nothing to prevent string basin(basin), int time(time) where time is not-strictly-monotonic, or float lat(lat, other) being used as auxiliary coordinate variables. They may not be recommended, but a recommendation is not a requirement.

martinjuckes commented 5 years ago

Hi Jim,

On point (1), I'm surprised. This is a change in the convention, and I haven't seen a scientific use case. The current rule is, I believe, that int x(x) is a coordinate variable and the values are non-compliant if they are not strictly (according to the conformance document) monotonic. Allowing int x(x) to be a data variable when it has non-monotonic values would, I think, be a regrettable source of confusion.

On point (2), there has been a clear preference expressed by @JonathanGregory and @taylor13 against allowing variables of the form string basin(basin). If we accept this, there is no need for an additional rule in the auxiliary coordinate variable definition. If you don't accept it, I would find it helpful if you could address their concerns.

JimBiardCics commented 5 years ago

@martinjuckes Regarding (1), I don't see any convention that has ever prohibited a not-strictly-monotonic 1D data variable that has a matching dimension name.

Regarding (2), the current definition of auxiliary coordinate variable is imprecise and allows such constructions as char time(time, len), etc., as I just pointed out. These are constructions that @taylor13 and @JonathanGregory, among others, appear to be opposed to. Both definitions need to be improved to close loopholes.

We can also write the definition of a coordinate variable to say that any 1D variable with a matching dimension name shall be considered to be a coordinate variable, and then state that such a variable will be considered to be an invalid coordinate variable if it has non-numeric type, contains fill or missing values, or is not strictly monotonic.
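This alternative (identify by name and shape first, then validate) could look roughly like the sketch below. The function name `coordinate_variable_errors`, the `NUMERIC_TYPES` set, and the error strings are hypothetical; they are not taken from the CF Checker or the conformance document.

```python
# Assumed set of numeric netCDF type names, for illustration only.
NUMERIC_TYPES = frozenset({"byte", "short", "int", "int64", "float", "double"})


def coordinate_variable_errors(name, dtype, dims, values, fill_value=None):
    """Return None if the variable is not a coordinate variable at all;
    otherwise return a list of reasons it is invalid (empty if valid)."""
    # Step 1: identification depends only on name and dimensions.
    if tuple(dims) != (name,):
        return None
    # Step 2: validation of type and contents.
    if dtype not in NUMERIC_TYPES:
        return ["non-numeric type"]
    if any(v is None or v == fill_value for v in values):
        return ["contains fill or missing values"]
    pairs = list(zip(values, values[1:]))
    if not (all(a < b for a, b in pairs) or all(a > b for a, b in pairs)):
        return ["not strictly monotonic"]
    return []


print(coordinate_variable_errors("x", "int", ("x",), [0, 2, 1]))
print(coordinate_variable_errors("basin", "string", ("basin",), ["a", "b"]))
```

The design difference from the earlier proposal is that a failing `int x(x)` is reported as an invalid coordinate variable rather than silently reclassified as a data variable.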

martinjuckes commented 5 years ago

Regarding (1) the current conformance rule is clear: "A coordinate variable must have values that are strictly monotonic (increasing or decreasing)" is specified as a requirement on coordinate variable values, not a means of identifying a coordinate variable. If you try it, the CF Checker returns "Error: (5): co-ordinate variable not monotonic". I believe this is the intended interpretation of the wording in the CF Convention.

There is some ambiguity in the wording in the CF Convention, but NUG is clear that "A variable with the same name as a dimension is called a coordinate variable" and that monotonicity is a rule applied to coordinate variables.

Regarding (2): I haven't seen anything from Karl or Jonathan arguing for a change in the convention regarding char time(time,len). The convention has a clear recommendation against using this construction in a multidimensional coordinate variable, and I haven't seen a use case for changing it either. I have seen them argue against allowing string basin(basin) in any form.

JonathanGregory commented 5 years ago

Dear @martinjuckes and @JimBiardCics

Yes, I agree that we recommend against char time(time,len) or any multidimensional coordinate variable whose name is the same as the name of any of its dimensions, because of the possible confusion with NUG coordinate variables. I don't see a reason to change that.

Although it makes no difference to the meaning of the convention, would the text be clearer if we didn't use the unqualified phrase "coordinate variable", because of the confusing situation that "auxiliary coordinate variable" is not a special type of "coordinate variable"? Instead, we could call them "NUG coordinate variable" everywhere in the document, or we could follow the CF data model and call them "dimension coordinate variable".

Cheers

Jonathan

martinjuckes commented 5 years ago

Hi Jonathan,

I like the idea of replacing the phrase coordinate variable with something which is clearly (in a lexical sense) disjoint from auxiliary coordinate variable, but I don't think we should use NUG coordinate variable now that we have a slight divergence of requirements. I'd be happy with dimension coordinate variable.

Can you comment on the construction string basin(basin)? I would prefer this to be considered as a coordinate variable with an invalid type, while Jim is suggesting a slightly different interpretation: namely (I believe), that it might be dis-allowed as a coordinate variable and would instead be interpreted as a valid data variable.

JonathanGregory commented 5 years ago

Dear Martin

I agree that string basin(basin) cannot be a dimension coordinate variable because of its type, as you say. At the moment we recommend against multidimensional coordinate variables with a name matching any of their dimensions to avoid confusion with dimension coordinate variables. That is only a recommendation, but since the confusion is potentially greater for one-dimensional string variables whose name is the same as their dimension, like string basin(basin), I think we should prohibit them, so they give an error, not just a warning. (As I said above, it seems a bit of a shame to disallow a valid netCDF construction, but the CF data model doesn't allow string-valued dimension coordinate variables, which is what this looks like.)
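As a rough illustration of the error-versus-warning split described above, a checker rule might be sketched as follows. The function `name_clash_diagnostic` and its messages are hypothetical, and treating only the "string" type as the prohibited one-dimensional case is an assumption drawn from the wording of this comment.

```python
def name_clash_diagnostic(name, dtype, dims):
    """Return None, or a (severity, message) pair, for a variable whose
    name matches one of its dimensions."""
    dims = tuple(dims)
    if name not in dims:
        return None
    if len(dims) == 1:
        if dtype == "string":
            # Proposed prohibition: looks like a string-valued dimension coordinate.
            return ("error", f"string {name}({name}) is not allowed")
        return None  # ordinary candidate dimension coordinate variable
    # Existing recommendation: multidimensional variable named after a dimension.
    return ("warning", f"{name} shares its name with one of its dimensions")


print(name_clash_diagnostic("basin", "string", ("basin",)))
print(name_clash_diagnostic("time", "char", ("time", "len")))
```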

Best wishes

Jonathan

JimBiardCics commented 5 years ago

@JonathanGregory When you say

Yes, I agree that we recommend against char time(time,len) or any multidimensional coordinate variable whose name is the same as the name of any of its dimensions, because of the possible confusion with NUG coordinate variables.

do you mean that you are happy to leave it as a recommendation rather than a prohibition? A recommendation is not a requirement. The CF definition of recommendation specifically states that software should not depend on a file conforming to a recommendation, so recommendations are, in effect, statements along the lines of "it looks nicer this way".

JimBiardCics commented 5 years ago

I'm fine if we want to provide a cleaner naming to differentiate the types of coordinate variables. I think primary coordinate variable might be the best match to auxiliary coordinate variable.