cf-convention / cf-conventions

AsciiDoc Source
http://cfconventions.org/cf-conventions/cf-conventions
Creative Commons Zero v1.0 Universal
87 stars 45 forks source link

grid cells with a varying number of cell bounds #163

Closed taylor13 closed 3 months ago

taylor13 commented 5 years ago

Pull request 521


There was some discussion beginning in 2017 (thread begins here) and continuing in 2018 (beginning here) about how to handle the following:

Consider a (flattened) grid consisting of both triangular and rectangular grid cells with n total grid cells. In this case the cell_bounds for longitude should be dimensioned [n, 4] because there are a maximum of 4 cell vertices.

My question is: for the triangular grid cells, what value should be assigned to the 4th vertex (corner)? Should it be set to missing_value?

There seemed to be no strong objection to defining both the missing_value and _FillValue as attributes of the bounds variable and assigning the same value to both, which would also then occupy the 4th element of the triangular grid cell vertices.

As far as I know, no formal action was taken on this, but for CMIP6 we did recommend that netCDF files containing CMIP6 output be written consistent with the above guidance.

Subsequently, I have learned that another strategy has been adopted by some "remapping" codes (i.e., "regridding packages"). They expect the unneeded vertices of a "bounds" variable be filled with the same value as the last needed value. So, for example, if a triangular grid cell has vertices located a, b, and c, the bounds variable (dimensioned 4 to accommodate other quadralateral grid cells) would be filled with the values [a, b, c, c], the c being duplicated to occupy the unneeded 4th element.

I think this approach is superior to filling unused vertices with missing_value or _FillValue. If adopted, we could easily explain it in a sentence.

martinjuckes commented 5 years ago

It sounds reasonable, but UGRID appears to specify _FillValue ("if a face has fewer corners/edges than MaxNumNodesPerFace then the last edge indices shall be equal to _FillValue": UGRID Conventions). You are presumably interested in files that will not explicitly declare UGRID, but there is still a case for keeping consistency here.

Do you know how widespread the use of the a,b,c,c approach is?

taylor13 commented 5 years ago

@martinjuckes: No I don't know how widespread it is. Can others help out here?

I would note that UGRID defines lots of "index" arrays, where perhaps an integer missing value makes it easier for software to handle the complicated relationships among grid cells that UGRID tries to describe. I'm not sure we need CF to be consistent with UGRID.

JimBiardCics commented 5 years ago

@taylor13 My understanding in the past has been that storing _FillValue in the unused elements of bounds arrays was most appropriate given the general approaches of netCDF and CF. My reading of NUG has been that a variable element containing the "fill" value is to be treated as unused, never touched. In contrast, a variable element with a value equal to the missing_value attribute for the variable is one that has been explicitly labeled as "no observation available", with the implication that there should/could have been an observation in that element. My reading of CF has been that this use of _FillValue applies to bounds with varying numbers of vertices. This understanding appears to agree with the way UGRID has defined their usage.

I, personally, dislike the idea of repeating trailing values. It makes it more difficult to recognize that the vertices are not, in fact, valid vertices. Using the fill value is unambiguous and doesn't require you to construct the N-D vertex set before determining that some of the vertices are degenerate and should be discarded.

taylor13 commented 5 years ago

I agree this alternative approach is really just a clever trick that would allow most software needing to work with vertices to operate without modification. Is there a case where the software would care that is was connecting two points that were co-located?

The current approach with _FillValue used for unneeded vertices would require software to identify these special non-vertices and would complicate some software.

I agree that _FillValue, rather than missing_value is the better choice if we want to go with the current approach.

JimBiardCics commented 5 years ago

@taylor13 I guess it depends on the software. I don't work with GIS or other systems that use polygons as input enough to know what the "usual" case is, nor what the impact of degenerate vertices is. From a coding standpoint, pruning empty vertices is trivial when _FillValue is used. Pruning them with repeated values is more complicated.

If there are, in fact, lots of packages out there that expect and properly handle degenerate vertices, then perhaps we should allow that option, but only if those packages also read and properly use netCDF content with no additional code required. I will point out that we regularly adopt the attitude that CF does things "the CF way" and that software needs to adapt to CF, rather than the other way around.

martinjuckes commented 5 years ago

I've checked the OGC standard which is used, apparently, by most GIS software, and they appear to have another approach: repeating the first value to indicate a closed polygon (i.e. a,b,c,a instead of a,b,c,c). This arises naturally in OGC because a polygon is a special case of a line. A line has a start and an end (naturally) and it is turned into a polygon if the start and the end are identical. Unfortunately, it doesn't help us here because it would require 5 data points to represent a standard four-sided grid cell, and we really don't need that.

I agree with @JimBiardCics that using the _FillValue is reasonably easy from a software perspective. The only problem I can see is that it creates an overhead if software always has to check to see if there are triangles in a grid: it is not a huge cost, but possibly an irritant given that many grids will not have any triangles. There may be a case for using missing_value: it could be specified that missing_value must be defined if the grid has some anomalous cells, allowing software to conclude that the grid is uniform if the missing_value is absent. _FillValue is always defined, and so can't be used in this way.

In response to the question from @taylor13 about UGRID: I think that having a conflict with UGRID could cause problems. There is a proposal to integrate UGRID into the CF conventions (#153). The two use-cases could possibly be kept apart (since the UGRID rules appear to apply only when MaxNumNodesPerFace is specified) but I think it would be cleaner to maintain consistency. Software that is reading CF will then have a simple set of rules (having to check for all of _FillValue, missing_value and repeated values would be a burden) and will have to translate if it is using libraries with different approaches. I guess if you are plotting points you don't need to know when a value is repeated .. you can just plot it again .. but, often, I think software that is using the data would need to know whether the cell is a square or a triangle.

JimBiardCics commented 5 years ago

@martinjuckes The repetition of the first value to indicate a closed polygon doesn't have anything to do with the issue at hand. I can see your point about interpreting the presence of a missing_value attribute as indicating the need for more careful handling of the bounds variable, but this has the minor downside of adding another semantic to the missing_value attribute. It crosses my mind to wonder how all of this may interact with the use case where there is a missing measurement that leads to missing auxiliary coordinate values and bounds vertices that are associated with that measurement. We need to tread carefully.

We could address the potential overhead issue by declaring a new attribute which indicates whether or not the number of vertices in the bounds variable is variable or fixed.

I agree that the issue is deeper than a question of redundant plotting of points or zero-length lines. It may be important for a software package to know the sided-ness of each bounding polygon.

martinjuckes commented 5 years ago

@JimBiardCics : I think the GIS standard is relevant because it shows that there are other approaches to presenting a list of polygons with varying numbers of vertices. We still don't know how wide-spread the usage described by @taylor13 is, but I'm now confident that it does not extend to GIS (a negative result, but still a small step forward).

I agree that a range of problems can be avoided or mitigated by insisting on or recommending the use of a new attribute to declare when there are variable numbers of vertices.

I also agree with the point that there is potential conflict with the broader, pre-existing meaning of either _FillValue or missing_value. We could allow different missing_value values for different purposes: this is already mentioned as a possibility in table A1 of the convention, but would add some undesirable complexity when it comes to decoding the information. An alternative approach would be to argue that a data provider should know all the coordinates of a grid before trying to put data in CF, and hence the scope for conflicting interpretations of these attributes is acceptably small.

JimBiardCics commented 5 years ago

@martinjuckes The repeated first vertex policy doesn't address the question of varying numbers of vertices in a fixed array. The GIS approach differentiates an open polyline from a closed polygon, which is something we aren't concerned with in bounds variables.

I don't think we can assume that the data grid is always known a priori. It's certainly true for models, but it may well not hold for data acquisition scenarios.

martinjuckes commented 5 years ago

@JimBiardCics : I hope we are agreed that differentiating a triangle from a rectangle is the main topic here, with a side discussion on how it is done in other communities and standards, and a few other interesting tangents as usual.

Do you have a particular data acquisition scenario in mind? We are discussing grid cells in a regular grid, so I believe there is an assumption of some regularity in the structure.

JimBiardCics commented 5 years ago

@martinjuckes, I don't have a particular use case in mind. I think we shouldn't assume that the only use case is a regular grid that is known beforehand. There's no particular reason to assume it, is there?

martinjuckes commented 5 years ago

@JimBiardCics : I'm not sure. We have to deal with use cases we have. On the other hand, it is explicitly permitted to have missing values in auxiliary coordinates, and these auxiliary coordinates may have bounds attached, so I suppose there is an argument that it is implicit that the bounds could have missing values. Still. I'm reluctant to propose a change to the convention based on a hypothetical use case.

JimBiardCics commented 5 years ago

@martinjuckes, I think it's a great idea to add language to the bounds section in CF that explicitly describes how to handle varying number of vertices. The way to do it is already defined. It is not explicitly stated in relation to bounds vertices, but there is already a clear designator for unused variable elements—_FillValue. From the NUG, Appendix A

The fill value is returned when reading values that were never written.

and

Generic applications often need to write a value to represent undefined or missing values. The fill value provides an appropriate value for this purpose ...

My further argument against using missing_value is definitely hypothetical, but it goes to the point that using missing_value instead of using _FillValue adds a usage restriction that is not necessary.

In summary, UGRID got it right, and is following CF with the mechanism they are using. Let's absolutely add clear language about how to use _FillValue to designate unused bounds vertices and remove any uncertainty about it. Let's not make a change to CF beyond explaining the existing mechanism, particularly not in a way that would clash with the language in UGRID.

martinjuckes commented 5 years ago

@JimBiardCics , OK, I would be happy to accept the status quo, with some clarification. I think I misunderstood your interpretation of the distinction between _FillValue and missing_value, because I thought your objection to "adding another semantic to missing_value" applied equally to _FillValue. Are you suggesting that _FillValue should be interpreted as something like Intentionally Blank and missing_value as Something went wrong here?

That would make sense, but I think it goes beyond what is specified in CF or NUG. Older versions of the NUG had precisely the reverse interpretation: e.g. Version 2.3 of the NUG:

For example, it might be appropriate to use _FillValue to mean that data that was expected never appeared, but missing_value where the creator of the data intends data to be missing, as around an irregular region represented by a rectangular grid.

I know this is ancient, but it shows that people have had a different interpretation. The current NUG retreats from this and simply says that missing_value can be used to convey application specific information.

JimBiardCics commented 5 years ago

@martinjuckes, it's interesting to see how different people read the same text and get different things from it. (And it doesn't help when the text changes over time!) Based on the version(s) of the NUG that I am familiar with, I've always worked on the assumption that _FillValue was the appropriate "no data was ever here" value (ragged arrays, for example), and have used missing_value for instances where my data acquisition system was unable to calculate a value (low quality or missing input, etc.). In fact, I've even had cases where I made the missing_value attribute an array attribute and encoded different algorithm failure reasons that way.

martinjuckes commented 5 years ago

@JimBiardCics very true. It is certainly true that _FillValue means "no data was ever here", and that missing_value is intended to support use cases where people want to actively insert something to convey some meaning, as you describe. I still think it would be valid adopt this approach to convey the meaning "this part of the data array is redundant because the cell has less than 4 vertices", but I accept your point that there is no real benefit in doing this, and some disadvantages. It might work if CF had some mechanism for conveying specific meanings associated with one or more values of the missing_value attribute (e.g. a missing_value_meanings attribute), but I don't think we have, at this point, a sufficient motivation to introduce such an extension to the existing convention.

JimBiardCics commented 5 years ago

@martinjuckes I think it's possible to do what you are talking about with a combination of missing_value and flag_value/flag_meaning. This is non-standard, as flag_value and flag_meaning are currently marked in Appendix A as being applicable only to data variables, but it is technically doable with a change to the scope for those two attributes.

Given that, I don't think it's particularly necessary. I think an "empty" cell is a sufficient marker.

ChrisBarker-NOAA commented 4 years ago

@taylor13 asked: " Is there a case where the software would care that is was connecting two points that were co-located?"

Absolutely. If you are simply drawing it, then duplicated points are a non-issue. But if you are doing computations on the polygon: interpolating, computing fluxes over cell bounds, putting them into a spatial data structure for searching, any number of things -- degenerate cells are a real problem.

On the other hand, at least some plotting software (e.g. matplotlib) is perfectly happy with missing values :-)

And with floating point issues, I'd really hate to have to test equality to know a quad was really a triangle.

BTW: while refering to GIS standards makes sense in some cases, they actually are really bad at representing grids, and shared vertices in general, so we should not assume they have it right for CF use cases.

taylor13 commented 10 months ago

Hi all,

I think we all ran out of steam on this issue 3 years ago, but without too much effort we might come to a consensus on how to proceed. From my reading (which should be checked) of the discussion I propose that we provide the following guidelines:

For grids constructed from cells that do not all have the same number of sides (e.g., some rectangular cells and some triangular cells), the cell_bounds must be dimensioned to accommodate the maximum number of cell vertices. For cells with fewer than the maximum number of vertices, the unneeded elements in cell_bounds should be assigned the _FillValue.

We can improve on the language, but please say if you disagree with this approach.

Also, I have not really examined whether this suggestion is consistent with ugrid, although Martin has provided a summary above.

thanks, Karl

davidhassell commented 10 months ago

Hi Karl,

Sounds good to me. This suggestion is indeed consistent with UGRID, provided we also clarify that the unneeded indices must occur last in the bounds dimension.

Thanks, David

JonathanGregory commented 10 months ago

Thanks for returning to this, Karl @taylor13. I agree with your proposal.

taylor13 commented 9 months ago

No objections and some support have been voiced. I propose the following slightly modified text (to indicate the unneeded indices must occur last):

"For grids constructed from cells that do not all have the same number of sides (e.g., some rectangular cells and some triangular cells), the cell_bounds must be dimensioned to accommodate the maximum number of cell vertices. For cells with fewer than the maximum number of vertices, the unneeded elements should appear as the last elements in cell_bounds and should be assigned the _FillValue."

Please feel free to suggest edits if this can be improved.

ChrisBarker-NOAA commented 9 months ago

This all looks good -- but does this text belong in the CF doc? or in the UGRID spec, where it already says:

""" The case of a 2D mesh with mixed face sizes is identical to the 2D triangular mesh discussed above with the exception that not all faces have the same number of nodes. To support this variability we may use in the future a ragged array, but here we propose to use _FillValue to indicate faces with smaller number of nodes than the arrays allow. """ https://ugrid-conventions.github.io/ugrid-conventions/#2d-flexible-mesh-mixed-triangles-quadrilaterals-etc-topology

NOTE; in practice I've usually seen -1 used for the "missing node" value -- should we mention that as a common, but not required, practice?

maybe not clear enough (and we need to edit the doc ti change "we propose" to something more definite.

In

JonathanGregory commented 9 months ago

@ChrisBarker-NOAA. I think it belongs in the CF doc, because this can arise for CF metadata which isn't UGRID. That's how this issue began. Of course, it's desirable, and good, if the same convention is used for UGRID and non-UGRID in CF.

@taylor13. Thanks. I think the amended text is fine. Would you like help with doing a PR?

JonathanGregory commented 8 months ago

Dear Karl @taylor13

Three weeks have passed with no objections, and enough support has been given, so this change is accepted. We need a pull request to implement it. Karl, would you like to do this, with help as required?

Best wishes and thanks

Jonathan

JonathanGregory commented 5 months ago

Dear all

In this issue we agreed a proposal of Karl's three months ago and since then has been waiting for someone to write a PR. We decided on the text to add, but I don't think we decided where to put it. It seems to me that it should be appended to the first paragraph of sect 7.1. However, looking at the context, I feel we should also insert another couple of sentences to explain how many vertices one might expect there to be. That is not otherwise made clear until further down, when we describe some cases explicitly.

I suggest that we introduce a new term, because I think it helps with the new sentences, in the terminology section 1.3:

vertex dimension: The dimension of a boundary variable along which the vertices of each cell are ordered.

The first paragraph of 7.1 is currently

To represent cells we add the attribute bounds to the appropriate coordinate variable(s). The value of bounds is the name of the variable that contains the vertices of the cell boundaries. We refer to this type of variable as a "boundary variable." A boundary variable must have one more dimension than its associated coordinate or auxiliary coordinate variable. The additional dimension must be the most rapidly varying one, and its size is the maximum number of cell vertices.

In the above, I propose we change should to must (the two bold words), because these are requirements (as the conformance document indicates).

I propose we append the following to this paragraph:

We refer to the additional dimension as the "vertex dimension". For one-dimensional coordinate variables, the size of the vertex dimension must be two. For multidimensional coordinate variables, its size must be greater than two. For grids constructed from cells that do not all have the same number of sides (e.g., a grid with some rectangular cells and some triangular cells), the boundary variable must be dimensioned to accommodate the maximum number of cell vertices. For cells with fewer than the maximum number of vertices, the unneeded elements should appear as the last elements in the vertex dimension and should be assigned the _FillValue.

In the above, the three bold sentences are what I've added. The other bold bits are insertions or substitutions in Karl's text.

In your final sentence, Karl, there are two "should"s. Are they recommendations or requirements? In either case, we could add them to the conformance document.

In the conformance document, we have

A boundary variable must have the same dimensions as its associated variable, plus have a trailing dimension (CDL order) for the maximum number of vertices in a cell.

I think that's unaffected by the new convention, but I propose to append the folllowing to it:

The trailing dimension must be of size two if the associated variable is one-dimensional, and of size greater than two if the associated variable has more than one dimension.

I have prepared PR 521 to implement the above changes. I realise this goes a bit beyond what we agreed, but I hope not to the extent that we should revisit the decision. I believe my additions and changes are clarification to the proposed text.

Best wishes

Jonathan

taylor13 commented 5 months ago

Thanks, Jonathan, for initiating the pull request. Regarding

In your final sentence, Karl, there are two "should"s. Are they recommendations or requirements? In either 
case, we could add them to the conformance document.  

I suppose if we say "must", that would be less backward compatible. For future datasets it sure would be good for everyone to handle this as we've suggested, so maybe we should change this to "must", even though older datasets may be non-compliant.

What do others think?

ethanrd commented 5 months ago

Hi Jonathan @JonathanGregory,

You suggested the following text:

For one-dimensional coordinate variables, the size of the vertex dimension must be two. For multidimensional coordinate variables, its size must be greater than two.

Is there a reason not to specify that the size of the vertex dimension for an n-dimensional coordinate variable has to be n+1? Also, could the "its" in the second sentence be expanded to "the size of the vertex dimension? I don't think it's confusing what the "its" refers to but for specifications it seems useful to be explicit.

Later you suggest

The trailing dimension must be of size two if the associated variable is one-dimensional, and of size greater than two if the associated variable has more than one dimension.

I have a similar question as above on specifying a size of n+1.

JonathanGregory commented 5 months ago

Dear Ethan

I wondered also about repeating the phrase to be explicit, but it sounded cumbersome! We do have the conformance document as well, just in case there is any doubt. What do you think of

If the associated variable is one-dimensional, the size of the vertex dimension must be two; if multidimensional, greater than two.

That has even less repetition, and to me it seems unambiguous. If "it" is omitted, you can't worry about what "it" refers to! It's like a function definition, supplying only the arguments on the second call.

At present, the convention for boundary variables only describes one- and two-dimensional (in a spatial sense) cells. Two-dimensional boundaries are faces, which must have more than two vertices. There's a later part of the text in sect 7.1 which I've worried about before:

Bounds for multi-dimensional coordinate variables with p-sided cells

In all other cases, the bounds should be dimensioned (...,n,p), where (...​,n) are the dimensions of the auxiliary coordinate variables, and p the number of vertices of the cells. The vertices must be traversed anticlockwise in the lon-lat plane as viewed from above. The starting vertex is not specified.

Since it talks about traversing the vertices clockwise, I think it must still be concerned with faces. For instance, it's not talking about the eight vertices of a cube which isn't aligned with the coordinate axes, where each of the vertices needs its own 3D coordinates. Multidimensional cells have two-dimensional faces. In a 3D grid of tilted cube cells, each cell has three indices (i,j,k). Each of its six faces could be a cell of the grid, with 3D coordinates x(i,j,k,v), y(i,j,k,v) and z(i,j,k,v), where v is the vertex index. They have the form (...,n,p) as mentioned in the text quoted. But that's not enough indices for this situation, because it doesn't tell us which of the faces of cell (i,j,k) we mean.

I suspect that this part of the text is flawed, and we should remove it if so. For bounds in more than two dimensions, this convention isn't inadequate. UGRID could be used. But perhaps I have misunderstood the idea?

If I'm right, it should be dealt with in a separate issue. The only connection with the present issue is in the text I inserted. Perhaps boundary variables should be disallowed for auxiliary coordinate variables of more than two dimensions, and the statement above should instead be

If the associated variable is one-dimensional, the size of the vertex dimension must be two; if two-dimensional, greater than two.

Best wishes

Jonathan

JonathanGregory commented 4 months ago

Dear all

We got stuck with this five weeks ago. Ethan suggested replacing "its" at one point to be explicit. Instead, I've adopted the same word order as in the conformance document, which avoids this difficulty, and I think it's good to be consistent anyway. I have updated the PR with this text now appearing in section 7.1 on "Cell boundaries":

A boundary variable must have one more dimension than its associated coordinate or auxiliary coordinate variable. We refer to the additional dimension as the "vertex dimension". The vertex dimension must be the most rapidly varying one, and its size is the maximum number of cell vertices. The vertex dimension must be of size two if the associated variable is one-dimensional, and of size greater than two if the associated variable has more than one dimension. For grids constructed from cells that do not all have the same number of sides (e.g., a grid with some rectangular cells and some triangular cells), the vertex dimension must be at least as large as the maximum number of cell vertices. For cells with fewer vertices than the size of vertex dimension, the unneeded elements should appear as the last elements in the vertex dimension and should be assigned the _FillValue.

You may notice that I've changed the last two sentences to allow the vertex dimension to be greater than the maximum number of cell vertices. For instance, if you have a grid with triangles and squares, you could make the vertex dimension 6 instead of 4, in case you want to add hexagons later. The alternative is to say it should be the present maximum, but that would imply either recommending or requiring boundary variables to be resized if you took a subset of the cells which didn't include any with the maximum number of vertices, for example. That strikes me as unnecessary. If we can cope with unused elements at the end of the vertex dimension, it doesn't matter how many there are. It's up to the data-writer how much they care about saving space, I would say.

I don't think this point was discussed before, and I hope it's not controversial. What do you think? If the above paragraph is acceptable, we can merge the PR to achieve what Karl proposed in this issue. Is it OK with you, Ethan @ethanrd and Karl @taylor13?

Ethan asked another question, about whether we could say coordinate variables of n dimensions should generally have boundary variables of n+1 dimensions. As I mentioned before, I suspect that the convention of boundary variables of dimension greater than two is flawed. This is a different issue from the one Karl raised, which I don't want to delay. I will therefore open another issue to address it.

Best wishes

Jonathan

davidhassell commented 4 months ago

Hello,

For cells with fewer vertices than the size of vertex dimension, the unneeded elements should appear as the last elements in the vertex dimension and should be assigned the _FillValue.

I think this ought to a "must appear as the last elements", otherwise this text is at odds with the UGRID text, and that part of the UGRID text is part of CF.

I suspect that this part of the text is flawed, and we should remove it if so. For bounds in more than two dimensions, this convention isn't inadequate. UGRID could be used. But perhaps I have misunderstood the idea?

UGRID is also uncertain here (as concluded by issue https://github.com/ugrid-conventions/ugrid-conventions/issues/53), which was why we excluded UGRID 3-d volumes from incorporation into CF.

JonathanGregory commented 4 months ago

For cells with fewer vertices than the size of vertex dimension, the unneeded elements should appear as the last elements in the vertex dimension and should be assigned the _FillValue.

I think this ought to a "must appear as the last elements", otherwise this text is at odds with the UGRID text, and that part of the UGRID text is part of CF.

OK, @davidhassell. Karl thought "must" might be the correct choice, and you have provided a reason. If this convention hasn't been allowed before, I think that's OK even for our generous interpretation of backward compatibility. I have changed that sentence from "should" to "must". It must be _FillValue because we don't suggest any other way of identifying unused elements. Consequently I have inserted this new requirement in the conformance document:

Any elements of the boundary variable which contain the FillValue must form a consecutive block at the end of the trailing dimension.

I have updated the PR. Is it OK, Karl @taylor13 and @ethanrd?

I suspect that this part of the text is flawed, and we should remove it if so. For bounds in more than two dimensions, this convention isn't inadequate. UGRID could be used. But perhaps I have misunderstood the idea?

UGRID is also uncertain here (as concluded by issue ugrid-conventions/ugrid-conventions#53), which was why we excluded UGRID 3-d volumes from incorporation into CF.

Whether or not UGRID can handle it, the convention is unclear, I think. I have opened #527 about this.

ethanrd commented 4 months ago

@JonathanGregory - The PR looks good to me. Thanks.

ChrisBarker-NOAA commented 4 months ago

All looks good to me, but a related question:

"The vertex dimension must be the most rapidly varying one"

I'm pretty sure similar language is present elsewhere in CF, but this seems odd to me -- Fortran and C use different data storage conventions, so that the "most rapidly varying one" is different.

But is it always the same in netCDF (i.e. the last)? -- if so, shouldn't we just say that? If not, how is a less savvy user (or a compliance checker) to know what the most rapidly varying one is?

If this is well documented somewhere, great, but I see the term three places in CF, and none of them define what it means, though there is this:

"... a variable of type string with n dimensions, or as a variable of type char with n+1 dimensions where the last (most rapidly varying)..."

which implies that the last is indeed always the most rapidly varying -- so maybe we should use that language everywhere?

JonathanGregory commented 4 months ago

Dear @ChrisBarker-NOAA

Thanks. I understand "most rapidly varying" index to mean the one which varies by 1 for the addresses of adjacent locations in storage, i.e. the first index in Fortran, the last in C and CDL. The phrase was probably used in order not to trip people up e.g. by saying "last" if they weren't aware that CDL is like C. But I can see that it could be confusing, and we should maybe change or define it, or both.

It would be very helpful if you or someone else could open a defect issue about this, so we can discuss it separately from the present issue. Then we can agree this issue with the present wording, and change it afterwards when we've thought about the problem you raise.

Best wishes

Jonathan

taylor13 commented 4 months ago

Looks good to me, Jonathan. Thanks!

We could anticipate the outcome of the issue just raised by Chris by changing

The vertex dimension must be the most rapidly varying one, and its size is the maximum number of cell vertices.

to read something like

The vertex dimension must be the most rapidly varying one (i.e., the first variable dimension 
when coding in C), and its size is the maximum number of cell vertices.

but I leave that up to you.

ChrisBarker-NOAA commented 4 months ago

wait:

"The vertex dimension must be the most rapidly varying one (i.e., the first variable dimension when coding in C), and its size is the maximum number of cell vertices."

Isn't it the last when coding in C? ( I know I have to think carefully about it each time ...) -- anyway, that's not relevant anyway, as regardless of how you code it, it should always be the same in the netcdf, yes?

I've opened a defect issue (#530), as Jonathan suggested, though having looked more, there's maybe little to do.

For this topic, I suggest:

The vertex dimension must be the last (most rapidly varying) dimension, and its size is the maximum number of cell vertices.

That would be consistent with the text describing character arrays.

taylor13 commented 4 months ago

Yes, of course. Brain fog set in early today. And your wording is correct.

JonathanGregory commented 4 months ago

As I've suggested in #530, I have updated the text in the PR #521 to read "The vertex dimension must be the last dimension in CDL order (the most rapidly varying dimension)". OK?

ChrisBarker-NOAA commented 3 months ago

LGTM.

JonathanGregory commented 3 months ago

Three weeks have passed with no further concerns. Enough support was expressed according to the rules for this to be accepted. Therefore I will merge the PR.