"mesh variable" instead of "boundary variable" for contiguous grid cells

rabernat commented 5 years ago

I work every day with ocean models that use orthogonal curvilinear coordinates (MITgcm, MOM, POP, ROMS, NEMO, etc. etc.). This is an example tripolar grid from CESM:

The grid cells in such models are contiguous quads, with four points specifying the lat / lon vertex locations of each cell. CF conventions tell me (Section 7.1: Cell Boundaries) that I should use a boundary_variable.

A boundary variable will have one more dimension than its associated coordinate or auxiliary coordinate variable.

In the case where the horizontal grid is described by two-dimensional auxiliary coordinate variables in latitude lat(n,m) and longitude lon(n,m), and the associated cells are four-sided, then the boundary variables are given in the form latbnd(n,m,4) and lonbnd(n,m,4), where the trailing index runs over the four vertices of the cells

This convention is general enough to accommodate potentially overlapping or non-contiguous quads, essentially n x m totally unrelated four-sided shapes.

My main point: It's inefficient to store structured grid geometry this way.

The bounds can be used to decide whether cells are contiguous via the following relationships...

I don't want to have to check this, I want the conventions to tell me. In our latest global high-resolution ocean models, I have a mesh that is of size n=12960, m=17280, 223 million cells. I am interested in streamlining my analysis and visualization workflow as much as possible, which means minimizing the required memory and computational steps.

Instead of specifying a boundary variable, I propose to introduce the concept of a mesh variable, with the following conventions:

A mesh variable will have the same number of dimensions as its associated coordinate or auxiliary coordinate variable, but with one extra element in each dimension.
In the case where the horizontal grid is described by two-dimensional auxiliary coordinate variables in latitude lat(n,m) and longitude lon(n,m), and the associated cells are four-sided and contiguous, then the mesh variables are given in the form latmesh(n+1, m+1) and lonmesh(n+1, m+1).

It would not be hard to generate such data, since this is how most GCMs keep track of their own coordinate grids internally (e.g. MITgcm). This convention also aligns well with how most visualization software plots such data, e.g. matplotlib's pcolormesh function. So adding something like this to the CF conventions would streamline the path from model output to plotting, eliminating the potentially error-prone step of encoding, and then decoding, the "boundary variable" type coordinates. For the dataset I described above, the difference is about 3 GB of memory.

I don't feel strongly about what it's called. Maybe "mesh variable" is not the right choice. But I feel something like this is sorely needed.

cc @adcroft & @StephenGriffies, with whom this topic has come up repeatedly.

adcroft commented 5 years ago

@rabernat makes very good points and I agree with everything said.

The concept of a space-filling and contiguous mesh is fundamental to many finite-volume codes and I find it baffling that a more efficient description has not already replaced the current description.

The names of the mesh coordinates do not need to be fixed. They can be named in an attribute, e.g. :mesh, attached to the variable. Stylizing on the :coordinates attribute from 5.2. Two-Dimensional Latitude, Longitude, Coordinate Variables an example could look something like this:

dimensions:
  i = 128 ;
  j = 64 ;
  k = 18 ;
  im = 129 ;
  jm = 65 ;
  km = 19 ;
variables:
  float thetao(k,j,i) ;
    thetao:long_name = "seawater_potential_temperature" ;
    thetao:units = "C" ;
    thetao:coordinates = "lon lat z" ;
    thetao:mesh = "meshlon meshlat zi" ;
  float lon(j,i) ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
  float lat(j,i) ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north" ;
  float z(k) ;
    z:long_name = "elevation" ;
    z:units = "m" ;
    z:positive = "up" ;
  float meshlon(jm,im) ;
    meshlon:long_name = "longitude" ;
    meshlon:units = "degrees_east" ;
  float meshlat(jm,im) ;
    meshlat:long_name = "latitude" ;
    meshlat:units = "degrees_north" ;
  float zi(km) ;
    zi:long_name = "elevation" ;
    zi:units = "m" ;
    zi:positive = "up" ;

Several notes:

I threw away the unused 1D coordinate variables for brevity.
I'm assuming the coordinates attribute can be used for more than just the horizontal coordinates longitude-latitude. I think that's legitimate but didn't find an example in a quick scan.
:mesh_coordinates might be a better name for the attribute but I like shorter names myself.
There should probably be a global attribute specifying the version of the standard being used in the file.

At the risk of derailing the thread I'll mention that we can arrive at an even more succinct description if we discard the idea of data coordinates which I think is unnecessary when providing the finite volume mesh information:

dimensions:
  i = 128 ;
  j = 64 ;
  k = 18 ;
  im = 129 ;
  jm = 65 ;
  km = 19 ;
variables:
  float thetao(k,j,i) ;
    thetao:long_name = "seawater_potential_temperature" ;
    thetao:units = "C" ;
    thetao:mesh = "meshlon meshlat zi" ;
  float meshlon(jm,im) ;
    meshlon:long_name = "longitude" ;
    meshlon:units = "degrees_east" ;
  float meshlat(jm,im) ;
    meshlat:long_name = "latitude" ;
    meshlat:units = "degrees_north" ;
  float zi(km) ;
    zi:long_name = "elevation" ;
    zi:units = "m" ;
    zi:positive = "up" ;

I believe the latter example to be a more appropriate description of the data. However, I realize that the latter is not backward compatible since there is an expectation that there are coordinates and will cause trouble if they are missing. So I will be pleased enough if we can at least adopt an attribute that identifies mesh coordinates as in the earlier example.

rabernat commented 5 years ago

I was hoping for some sort of response or discussion from the maintainers. Have we raised this issue in the wrong place? Would it be better off in the main cf-conventions repo issue tracker (rather than the discussion repo)?

adcroft commented 5 years ago

I don't know - I'm somewhat lost on where and what things are. It does seem that https://github.com/cf-convention/cf-conventions/issues is both active and has proposals being discussed.

feggleton commented 5 years ago

Hi, I maintain the standard name table and I am under the impression that https://github.com/cf-convention/cf-conventions/issues is for general cf convention discussions and https://github.com/cf-convention/discuss/issues is for standard name requests and discussions. I shall ask one of the other maintainers to confirm too. Apologies for the confusion.

rabernat commented 5 years ago

The readme for this repo says the following:

A forum for proposing standard names; and any discussion about interpretation, clarification, and proposals for changes or extensions to the CF conventions.

https://github.com/cf-convention/discuss/blob/master/README.md

feggleton commented 5 years ago

I must be wrong. I will leave this to one of the other maintainers to respond to. Hopefully someone responds to your discussion soon.

JonathanGregory commented 5 years ago

This is the correct repo for discussion about possible changes to conventions, if you're not at the stage of proposing a change. It takes over from the UCAR CF email list, which is now mothballed. When you have a specific proposal to make, it should be done as an issue in the conventions repo, which has taken over from the old CF trac tickets.

rabernat commented 5 years ago

Thanks @JonathanGregory for the clarification! 😄

I opened this issue specifically for discussion. We are looking for feedback from the CF community about this idea.

If no one has any feedback, I suppose we will just move forward with making a proposal over in the cf-conventions repo.

JonathanGregory commented 5 years ago

Let's see. Not everyone will have seen this yet, since they may not be watching this repo, as it has just been introduced. I have to confess I am one of those! I hadn't seen this question before just now and therefore haven't yet thought about it.

rabernat commented 5 years ago

Sure, no rush! We'll leave this discussion open as long as needed.

JimBiardCics commented 5 years ago

@rabernat @adcroft I thought I would drop in some background. The existing CF coordinate conventions were designed to provide maximum flexibility for a wide variety of use cases. Coordinate tuples can describe points or cells. Cells can have arbitrary shapes, can be non-contiguous, and can overlap. If the coordinate variables are not 1D, there's no requirement that the coordinates describe a mesh grid of any sort. There is no prescribed relationship between a given coordinate tuple for a cell — (lat,lon) for example, and its relative location within the cell it is referencing. (There is a loose assumption that a coordinate tuple without any bounds describes the center of a cell, but that is not required or guaranteed.)

This flexibility comes with a cost in the form of the complexity (and bulkiness) of the representation. I think CF has done a pretty reasonable job of this overall.

JimBiardCics commented 5 years ago

@rabernat @adcroft There have been a number of times when people have requested a simplified representation of coordinates when there is a straightforward mesh grid where cells are contiguous and non-overlapping. This amounts to representing the grid cells by specifying the vertices of the cells. This can work whether the coordinate variables are 2D (as in your example), 1D, or N-D.

Such a scheme is quite compact, but it does represent a significant departure from previous assumptions. Software (and humans) that don't notice that a different scheme is in use will have serious problems. The coordinates attribute is the appropriate way to define the relationship between the mesh coordinate variables and a data variable.

The mesh coordinate variables you are suggesting would not be recognized as proper auxiliary coordinates by existing CF-compliant software because the dimensions don't match the dimensions of the data variable. This isn't a problem. I'm just stating the facts. I expect that most software would either check the dimensions and throw an error because of the mismatch or attempt to proceed naïvely and have serious problems.

I think there needs to be an unambiguous marker designating a coordinate variable as a mesh coordinate variable. That way software does is not forced to rely on dimensions being +1 in size as the only marker. The two ways I see to do that is to declare the relationship with a special mesh_coordinates attribute on the data variable, or by putting a coordinate_type attribute on each coordinate variable. This attribute would declare a coordinate variable as being of type 'cell point' or 'cell vertex'. (Your names may vary. This is a suggestion.) For software aware of the new convention, the combination of dimensions of size +1 and a 'cell vertex' type would identify a coordinate variable as being a valid mesh coordinate. The default assumption is there is no coordinate_type attribute is that a coordinate variable is a classic 'cell point' type coordinate.

I think there is a valid use case for such a representation. It represents enough of a departure from existing CF that quite a bit of careful thinking will need to go into it.

davidhassell commented 4 years ago

Some general comments:

I agree with @JimBiardCics that the current flexibility comes at the expense of bulkiness, but rather than with the cost of complexity, I think that it comes with the benefit of simplicity - one methodology for any type of cells and trivial lookup of individual cell bounds.

We should always be wary of introducing a different way of doing something that can already be done, because it makes the conventions harder to understand and it makes software harder to write. However, we must also consider whether or not the potential benefits outweigh these concerns, and CF does have a history of introducing complexity to deal with ever increasing file sizes (e.g. external variables, DSGs).

That said, and given the need, I might prefer an example more like this:

dimensions:
  i = 128 ;
  j = 64 ;
  k = 18 ;
  im = 129 ;
  jm = 65 ;
  km = 19 ;
variables:
  float thetao(k,j,i) ;
    thetao:long_name = "seawater_potential_temperature" ;
    thetao:units = "degreesC" ;
    thetao:coordinates = "lon lat z" ;
  float lon(j,i) ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
    lon:mesh = "meshlon" ;
  float lat(j,i) ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north"
    lat:mesh = "meshlat" ;
  float z(k) ;
    z:long_name = "elevation" ;
    z:units = "m" ;
    z:positive = "up" ;
    z:mesh = "zi" ;
  float meshlon(jm,im) ; // a bounds variable shouldn't have attributes that copy those of the parent coordinate variable
  float meshlat(jm,im) ;
  float zi(km) ;

Here, software that doesn't understand meshes will still read the file without crashing. It be able to parse the data variable and all of its metadata with the only exception of the mesh bounds, that it will just ignore. The key to this is linking the bounds to the coordinates with the mesh attribute (or some other name), rather than the usual bounds attribute, so that old software will not try and do anything clever with its value - like find and interpret the referenced variable. (This is what simple geometries does. It uses nodes instead of bounds to make it clear the bounds are not "normal".) The presence of the mesh attribute on a coordinate variable signifies to new software which sizes of dimensions to expect in the referenced variable, and how to deal with them.

Also, the CF checker could easily only check meshes at the appropriate versions.

I agree that omitting the coordinate value and just storing the mesh vertices is a distraction at this stage (see https://github.com/cf-convention/discuss/issues/5#issuecomment-558366964). That would require a change to the CF data model, which is otherwise not needed.

I was a bit concerned that the mesh is fragile, in the sense that it might cease to be a mesh under certain processing or analytical operations (e.g. subspacing). But I'm thinking now that this is not a concern of the CF conventions.

rabernat commented 4 years ago

Thanks everyone for the useful feedback! 😃

A couple of key points emerged:

We definitely don't want this new concept of mesh to conflict with existing CF conventions; if anything is added, it should be 100% backwards compatible
Mesh variables cannot be coordinate variables because they don't share dimensions with the data variables

These points support @davidhassell's suggestion that mesh should simply be a new CF concept, on top of the existing ones. I like the suggestion to link to the mesh through the coordinates, which is efficient in terms of not repeating metadata. My only concern is if some datasets don't actually have coordinate variables defined but DO have a mesh defined. But I suppose that would be pretty rare.

This flexibility comes with a cost in the form of the complexity (and bulkiness) of the representation. I think CF has done a pretty reasonable job of this overall.

@JimBiardCics - thanks for this background. I agree that CF bounds in its current form does a great job at covering all possible use cases, as you described. My suggestion here is not a criticism. My concern is that, as we move into an era of global eddy-resolving / cloud-resolving climate models, the bulkiness in representing meshes using bounds will become unacceptable. (I think we are already at that point.) Modeling centers are already distributing data with these sorts of mesh variables, but without an accepted convention. It would be best for CF to accept that this is going to happen and define a reasonable convention around it.

davidhassell commented 4 years ago

A technical issue occurs to me - that of mapping the N-d mesh vertex variable dimensions to those of the parent coordinate variable (or data variable). This can not be done by dimension name, and can not rely on dimensions sizes being different - they might not be. You could state that the dimension order of a mesh variable must match that of the parent, but this is not checkable and so is far from ideal.

Here a a couple of examples to demonstrate the difficulty:

dimensions:
  i = 128 ;
  j = 128 ;
 dim1 = 129 ;
 dim2 = 129 ;
variables:
  float thetao(j,i) ;
    thetao:long_name = "seawater_potential_temperature" ;
    thetao:units = "degreesC" ;
    thetao:coordinates = "lon lat" ;
  float lon(j,i) ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
    lon:mesh = "meshlon" ;
  float lat(j,i) ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north"
    lat:mesh = "meshlat" ;
  float meshlon(dim1, dim2) ; 
  float meshlat(dim2, dim1) ;

  float meshlon(dim1, dim2) ; 
  float meshlat(dim1, dim2) ;

In both cases I don't know what the mesh variables dimension orders are, relative to i and j.

There's probably an elegant way round this, but I can't think right now what it is ...

neumannd commented 4 years ago

@davidhassell We could force that the order of dimensions of the mesh variables equals the order of the dimensions of the coordinate variables, and trust the users to comply with it.

Additionally, we could create a mapping from i to dim1 and j to dim2 (or the other way around) as follows:

dimensions:
  i = 128 ;
  j = 128 ;
 dim1 = 129 ;
 dim2 = 129 ;
variables:
  float thetao(j,i) ;
    thetao:long_name = "seawater_potential_temperature" ;
    thetao:units = "degreesC" ;
    thetao:coordinates = "lon lat" ;
  int i(i) ;
    i:long_name = "index of i coordinate" ;
    i:units = "1" ;
    i:mesh = "dim1" ;
  int j(j) ;
    j:long_name = "index of j coordinate" ;
    j:units = "1" ;
    j:mesh = "dim2" ;
  float lon(j,i) ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
    lon:mesh = "meshlon" ;
  float lat(j,i) ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north"
    lat:mesh = "meshlat" ;
  float meshlon(dim2, dim1) ; 
  float meshlat(dim2, dim1) ;

adcroft commented 4 years ago

  float meshlon(dim1, dim2) ; 
  float meshlat(dim2, dim1) ;

I hadn't realized that changing the order of indices was allowed. You're saying it's not in the standard. Is this an example where the dimension variable resolves the ambiguity? e.g. adding the dim1 dimension variable with long_name of "index of i coordinate" dictates that the dim1 dimension is in the i-direction.

adcroft commented 4 years ago

float lat(j,i) ; lat:long_name = "latitude" ; lat:units = "degrees_north" lat:mesh = "meshlat" ;

Putting the mesh attribute on the coordinate variables requires we have to delve through a level of indirection to find the mesh for a variable. For a finite volume model the mesh coordinates are more primary to the variable than the coordinate of a notional cell center so I prefer for it do be directly associated and at the same level of the coordinate attribute.

Attaching the mesh attribute to the data variable is backward compatible. Was there a reason to not add such an attribute to the data variable? This could still live besides a bounds approach (even though it would be redundant) without conflict.

I think the following resolves

the technical issue that @davidhassell raised (by putting back the dimension variable);
is backward compatible;

and provides the information that @rabernat and I are hoping to provide without inefficiency.

dimensions:
i = 128 ;
j = 128 ;
dim1 = 129 ;
dim2 = 129 ;
variables:
float thetao(j,i) ;
thetao:long_name = "seawater_potential_temperature" ;
thetao:units = "degreesC" ;
thetao:coordinates = "lon lat" ;
thetao:mesh_coordinates = "meshlon meshlat";
int i(i) ;
i:long_name = "index of i coordinate" ;
i:units = "1" ;
int j(j) ;
j:long_name = "index of j coordinate" ;
j:units = "1" ;
float lon(j,i) ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
float lat(j,i) ;
lat:long_name = "latitude" ;
lat:units = "degrees_north"
int dim1(dim1) ;
dim1:long_name = "index of i coordinate" ;
dim1:units = "1" ;
int dim2(dim2) ;
dim2:long_name = "index of j coordinate" ;
dim2:units = "1" ;
float meshlon(dim2, dim1) ; 
meshlon:long_name = "longitude" ;
meshlon:units = "degrees_east" ;
float meshlat(dim2, dim1) ;
meshlat:long_name = "latitude" ;
meshlat:units = "degrees_north";

rabernat commented 4 years ago

I really like it @adcroft! Doing it that way also allows for the common case where a variable might be a mesh coordinate for one variable (cell centered) and a regular coordinate variable for another (cell vertices). For example

dimensions:
  i = 128 ;
  j = 128 ;
 dim1 = 129 ;
 dim2 = 129 ;
variables:
  float thetao(j,i) ;
    thetao:long_name = "seawater_potential_temperature" ;
    thetao:units = "degreesC" ;
    thetao:coordinates = "lon lat" ;
    thetao:mesh_coordinates = "meshlon meshlat";
  float vort(dim1,dim2) ;
    thetao:long_name = "seawater_vorticity" ;
    thetao:coordinates = "meshlon meshlat" ;

with the rest the same as your example.

adcroft commented 4 years ago

That's a much stronger argument than my complaint about the indirection.

davidhassell commented 4 years ago

Thanks for considering my example, but I'm afraid that I don't see how the latest example (https://github.com/cf-convention/discuss/issues/5#issuecomment-561320471) works for all cases. A few concerns are: What connects dim1 to i, and dim2 to j? The mention of "i" or "j" in the long_name is unfortunately not sufficient. Also, meshlat is only connected to lat by attributes which may not always be there, or if they are there they may not differentiate between variables. What if other auxiliary coordinate variables are listed by the coordinate attribute?

In general, it is best to avoid redundancy, as it increases the chances of inconsistency.

It seems to me that the discussion might be veering away from how to save space when storing the bounds of contiguous cells, to something more general about how to encode particular types of grid (which reminds me of UGRID). That's fine, but it's hard to talk about both at the same time, I think.

I hope you don't perceive me as being too negative - I'm just trying to searching for a solution that is as robust as other part of the CF conventions!

taylor13 commented 4 years ago

I'd like to raise another possible issue: I'm pretty sure (but not positive) that CF allows one to have simple "index" dimensions where the index values start at 0 and are consecutive. For example, one would usually store a variable defined on an icosahedral "cube-sphere" grid as a function of three indices. I don't think CF requires the vectors of index values (for each of the 3 dimensions) be stored explicitly in the file. The longitude and latitude coordinate values for each grid cell would be pointed to by the coordinates attribute attached to the variable, and these coordinates would also be a functions of the 3 indices.

If this is so, then how would this be impacted by the ideas being discussed here? My reading of some of the examples above is that we would now require the index values be explicitly included in the file.

adcroft commented 4 years ago

I'm also very much in favor of avoiding redundancy. At the very least, redundancy permits inconsistency.

What connects dim1 to i, and dim2 to j?

If netcdf allowed us to declare data with shape (i+1) we would not need dim1 but unfortunately we do need it. The relationship is that the logical dimensions dim1 and i are respresenting the same thing with different values. Because dimensions have no attributes we have no way to indicate that they are related. Hence I suggested the use of a dimension variable (a 1D coordinate variable with the same name as the dimension). I've just scanned the CF documentation and now see that dimension variables are more a pattern of common usage and not in the standard. Correct? @taylor13 just responded to that effect while I was typing. So I'll abandon the suggestion of using the dimension variable to make the association of dimensions.

The order of dimensions is currently not restricted in CF (second paragraph on dimensions). @neumannd suggests we require that the order of dimensions be the same for related variables (i.e. coordinates, data, and mesh coordinates). The text on two-dimensional coordinate variables has the following line:

`T(k,j,i)` is associated with the coordinate values `lon(j,i)`, `lat(j,i)`, and `lev(k)`

which shows the same ordering of dimensions but doesn't state that they should be in the same order. I would be very surprised if different ordering is ever actually used in files, or even considered in much software. I think restricting the dimension order to be the same for related data does solve the association problem raised by @davidhassell. So I like @neumannd's proposal which happens to also render a more succinct description:

dimensions:
  i = 128 ;
  j = 64 ;
 mi = 129 ;
 mj = 65 ;
variables:
  float ssh(j,i) ;
    thetao:long_name = "sea surface height" ;
    thetao:units = "m" ;
    thetao:coordinates = "lon lat" ;
    thetao:mesh_coordinates = "meshlon meshlat";
  float lon(j,i) ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
  float lat(j,i) ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north"
  float meshlon(mj, mi) ; 
    meshlon:long_name = "longitude" ;
    meshlon:units = "degrees_east" ;
  float meshlat(mj, mi) ;
    meshlat:long_name = "latitude" ;
    meshlat:units = "degrees_north" ;

davidhassell commented 4 years ago

@taylor13 I don't think that the index value functionality is (yet) in CF. Please correct me if needs be! Is it part of the Gridspec standard (which I'm not too familiar with, but thought it provides a means for storing cubed-sphere grids)?

@adcroft You're right that the order of dimensions has to be the same between an auxiliary coordinate variable and its bounds variable (I missed this earlier - sorry), but the this order does not have to correspond to the order of dimensions of the data variable.

The order of variable names given by the coordinates attribute is arbitrary, so I still have the issue in the last example CDL, that the only way of connecting the auxiliary coordinate variable (e.g. lon) to its bounds variable (e.g. meshlon) is via those variables' descriptive attributes. This is fragile.

Given that a bounds variable dimensions must be ordered like its parent coordinate variable. This could be extended to "mesh" bounds variables, and allow, as I suggested earlier,

  float lon(j,i) ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
    lon:mesh = "meshlon" ;
  float meshlon(mj, mi) ;

where it would now be given that mj <-> j and mi <-> i. However, unlike with "normal" bounds, this is not always check-able. Is that a problem?

taylor13 commented 4 years ago

@davidhassell : In the CF conventions document, we don't explicitly say anything about using pure index dimensions, but we do say:

The use of coordinate variables is required for all dimensions that correspond to 
one dimensional space or time coordinates.

Here the term "coordinate variables" refers to a variable defined according to the NUG definition

A variable with the same name as a dimension is called a coordinate variable. It typically 
defines a physical coordinate corresponding to that dimension. ...

A position along a dimension can be specified using an index. This is an integer with a 
minimum value of 0 for C programs, 1 in Fortran programs. ...

If a dimension has a corresponding coordinate variable, then this provides an alternative, 
and often more convenient, means of specifying position along it. Current application 
packages that make use of coordinate variables commonly assume they are numeric 
vectors and strictly monotonic (all values are different and either increasing or decreasing).

The last paragraph implies that not all dimensions have a corresponding coordinate variable, and I'll refer to these as "pure index dimensions".

The CF-convention uses the coordinates attribute to define coordinates when a coordinate variable cannot be defined (e.g., for an unstructured grid that is stored as a 1-dimensional array). My reading of the convention is that there is no need to explicitly define the index values for this case; the "pure index dimension" implies the index values as 0, 1, 2, ... N. Thus, as an example of a possible unstructured grid, the following would be correct:

dimensions:
  ij = 8192 ;
variables:
  float ps(ij) ;
    ps:long_name = "surface pressure" ;
    ps:units = "Pa" ;
    ps:coordinates = "lon lat" ;
  float lon(ij) ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
  float lat(ij) ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north" ;

Note that there is no INT ij(ij) in the file.

Similarly for a cubed-sphere grid logically structured as a 3 dimensional array, the 3 dimensions of a variable would all be "pure index dimensions", and those indexes wouldn't have to appear as (NUG) coordinate variables in the file.

The reason I brought this up is that in https://github.com/cf-convention/discuss/issues/5#issuecomment-561320471 above, i, j, dim1, and dim2 were all explicitly defined as coordinate variables when I don't think this should be necessary. It appears that the suggested approach requiring definition of coordinate variables i, j, dim1, and dim2 has been dropped, so I don't think we have a problem unless what I've said here is somehow incorrect (in which case some clarification is needed).

adcroft commented 4 years ago

@taylor13 We were dropping using coordinate variables because they are not normally required, as you just confirmed.

@davidhassell

I still have the issue in the last example CDL, that the only way of connecting the auxiliary coordinate variable (e.g. lon) to its bounds variable (e.g. meshlon) is via those variables' descriptive attributes. This is fragile.

If I renamed lon to fred, the only way we know that fred is longitude is through the appropriate attributes. Surely that is the only definitive way to figure out what a variable is? But I agree, it requires a lot of checking and compliance by coders so is fragile.

I see that adding the mesh attribute is analogous to the bounds attribute which is why you attach it to lon. My motivation for attaching mesh to the data variable is partly a matter of principle: finite volume data does not have a position but a volume. Oneday I would like to not have to provide made up cell center positions but I'll save my extremism for that far off oneday. :)

I am struggling to think of another solution to the dimension association problem while attaching the mesh attribute to the data. The problem is solved by @davidhassell's approach of attaching mesh to auxiliary coordinate lon. So it seems there are two paths:

Emulate bounds with a mesh attribute on auxiliary coordinate data.
Impose a restriction on dimension order to be alike across associated data, auxiliary coordinates, and mesh coordinates allowing mesh_coordinates to be attached to data variables.

One more heretical thought: there is a line in the cell boundaries section that reads "A boundary variable will have one more dimension than its associated coordinate or auxiliary coordinate variable." (it is italicized in the document for emphasis). If we allowed for a bounds variable with no extra dimension then this is exactly what the mesh attribute in path 1 is doing. Would it be simpler to just allow the extra dimension for bounds to be optional?

BTW, that NUG definition of a coordinate variable caught me off guard - it seems I've been using the wrong terminology. What is defined there as a "coordinate variable" is what I was referring to as a "dimension variable" because we have meaningful variables that are coordinates which in normal conversation one would call a "coordinate variable". I've been using these terms incorrectly for too long that I don't know where I got them from but @JimBiardCics corrects someone's terminology in a post from 2011 so I'm at least 8 years out of date!

neumannd commented 4 years ago

@taylor13 I added the definition of the dimension variables for the indices i and j (in the comment you refered to) in my comment further above to create a clear mapping between the dimensions i/j and dim1/dim2 via a mesh attribute attached to i and j. That's why these variables appear in later comments.

Chilipp commented 4 years ago

Hey @rabernat, @adcroft, @davidhassell et al., I'd like to add another point: 3D meshes on a 2D curvilinear grid? We use something like this within the COSMO-CLM model. To be more explicit: Consider a 3D variable T on a circumpolar grid with the new mesh attribute

dimensions:
    rlon = 101 ;
    rlat = 111 ;
    rlon1 = 102 ;
    rlat1 = 112 ;
    level = 40 ;
variables:
    float T(level, rlat, rlon) ;
        T:coordinates = "lon lat" ;
    float lat(rlat, rlon) ;
        lat:mesh = "meshlat" ;
    float lon(rlat, rlon) ;
        lat:mesh = "meshlon" ;
    float meshlat(rlat1, rlon1) ;
    float meshlon(rlat1, rlon1) ;

here the mesh variables meshlon and meshlat have both the dimensions rlat1, rlon1.

Now, suppose we also want to specify the bounds of the vertical level dimension? How does this work? Should it be

    float meshlevel(level1,rlat1, rlon1) ;

with level1 = level + 1 or rather

    float meshlevel(level1,rlat, rlon) ;

I think both should be accepted, although the first would be preferable for the purpose of visualization.

Chilipp commented 4 years ago

Another point: How is it supposed to continue with this thread? I'd very much like to see this implemented in the conventions. According to the README of this repo, this issue should be raised at https://github.com/cf-convention/cf-conventions/issues after the discussion. But who defines what after the discussion exactly means? Are there any guidelines or experiences that state at which point we move this issue over to the cf-conventions repo?

Chilipp commented 4 years ago

I'd like to add another point to the discussion, that was partly mentioned by @davidhassell in https://github.com/cf-convention/discuss/issues/5#issuecomment-561755801

However, unlike with "normal" bounds, this is not always check-able. Is that a problem?

Yes, I think with the current implementation, this is a problem. Consider the following dataset

dimensions:
    i = 20 ;
    j = 20 ;
    i1 = 21 ;
    j1 = 21 ;
variables:
    double T(i, j) ;
        T:coordinates = "lat lon" ;
    double lat(i, j) ;
        lat:mesh = "meshlat" ;
    double lon(i, j) ;
        lon:mesh = "meshlon" ;
    double meshlon(i1, j1) ;
    double meshlat(i1, j1) ;

I use xarray, to illustrate my point, but I am sure the problem exists with others as well. If I would do an ds.isel(i=[1]) here, I end up with something like

dimensions:
    i = 1 ;
    j = 20 ;
    i1 = 21 ;
    j1 = 21 ;

which is definitely not what I want (i has size 1, i1 still has size 21). But how should xarray know?I don't think that the mesh attribute in some variable within the dataset is sufficient for a clear and well-defined description. Particularly as the lon variable here is strictly speaking independent i and j. What, for example if I'd have another variable in there like

    double lon2(j) ;
        lon:mesh = "meshlon2" ;
    double meshlon2(j2) ;

is now j2 equivalent to j1? should both dimensions be sliced?

One possibility might be, to add an attribute to an i coordinate variable, i.e. in the sense of

    double i(i) ;
        lon:mesh_dimension = "i1" ;

But I think this is not a good solution as this makes the dimension i dependent on another variable, and these coordinate variables i are often not desired (such as in this case) as the spatial information is in lat and lon

ChrisBarker-NOAA commented 4 years ago

Sorry Folks, I wasn't following this at the time, and I haven't carefully read the whole thing, but:

My first thought was: WTF? we've been using curvilinear quad meshes in CF forever! But I suppose the issue is that we were making a lot of assumptions about cells, rather then having teh well specified.

So thought 2: Isn't this EXACTLY what SGRID is about? http://sgrid.github.io/sgrid/

And while we are at it, unstructured grids should be accommodated as well: http://ugrid-conventions.github.io/ugrid-conventions/ (I did see that briefly mentioned in this issue)

Both of those have been designed as extensions to CF, and are in use, at least a little bit.

Anyway, let's not reinvent the wheel here.

NOTE: those aren't so good at dealing with meshes that wrap around the earth -- but they could be extended for that.

rabernat commented 4 years ago

Good points Chris. The issue is sgrid doesn't have nearly the same level of visibility / adoption as CF itself.

Both of those have been designed as extensions to CF, and are in use, at least a little bit.

Is there a formal mechanism for "extensions to CF"? If so, is there a directory of such extensions? And what defines the scope of extensions vs. the main spec?

ChrisBarker-NOAA commented 4 years ago

BTW: there have been proposals to make these auxiliary standards "official"

https://github.com/cf-convention/cf-conventions/issues/153

ChrisBarker-NOAA commented 4 years ago

"Is there a formal mechanism for "extensions to CF"

Well, I was using it in an informal way -- which means that they were designed to be able to be used without conflicting with CF. And I think it is "official" that CF can be used in combination with other specs:

// global attributes:
        :Conventions = "CF-1.0, SGRID-1.1" ;

And yes, CF does have much more recognition -- so it would be a great idea for CF to "bless" some of these other conventions. I'm just saying that we really don't want to start all over again with the design.

davidhassell commented 4 years ago

Hello @rabernat,

It'd be very useful if you briefly summarize this issue https://docs.google.com/document/d/1urPWngzDCuHTrfpA8nedGoRDVKXs5OmjqO8M6i3UZJM/edit#, including what might be good outcomes from a discussion at the CF meeting. If this could be done today or tomorrow that would be best, as we will use it to help people decide on which sessions to attend in advance of the meeting.

Many thanks, David

rabernat commented 4 years ago

Hi @davidhassell - I went to add my summary, but the link you provided doesn't give me edit access.

rabernat commented 4 years ago

Ok done.

ChrisBarker-NOAA commented 4 years ago

I'm not sure where else to post this, but:

As far as discussing the possiuble inclusion of the SGRID standard in CF at the 2020 CF Workshop, maybe we should discuss UGRID as well.

It's not really a new topic -- there are lots of details, but a key idea is the same: a "mesh" or "grid" variable which has attributes describing how the grid is defined.

grid:cf_role = grid_topology

http://ugrid-conventions.github.io/ugrid-conventions/

So if CF IS going to include either of these, it makes sense to do it in a unified way.

JonathanGregory commented 4 years ago

Dear @ChrisBarker-NOAA As you probably know, it's already expected, or at least proposed, that ugrid will be incorporated in CF. The proposal is that the documents won't be merged, but that the versions will be linked. Please see https://github.com/cf-convention/cf-conventions/issues/153 and comment if you like. Best wishes, Jonathan

ChrisBarker-NOAA commented 4 years ago

@JonathanGregory : Thanks, I had completely lost track of the status of that.

But there are parallels here, and I think it's best that CF deal with UGRID and SGRID (and maybe other GRids in the future?) in similar ways. So we should reference the UGRID status when talking about what to do with SGRIDS.

In particular, it seem that idea in this thread is to make a "grid variable" or "mesh variable"as a standard place to put the grid mapping info part of CF itself. Then we could say that the contents of that variable are left up to auxiliary standards, such as UGRID and SGRID

-CHB

davidhassell commented 4 years ago

Hello,

The timings and order of the breakout groups for the CF meeting next week has now been set (see http://cfconventions.org/Meetings/2020-Workshop.html), and the discussion of this issue will be on Wednesday 10 June from 17:30-19:00 UTC, in parallel with three other topics.

Thanks.

rabernat commented 4 years ago

I was looking forward to the CF meeting discussion today. But recent events in the US have caused my plans to change. In response to the murders of George Floyd, Breonna Taylor, and countless other Black people, today there will be a general strike across STEM and academia (https://www.shutdownstem.com/). The idea is to pause business as usual and take action against racism. I support this issue strongly, and so I will not participate in the CF discussion today.

To make up for my absence, I have put my presentation online (https://speakerdeck.com/rabernat/cf-mesh-coordinates-proposal) and recorded a video of myself presenting it (https://vimeo.com/427709117). I know this is not the same as being there “in person,” but I hope it is enough. I will be happy to resume the discussion tomorrow via this github issue.

zklaus commented 3 years ago

A recent email exchange made me remember this issue. Has there been any progress? @rabernat, is there still an appetite to drive this forward? Perhaps it could be discussed at the next CF meeting?

ChrisBarker-NOAA commented 3 years ago

There is discussion of UGRID right now here: https://github.com/ugrid-conventions/ugrid-conventions/issues/52

Not sure the status of SGRID, but they are closely related.

rabernat commented 3 years ago

Also checking in after a long time away. Lots of discussion but no decision in https://github.com/ugrid-conventions/ugrid-conventions/issues/52. Looking at the CF convention rules, perhaps it would help the process move along if someone made a formal proposal following those rules which would then reach an up-or-down decision in a finite time. Otherwise I think we could remain in limbo for a very long time.

I would prefer to bundle SGRID and UGRID into a single proposal, but others may disagree.

JonathanGregory commented 3 years ago

Dear Ryan @rabernat

Thanks for stirring this. It looks like both UGRID issue 52 and CF issue 153 have nearly reached agreement on definite proposals, but they need summarising, separately or together, so we can see where we are. Perhaps someone can do that before the CF workshop (21-23 Sep). Is it planned to discuss UGRID in a breakout group? I do not see it mentioned in this list.

I would prefer to bundle SGRID and UGRID into a single proposal, but others may disagree.

I think we are quite likely to be able to reach agreement soon on UGRID on the basis of the discussions that have already been had, whereas we have not discussed SGRID much. I would therefore suggest that we try to conclude with UGRID before starting on SGRID.

Best wishes

Jonathan

ChrisBarker-NOAA commented 3 years ago

UGRID and SGRID have a lot in common, so it may have made sense to do them together. However, there's also sometyyhing to be said for focusing on UGRID, and then adding SGRID with the lessons learned. I'm hoping that will be an easy lift at that point :-)

cf-convention / discuss

"mesh variable" instead of "boundary variable" for contiguous grid cells #5