Closed MaartenSneepKNMI closed 10 months ago
The cf-convention issue tracker on github is intended for issues
about the web site, not about the CF Conventions themselves.
For issues about the CF Conventions, you can use the Trac issue
tracker cf-trac.llnl.gov/trac .
For general discussions of the CF Conventions, you can use the
CF-metadata mailing list, cf-metadata@cgd.ucar.edu. The Trac issue
tracker generally forwards its entries to this mailing list (There
is a provision for individuals to opt out of this feature).
In order to control spam, we allow only registered users to enter
issues on the Trac issue tracker. Maarten, if you give me your
email address and tell me what user name you want, I will register
you.
- Jeff Painter
On 3/3/15 9:02 AM, Maarten Sneep wrote:
In the current CF the following is needed to specify vertical
profiles with boundary information:
dimensions:
lon = 360;
lat = 180;
layer = 18;
vertices = 2;
variables: float lat(lat); lat:long_name = "latitude"; lat:units = "degrees_north"; lat:bounds = "lat_bnds"; float lon(lon); lon:long_name = "longitude"; lon:units = "degrees_east"; lon:bounds = "lon_bnds"; float pressure(layer, lon, lat); pressure:long_name = "pressure grid"; pressure:units = "hPa"; pressure:bounds = "pressure_bnds"; float lat_bnds(lon,vertices); float lon_bnds(lat,vertices); float pressure_bnds(layer,lon,lat,vertices); float O3(layer, lon, lat); O3:units = "1e-9"; O3:coordinates = "pressure";
In this example one of the vertical interfaces is the
tropopause pressure, which doesn't fit nicely into a compression
scheme or parametrization. However, the only useful vertical
grid are the interfaces, the pressures themselves are useless.
Furthermore the pressure grid is complete, i.e. the boundaries
fill the vertical completely from surface to the top pressure
without gaps. An extra dimension would suffice to describe the
vertical grid in a much more compact way:
dimensions:
lon = 360;
lat = 180;
layer = 18;
interfaces = 19;
variables: float lat(lat); lat:long_name = "latitude"; lat:units = "degrees_north"; lat:bounds = "lat_bnds"; float lon(lon); lon:long_name = "longitude"; lon:units = "degrees_east"; lon:bounds = "lon_bnds"; float pressure(interfases, lon, lat); pressure:long_name = "pressure interface grid"; pressure:units = "hPa"; float lat_bnds(lon,vertices); float lon_bnds(lat,vertices); float O3(layer, lon, lat); O3:units = "1e-9"; O3:coordinates = "pressure";
In the current CF conventions the pressure information is
almost 3 times larger than the data itself, which seems
inefficient. The n+1 method of storing the
interfaces is rather common, for instance it is used by ECMWF in
many of its datasets.
I propose to allow an 'off by one' method of storing vertical
information only as a pressure grid with only interfaces, rather
than pressure levels.
—
Reply to this email directly or view it
on GitHub.
I don't understand this reply. This is the github group on CF2, not the website tracker. I've been involved in the CF2 discussion for some time now, haven't received comments like this on earlier contributions. I fear that you are mistaken in this comment.
You are right - I'm sorry to have made the comment. I had thought your comment was in another issue tracker.
OK, no worries.
Dear Maarten
I understand the point. I expect you will not be surprised to know that the CF way of storing 2n bounds rather than (n+1) has been discussed a few times, starting in the early days. As I'm sure you understand, the reason we chose to do this is to allow for the possibility of non-contiguous or overlapping cells. Although these cases are not common, it is convenient to have only one convention which will always work - that makes things simpler for users and software. A second reason is that netCDF doesn't support arithmetic with dimension declarations, so that you would need apparently unrelated n and (n+1) dimensions, as in your example. I don't argue that everything in CF is the best possible choice, but we have made choices, and I think we should carry on in CF2 with the existing convention unless it's clearly wrong or inadequate.
The cost is the near-doubling of storage for bounds in the common case. This isn't usually a problem because coordinate variables are more often 1D. In your case, it adds storage of the size of the 3D data variable (because you have two bounds rather than one for your 3D auxiliary coord var). Even so, for a single file, is that really a problem? Your example is only 4.7 Mbyte for the data variable. I would suggest that we can live with this. If you have many data variables with the same issue in the same file, they share the coordinate variables, so the problem is proportionately less.
I think that the problem is most serious when you have many files, all of which have to contain a copy these large auxiliary coord variables. This sort of problem has also been raised before, not just for bounds, but for the aux coords themselves, which can also be large if multidimensional. Rather than change the bounds convention, I think we should address this larger problem, to which the obvious solution (suggested several times) is that the files should share a single copy of the coord vars. Because CF-netCDF has so far been a convention for single independent files, we haven't done this before. This situation requires a convention for treating several netCDF files as related. I am soon going to propose a simple convention for this, because it is needed for CMIP6, for exactly this reason i.e. to allow many netCDF files to share one copy of multidimensional grid metric variables. Would this be all right for you?
By the way, I do not actually agree that the coord vars are not useful. Even though you may not use them for computations with your ozone field, you (or someone else) might use it for plotting, for example. It may well be that the coord is a bit arbitrary, but it's still useful for a data-user to have a sensible suggestion from the data-writer about where the data value should be located.
Best wishes
Jonathan
Sharing a single copy is also not always possible, for instance if one of the interface levels is dynamic, such as a tropopause height. While you can store multiple profiles at a specific time, time evolution will be an issue. While I don't mind a convention for connecting data across files (URI's, DOI's, OpeNDAP, there is the internet after all, I'm interested how to maintain such a connection after downloading the file), I don't think it is addressing the same problem (although it can be an example of addressing a similar problem). I assume such a proposal will appear as an issue here, and I'll gladly comment on it there.
I agree that for 1D or even 2D variables the issue is a non-issue, although I've encountered data suppliers - who shall remain nameless - who objected to even that. There are many suppliers who will supply (n+1) datapoints for the interfaces, and call it a day, regardless what CF says. And what they skip are the coordinates, not the boundaries, with an argument: just take a point in the middle, it is meaningless anyway.
And yes: with 18 layers and 1 degree horizontal resolution, the problem is not a big issue at all. Go to 137 layers and 17 km horizontal resolution (even if that is reduced Gaussian grid), and it starts to hurt.
Maarten, Your proposal is elegant and efficient yet I'm ambivalent about it not only because Jonathan's arguments are persuasive, but for the additional reason that the "off-by-one" nature (relative to the host variable) of the vertical coordinate in the "coordinates" attribute is, without further clarification, likely to confuse many users. People will look at N levels in the host variable and N+1 in the vertical coordinates attribute and go, "huh?". I think some additional means of clarifying that the "off-by-one" is intentional would benefit the proposal. The string "pressure interface grid" in the long_name comes close, but something more standardized than a long_name attribute may be better. Best, Charlie
Sure, if we go this route, then this should be documented and probably supported with the appropriate standard_name values. Or did you have another attribute in mind?
Hi Maarten,
Adding distinct standard_name for interface variables is simple, yet
does not seem a fully satisfactory way to convey the off-by-one issue.
Then there would be two names for pressure, one for the pressure
whose vertical coordinate is at layer midpoints and one for the
pressure whose vertical coordinate is at layer interfaces.
A dataset containing both would then have variables with
standard_names like air_pressure and air_pressure_at_layer_interfaces.
While that is clearer (to me) than using air_pressure for the
off-by-one case, it still leaves something to be desired.
Mainly, it does not clearly convey the distinction between the two
pressures.
Maybe there is no crystal-clear way to convey off-by-one because it is a subtle property. I hope others who like your off-by-one proposal will chime-in with suggestions on whether and how to convey that the off-by-one convention applies to a particular variable. If nobody suggests something clearer than a new standard_name then so be it. Charlie
@czender and @MaartenSneepKNMI -- is this issue worth migrating to the cf-conventions repository? Trying to limit the sprawl and archive things for posterity.
Yes, this issue is worth migrating in my opinion
OK. @czender or @MaartenSneepKNMI -- would you be willing to summarize the issue in the latest template? Here is the markdown template:
**Title:** Short and descriptive.
**Moderator:** @user
**Requirement Summary:** A few sentence functional summary
**Technical Proposal Summary:** Brief proposal overview
**Benefits:** Who or what will benefit from this proposal?
**Status Quo:** Discussion of the current state CF and other standards.
**Detailed Proposal:** Complete proposal
Here's a thought on how to address this. Create an attribute for a coordinate variable that declares whether the values are to be considered as applying to the cells or as applying to the cell boundaries. The default (if the attribute is not present) is that the values are to be considered as applying to the cells. This is akin to the image processing schemes in which cell location information references the center of a cell vs those where the location information reference a corner of a cell. (Always fun when people confuse the two!)
If a coordinate variable is declared to be an "interface" coordinate, you would be required to provide the far interface (the N+1 interface) in the coordinate variable. The data variable would reference the same dimension as the coordinate variable, just as we do now, but the values in the N+1 hyperslab would be set to _FillValue
to denote that there is no data there. CDL for this would look something like:
dimensions:
lon = 360;
lat = 180;
interfaces = 19;
variables:
float pressure(interfaces, lon, lat);
pressure:long_name = "pressure grid";
pressure:units = "hPa";
pressure:coord_type = "interface";
float O3(interfaces, lon, lat);
O3:units = "1e-9";
O3:coordinates = "pressure";
The attribute coord_type
(which is a strawman name) with a value of "interface" indicates how to interpret the values.
This could apply to any coordinate variable if we want to allow it.
I understand that difficulties that are trying to be addressed, here, but I am not in favour of this proposal. This is partly because I agree with the arguments laid out by @JonathanGregory (https://github.com/cf-convention/cf-conventions/issues/190#issuecomment-77123170).
I'm also concerned about the dimension being sub-setted or reordered by a user. In general this would, I think, not be a well defined operation. For example, if the interfaces for 4 levels were 1 2 3 4 5
, what would the interfaces be if you created a new data variable from the first and third levels? 1 ? 4
? This is not a problem with the existing conventions.
There is shortly going to be an issue that proposes an extension that allows coordinates and bounds (and any other geo-locating variables) to be external variables, so you could legally place your coordinates and/or bounds in a separate file. I appreciate that this may not be appropriate for particular use-case, but your comments on that would be most welcome.
Thanks, David
@JimBiardCics I like your idea because it directly addresses the main issue highlighted by Maarten and simplifies the overall storage scheme. Moreover, netCDF will automatically populate the last level of N+1 dimensional space with _FillValue if the nc_put_vara() routines are called with only N dimensions. I think storing coordinate info in a separate file is great for CMIP-style archives of thousands of single variable files but is too heavyweight for most single-user applications.
I want to make sure I understand the motivation. Is the whole point to reduce the amount of storage occupied by the coordinate information? Or are there other practical or aesthetic considerations motivating the proposal?
@taylor13 I see the motivation as three-fold. There is the space issue, which I don't see as a major issue — particularly when compared with swath geolocation data. There is the repeated request by the user community for a scheme such as this. And then there is the conceptual difference between volume-oriented data and boundary-oriented data.
I'm more moved to make a change by the second two motivations, which I think are sometimes related. Certain disciplines prefer to associate data with the boundaries between cells as understood by CF rather than with points within the cells (such as the centers). CF can actually handle this on some level by specifying a boundary as the value in a coordinate variable and specifying the boundary and next boundary in a bounds variable, but if I have understood the discussions about this over the years, this approach is not intuitive for users in those disciplines. I think it may imply things that are incorrect in some cases.
As I said, I don't have direct experience with these sorts of difficulties, so if someone can do a better job of expressing the issues (assuming they do exist), please speak up.
thanks, @JimBiardCics . That's helpful.
I think the problem being addressed here could be better handled by extending the use of cell_methods to the vertical dimension. As a reminder, in the horizontal we often have a grid defined at all cells over a domain, but for some cells the variable represents only a portion of the cell area. For example, cell_methods = "area: mean over sea_ice"
indicates that the value reported applies only the the portion of the cell covered by sea ice. To subsequently compute an area-weighted mean of the quantity, one must access the grid cell areas and also the sea-ice area cell fraction.
For the use case described above, I assume that the layers are constant pressure thickness except the layer adjacent the tropopause and the layer adjacent to the surface. In this case I think the vertical coordinate could simply be pressure and the pressure bounds too could be constant across the globe . We could extend the cell_methods to include something like:
cell_methods="pressure: mean within troposphere"
or where troposphere
OR
cell_methods="pressure: mean above surface, below tropopause
where we would have a limited number of options on what could follow "within" (or "above" and "below"); for example, "troposphere", "stratosphere", etc. (or for the 2nd option "surface", "tropopause", "stratopause", etc.)
In order to determine the pressure thickness of the cells adjacent to the earth's surface and adjacent to the topopause, the data provider would have to provide the surface pressure field and the tropopause pressure field (just as the case of horizontal grids, the data provider has to provide the area fraction for sea_ice).
An advantage of this approach, is that only a single 3-d field is saved (supplemented by possibly two 2-d fields: surface pressure and tropopause pressure).
@taylor13 I think your suggestion is most suited to the case where the pressure thicknesses of the internal layers are constant across a level, as you assume. I (and perhaps @MaartenSneepKNMI ?) am more concerned with the general case where the thicknesses vary horizontally and vertically. A CF-compliant way of storing the vertical bounds that requires only one 3D array (interfaces only), rather than three 3D arrays (centers, lower boundaries, and upper boundaries) would I think be closer to addressing the OP. The "interface" coordinate type suggested by @JimBiardCics seems to do that without too much complexity or too little generality. @davidhassell poses the issue of well-defined interface coordinates for vertically discontiguous hyperslabs, where up to 2N interfaces (rather than N+1) are needed. It is a good question, though it does not exclude Jim's idea from consideration. A vertically discontiguous hyperslab with well-defined interfaces could be created by having two non-default coordinate types, "contiguous_interface" (with an N+1 dimension) and "discontiguous_interface" (with a 2N dimension).
@MaartenSneepKNMI opened this issue in 2015. There was some discussion at that time, and some more (after a four-year pause) in 2019. Since then there have been no further contributions. The contributors have included @czender, @davidhassell, @taylor13 and myself. The discussion was about how to store the vertical coordinate of layer interfaces of 3D grid cells when the vertical coordinate depends on 2D location. The main motivation for the issue was a wish to conserve space in files.
Does anyone have a present use-case for this, or another reason for taking it forward? I propose to close this issue, marked as dormant
, if no-one makes a new contribution to the issue before 9th January (four weeks from today i.e. allowing an extra week for the imminent holidays).
Thanks Jonathan. For the record, I'm fine with this issue being closed as "dormant" at this time.
I have no interest in carrying this forward, so fine with it being closed (unless, of course, someone objects with reason).
I agree with marking it "dormant".
In the current CF the following is needed to specify vertical profiles with boundary information:
In this example one of the vertical interfaces is the tropopause pressure, which doesn't fit nicely into a compression scheme or parametrization. However, the only useful vertical grid are the interfaces, the pressures themselves are useless. Furthermore the pressure grid is complete, i.e. the boundaries fill the vertical completely from surface to the top pressure without gaps. An extra dimension would suffice to describe the vertical grid in a much more compact way:
In the current CF conventions the pressure information is almost 3 times larger than the data itself, which seems inefficient. The
n+1
method of storing the interfaces is rather common, for instance it is used by ECMWF in many of its datasets.I propose to allow an 'off by one' method of storing vertical information only as a pressure grid with only interfaces, rather than pressure levels.