cf-convention / CF-2


Support for Groups in CF-2.0 #4

Closed MaartenSneepKNMI closed 5 years ago

MaartenSneepKNMI commented 10 years ago

Use case

While developing the file format guidelines for the upcoming Sentinel 5-precursor ESA earth observation mission, I ran into some limitations of the CF-1.6 conventions.

The number of output fields in our data products is large. To help our users distinguish the main output fields from the support data, we want to use groups. The main data contains for instance a total ozone column, its precision and the main geolocation. A simple quality indicator is included as well. This should suffice for basic usage. For us (retrieval algorithm developers) and other advanced users more details are needed, such as detailed processing flags, intermediate results, column values of trace gases that are fitted in addition to the main parameter, model parameters to translate a slant column to a vertical column, the slant columns themselves, pixel corners, etc. We don't want to bother most users with these details, and have therefore put these variables in another group.

The problem

The current CF-1.6 does not support this: references from a variable in one group to a variable in another are not supported. I will give a few solutions here as a starting point, and we will see where we end up. I've chosen one of these options as a stop-gap measure, but we are (within reason) flexible enough to support either of them.

Basic requirement for the solution

Variables that are linked to main variables, for instance via the 'ancillary_variables' attribute, but also in the 'bounds', 'coordinates' and probably other attributes as well, must use the same dimensions.

Reference structure

+ /PRODUCT
| /PRODUCT/scanline(scanline) (DIM)
| /PRODUCT/ground_pixel(ground_pixel) (DIM)
| /PRODUCT/corners(corners) (DIM)
| /PRODUCT/latitude(scanline, ground_pixel)
| /PRODUCT/longitude(scanline, ground_pixel)
| /PRODUCT/ozone_column(scanline, ground_pixel)
+ /PRODUCT/SUPPORT_DATA/processing_flags(scanline, ground_pixel)
| /PRODUCT/SUPPORT_DATA/latitude_bounds(scanline, ground_pixel, corners)
| /PRODUCT/SUPPORT_DATA/longitude_bounds(scanline, ground_pixel, corners)

Possible solution 1: Follow the scoping rules for dimensions

Follow the scoping rules for dimensions, and search all of the scope where the dimensions of the main variable can be used. The netCDF-4 C++ interface provides nice options for this, although more convenient support may be added to that interface later on.

In the example, the /PRODUCT/latitude variable has an attribute bounds with value latitude_bounds, while the /PRODUCT/SUPPORT_DATA/processing_flags variable has an attribute coordinates with value latitude longitude.

To find the actual variables, the application first finds the dimensions (using std::set<NcDim> netCDF::NcGroup::getDims(), with netCDF::NcGroup::ParentsAndCurrent as the search scope), then moves to the group where the dimension is defined (NcGroup NcDim::getParentGroup()), and finally finds the named variable within the scope of the dimension (using NcVar netCDF::NcGroup::getVar() with NcGroup::Location::ChildrenAndCurrent as the search scope). Other interfaces may make it harder to implement this pattern, but I think that is only a temporary limitation.
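The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Group` class, `defining_group` and `find_in_scope` are hypothetical stand-ins written for this example, not part of any netCDF library, and the file layout is the reference structure from this post.

```python
class Group:
    """Toy stand-in for a netCDF-4 group (hypothetical, not a library API)."""

    def __init__(self, name, parent=None, dims=(), variables=()):
        self.name = name
        self.parent = parent
        self.dims = set(dims)            # dimensions defined in this group
        self.variables = set(variables)  # short names of variables in this group
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        # Full name of the group, e.g. "/PRODUCT/SUPPORT_DATA"; "" for the root.
        if self.parent is None:
            return ""
        return self.parent.path() + "/" + self.name


def defining_group(start, dim):
    """Walk from `start` towards the root until a group defines `dim`."""
    group = start
    while group is not None:
        if dim in group.dims:
            return group
        group = group.parent
    raise KeyError(dim)


def find_in_scope(scope_root, short_name):
    """Search `scope_root` and all of its descendants for a variable."""
    matches = []
    stack = [scope_root]
    while stack:
        group = stack.pop()
        if short_name in group.variables:
            matches.append(group.path() + "/" + short_name)
        stack.extend(group.children)
    return matches


# Reference structure from this post.
root = Group("")
product = Group("PRODUCT", root,
                dims={"scanline", "ground_pixel", "corners"},
                variables={"latitude", "longitude", "ozone_column"})
support = Group("SUPPORT_DATA", product,
                variables={"processing_flags", "latitude_bounds",
                           "longitude_bounds"})

# processing_flags lives in SUPPORT_DATA and refers to "latitude".
scope = defining_group(support, "scanline")
found = find_in_scope(scope, "latitude")

# latitude lives in PRODUCT and refers to "latitude_bounds".
bounds = find_in_scope(defining_group(product, "scanline"), "latitude_bounds")
```

`find_in_scope` returns a list rather than a single match, so a caller can detect the case where the same short name occurs more than once within the dimension's scope.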

Note that this places other restrictions on the file, such as the inability to use the same name for a variable in two different groups within the same dimension search scope. I'm not sure this is a restriction at all, but it is something to keep in mind.

Possible solution 2: Use HDF-5 paths to point to linked variables.

This solution is more explicit, and uses HDF-5 paths to explicitly point to the location of a linked variable.

In the example, the /PRODUCT/latitude variable has an attribute bounds with value SUPPORT_DATA/latitude_bounds, while the /PRODUCT/SUPPORT_DATA/processing_flags variable has an attribute coordinates with value /PRODUCT/latitude /PRODUCT/longitude.

To find the actual variables, some string manipulations are needed to find the group names, and then finding the variables is probably fairly straightforward.

Note that this solution uses the fact that the / character is used as a path separator in HDF-5 (and can therefore not occur in a variable- or group-name). This method puts a restriction on group names in that these should not contain spaces, as the lists of variables are space separated. A similar restriction is already in place (implicitly) on variable names in CF-1.6.
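The string manipulation involved is small. The following is a minimal sketch, assuming that group paths are handled as plain strings and that a name without a leading / is taken relative to the group of the referring variable; `resolve_reference` is a hypothetical helper, not a netCDF call.

```python
import posixpath


def resolve_reference(current_group, attribute_value):
    """Turn a space-separated attribute value into absolute variable paths.

    `current_group` is the full path of the group that holds the referring
    variable, e.g. "/PRODUCT".
    """
    resolved = []
    for name in attribute_value.split():
        if name.startswith("/"):
            # Already a path from the root of the file.
            resolved.append(posixpath.normpath(name))
        else:
            # Interpret the name relative to the referring variable's group.
            resolved.append(posixpath.normpath(posixpath.join(current_group, name)))
    return resolved


# The two attributes from the example in this post.
bounds = resolve_reference("/PRODUCT", "SUPPORT_DATA/latitude_bounds")
coords = resolve_reference("/PRODUCT/SUPPORT_DATA",
                           "/PRODUCT/latitude /PRODUCT/longitude")
```

Splitting on whitespace is what makes spaces in group or variable names unusable, which is the restriction noted above.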

General note on variable names.

Within the S5P project we have put a restriction on 'element' names (groups, variables, attributes). NetCDF-4 allows an element name like "χ²" (\u03C7\u00B2). This is probably very good for human readability, but accessing the field from a program or script (non-interactively) is probably pretty hard. To get the string into this text file I went into an interactive python3 shell, and asked it to print("\u03C7\u00B2"), and those numbers were obtained from a website. Other computer systems may offer more convenient access.

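We use the following restrictions:

  • The names of NetCDF-4 elements must match the regular expression: [a-zA-Z][a-zA-Z0-9]*. This means that the name of a NetCDF-4 element can be used as a variable name in most programming languages.
  • The names of NetCDF-4 elements use underscores to separate parts within a name. An exception to this rule is formed by attributes whose name is specified by an external standard or recommendation, such as the CF metadata conventions.
  • The names of variables are all lower case, with the exception of chemical species and abbreviations.
  • The names of groups are all upper case.
  • It is recommended to limit the names of elements to 40 characters or less.
  • Element names that only differ in capitalization are not allowed.
  • It is strongly recommended to ensure that names of variables are unique within a file.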

The first limitation is for instance nice when using the (HDF-5) PyTables interface for Python, as it allows simple dot-notation to access variables in a file, but requires that all element names are valid Python variable names. Adding a similar interface to the Python netCDF4 package is on my (far too long) to-do list.
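A few of these restrictions are easy to check mechanically. The sketch below is an illustration only: `check_names` is a hypothetical helper, and it covers just the pattern, the 40-character recommendation and the capitalization rule. The pattern is used exactly as quoted in this thread; as quoted it does not admit underscores, so it is passed as a parameter and a different pattern can be supplied.

```python
import re

# Pattern exactly as quoted in this thread (an assumption that it is complete).
NAME_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
MAX_LENGTH = 40  # recommended upper limit on name length


def check_names(names, pattern=NAME_PATTERN):
    """Return a list of (name, problem) tuples for a collection of element names."""
    problems = []
    seen = {}  # lower-cased name -> first spelling encountered
    for name in names:
        if not pattern.fullmatch(name):
            problems.append((name, "does not match pattern"))
        if len(name) > MAX_LENGTH:
            problems.append((name, "longer than 40 characters"))
        key = name.lower()
        if key in seen and seen[key] != name:
            problems.append(
                (name, "differs only in capitalization from " + seen[key]))
        seen.setdefault(key, name)
    return problems


problems = check_names(["O3", "\u03C7\u00B2", "latitude", "Latitude"])
```

The upper-case rule for groups and the lower-case rule for variables are left out, because the exception for chemical species and abbreviations cannot be decided from the name alone.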

Notes

See summary below. The variable name restrictions now have their own issue, #5.

BobSimons commented 10 years ago

Have you considered not using groups and instead using an attribute to differentiate the main output variables from the support variables?
Then the problems related to references to another group go away.

On 2014-10-20 3:32 AM, Maarten Sneep wrote:

We use the following restrictions:

  • The names of NetCDF-4 elements must match the regular expression: [a-zA-Z][a-zA-Z0-9]*. This means that the name of a NetCDF-4 element can be used as a variable name in most programming languages.
  • The names of NetCDF-4 elements use underscores to separate parts within a name. An exception to this rule is formed by attributes whose name is specified by an external standard or recommendation, such as the CF metadata conventions.
  • The names of variables are all lower case, with the exception of chemical species and abbreviations.
  • The names of groups are all upper case.
  • It is recommended to limit the names of elements to 40 characters or less.
  • Element names that only differ in capitalization are not allowed.
  • It is strongly recommended to ensure that names of variables are unique within a file.


Sincerely,

Bob Simons, IT Specialist, Environmental Research Division, NOAA Southwest Fisheries Science Center

MaartenSneepKNMI commented 10 years ago

I do not see how adding an attribute to differentiate resolves the issue of having a large number of variables and a desire to provide some (hierarchical) structure in that mountain of data.

JohnLCaron commented 10 years ago

Hi Maarten:

A few comments:

1) the CDM has the notion of "full name" and "short name" of a variable. For netcdf-4, the full name corresponds to the HDF-5 path, as you describe. In HDF-5, there can be more than one path to the same variable, but not in netCDF-4. So I tend not to use "HDF-5 path" in order to keep this distinction clear.

2) CF has a few places where a reference to a variable is in an attribute string, like the bounds variable of your example and also the "coordinates" attribute. Using the "full name" in this case seems like the easiest thing to do for full generality. eg :coordinates = "/PRODUCT/lat /PRODUCT/lon /PRODUCT/time";

So I agree with your solution #2.

3) However, the simplicity of the current convention is nice to keep, so a variant of your "scoping rule" proposal would be to use relative paths. If the name starts with a "/", it's a full path name. Otherwise, it names a variable relative to the current group. The common case is that it names a variable in the same group, and so looks like the current CF convention. eg :coordinates = "lat lon time";

I think this should also be supported. Both are unambiguous and simple to implement.

4) I agree that object names (variable, attribute, group) should be limited, so that they can be used, eg for program variable names. [a-zA-Z][a-zA-Z0-9]* is a good choice. If one wants to use a different character set (eg Chinese) then put it in an attribute of the variable.

Regards, John

MaartenSneepKNMI commented 10 years ago

Thanks John, I'll wait for a few more comments, and then rephrase the proposal. "full name" and "short name" are good names to use if these are used in the CDM. There are functions in the NetCDF interface to obtain the full name for a group but not a variable, but this is probably a rather simple set of calls anyway. I fully agree that whatever solution we come up with, the current short name use for variables in the same group shall be valid.

I think I'll split off the variable name issue into a separate one.

MaartenSneepKNMI commented 9 years ago

Discussion Summary

Use case & proposal by Maarten Sneep MaartenSneepKNMI

Use case

While developing the file format guidelines for the upcoming Sentinel 5-precursor ESA earth observation mission, I ran into some limitations of the CF-1.6 conventions.

The number of output fields in our data products is large. To help our users distinguish the main output fields from the support data, we want to use groups. The main data contains for instance a total ozone column, its precision and the main geolocation fields (as we have an observation swath, not a regular projection). A simple quality indicator is included at this level as well. This should suffice for basic usage.

For us (retrieval algorithm developers) and other advanced users more details are needed, such as detailed processing flags, intermediate results, column values of trace gases that are fitted in addition to the main parameter, model parameters to translate a slant column to a vertical column, the slant columns themselves, pixel corners, fit quality parameters such as a χ² value, etc. We don’t want to bother most users with these details, and have therefore put these variables in another group.

The problem

The current CF-1.6 does not support this. References from a variable in one group to one in another are not supported. Based on early feedback in the issue discussions, I propose the solution described below for CF-2.0.

Basic requirements and limitations for the solution

  1. The current behaviour and rules that are available in CF-1.6 for variables in the same group should be part of CF-2.0.
  2. Support variables that are linked to a variable, for instance via the ‘ancillary_variables’ attribute, but also via the ‘bounds’, ‘coordinates’ and probably other attributes as well, must use the same dimensions as the originating variable (although others may be added, for instance for the variables referenced from the ‘bounds’ attribute).

The first is to avoid needless backward incompatibility, the second is to ensure that variables match to each other.

Terminology

Key in the proposal are the variable names that are used in the attributes that reference other variables. Following the “CDM Object Names” page, we use its terms “short name” and “full name” for these.

Suggested solution

Use full names in the attribute attached to the source variable to describe the location of the referenced variable, if the source variable and the destination variable are in different groups.

For variables in the same group the short name shall be used.

Notes

  1. The full name as defined on the “CDM Object Names” page starts with the first group name, without an explicit root. To more easily distinguish between short and full names, I suggest extending the full name to include the root, so that full names start with a ‘/’. This corresponds to the conventions for an “HDF-5 path”.
  2. Because lists of variables are space separated, the space character cannot be a legal character within an object name in a CF compliant NetCDF-4 file. This is not new, as the same restriction applied to variable names within CF-1.6. See issue #5 for a more detailed discussion on object names.
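The rule in the suggested solution (short name within the same group, rooted full name otherwise) can be written down as a small helper. This is a sketch under the assumption that group paths are available as strings such as "/PRODUCT"; `full_name` and `reference_name` are hypothetical names chosen for this example.

```python
def full_name(group_path, short_name):
    """Build a rooted full name from a group path and a short name."""
    return group_path.rstrip("/") + "/" + short_name


def reference_name(source_group, target_group, target_short_name):
    """Name to write into an attribute of a variable in `source_group`."""
    if " " in target_short_name or "/" in target_short_name:
        # Spaces separate list entries and "/" separates groups.
        raise ValueError("illegal character in name: " + repr(target_short_name))
    if source_group.rstrip("/") == target_group.rstrip("/"):
        return target_short_name                      # same group: short name
    return full_name(target_group, target_short_name)  # other group: full name


same = reference_name("/PRODUCT", "/PRODUCT", "longitude")
other = reference_name("/PRODUCT", "/PRODUCT/SUPPORT_DATA", "latitude_bounds")
```

Because a full name always starts with ‘/’ and a short name never contains one, a reader can tell the two forms apart without further context.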

Example

Imagine the following set of groups, dimensions and variables.

+ /PRODUCT
| /PRODUCT/scanline(scanline) (DIM)
| /PRODUCT/ground_pixel(ground_pixel) (DIM)
| /PRODUCT/corners(corners) (DIM)
| /PRODUCT/latitude(scanline, ground_pixel)
| /PRODUCT/longitude(scanline, ground_pixel)
| /PRODUCT/ozone_column(scanline, ground_pixel)
+ /PRODUCT/SUPPORT_DATA/processing_flags(scanline, ground_pixel)
| /PRODUCT/SUPPORT_DATA/latitude_bounds(scanline, ground_pixel, corners)
| /PRODUCT/SUPPORT_DATA/longitude_bounds(scanline, ground_pixel, corners)

To find the actual variables, some string manipulations are needed to find the group names, and then finding the variables is probably fairly straightforward. Apparently the Java-NetCDF interface already has such a call.

czender commented 9 years ago

Hello Maarten et al.,

Thanks to Aleksandr Jelenek for pointing me to this discussion. Maarten asked me this question a year ago and my response could be characterized as the opposite of what John recommends :) Here are the three plausible options mentioned so far:

  1. :coordinates = "lat lon time"; // Scoping rules determine nearest ancestor
  2. :coordinates = "/absolute_path/lat /absolute_path/lon /absolute_path/time";
  3. :coordinates = "relative_path/lat relative_path/lon relative_path/time";

Option 1 uses scoping rules to disambiguate which variables the coordinates attributes refer to. Scoping obviates the need for relative or absolute paths. Datasets created this way can be most easily dismembered and reassembled without invalidating the metadata, or needing to recompute paths based on the locations of groups in the new file.

Option 2, full paths, will lead to orphaned coordinates once the original group is extracted into a new file. So will Option 3, except in the degenerate case where the lat/lon/time variables are in the same group as the referring variable.

The best argument for recommending Option 2 is its specificity: it's unambiguous. Option 3 is less specific, yet subject to the same downsides as Option 2 after downstream processing. Hardcoded paths will lose their self-consistency as users dream of new ways to recombine and aggregate measurements and models.
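The orphaning effect is easy to reproduce with a toy model. The sketch below assumes a file can be represented as a dictionary from full variable names to attribute dictionaries; `move_subtree`, `dangling` and the `/GRANULE_1` prefix are hypothetical and chosen only for this illustration. The move deliberately leaves attribute values untouched, as a simple subsetting tool might, and only absolute references are checked.

```python
def move_subtree(variables, old_prefix, new_prefix):
    """Re-root every variable under `old_prefix`; attribute values stay as they are."""
    moved = {}
    for path, attrs in variables.items():
        if path.startswith(old_prefix + "/"):
            path = new_prefix + path[len(old_prefix):]
        moved[path] = dict(attrs)
    return moved


def dangling(variables):
    """List (variable, reference) pairs whose absolute reference has no target."""
    missing = []
    for path, attrs in variables.items():
        for name in attrs.get("coordinates", "").split():
            if name.startswith("/") and name not in variables:
                missing.append((path, name))
    return missing


original = {
    "/PRODUCT/latitude": {},
    "/PRODUCT/longitude": {},
    "/PRODUCT/SUPPORT_DATA/processing_flags": {
        "coordinates": "/PRODUCT/latitude /PRODUCT/longitude",
    },
}

# Place the whole product under a new parent group in another file.
moved = move_subtree(original, "/PRODUCT", "/GRANULE_1/PRODUCT")
```

In this model `dangling(original)` is empty, while `dangling(moved)` reports both coordinate references, even though the coordinate variables themselves were moved along with the group.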

The most elegant and resilient solution seems to be Option 1. Scoping variables with the same rules as dimensions maintains consistency in the "coordinates" attributes at all processing stages.

I see some virtue of supporting Option 2 and Option 3, namely, CF2 ought to support full names instead of short names in all, or at least most, instances, to exploit the CDM. Thus I advocate Option 1 as the recommended solution to Maarten's issue, while retaining Options 2 and 3 as "legal", yet not recommended.

Finally, Maarten noted that the existing API could be enhanced to more easily locate the nearest "in-scope" variable. Should Option 1 gain traction, it would be important to follow up on this to ensure it is easier for tool developers to implement. We had a tough time implementing support for Option 1 when we implemented support for extracting ancillary and coordinate variables from hierarchical datasets in NCO. Implementing Options 2 and 3 was straightforward. With a few additions to the netCDF library API, Option 1 would be as easy to implement, and a more desirable and robust design as well.

Best, Charlie

MaartenSneepKNMI commented 9 years ago

I'm not sure I would endorse the implicit scoping and make the explicit 'legal but not recommended', but it is an option. I prefer to be explicit. If you are going to copy variables, you'd better ensure that the correct support variables come along anyway. I'd rather create a broken file (that will fail to validate noisily) than a file that appears to be correct, but references the wrong data.

czender commented 9 years ago

I hope CF2 allows full names and short names in most cases. When short names are used, the convention would be that the most in-scope variable with the short name matches. When long names are used, then there is no ambiguity. Relative names (not beginning with a slash) could also be supported. Absolute paths are likelier to break in downstream processing when things are moved around. Implicit scoping is truer to the relational properties and inheritance implied by the CDM. I see no reason why both implicit scoping and absolute paths cannot both be CF-compliant. I advocate implicit scoping over explicit long names where possible because the former preserves more of an object oriented model while the latter seems fragile. I appreciate that when things break they should break noisily, however, that does not seem to me to warrant sacrificing the elegance of the extended CDM.

MaartenSneepKNMI commented 9 years ago

There is one objection I have against following the scoping rules for dimensions. We are talking here about support variables. If you want to assign a hierarchy, then dimensions are higher up in the tree, because they are needed to define a variable. This order is regulated at the netCDF-4 level, and is fine. The variables that are linked via attributes are support variables, items like processing flags, or other ancillary variables. Conceptually I'd say these are below the main variable. Following the same scoping rules doesn't make sense to me. In many cases most users shouldn't have to look at processing flags, and you want these variables at a lower level (deeper into the file hierarchy). This is precisely opposite to what the scoping rules for dimensions dictate. On the other hand, variables like geolocations are pretty essential, and you want to have those easily visible.

So I'm curious as to what scoping rules for implicit finding of variables are proposed? How do you deal with the potential appearance of variables with the same name in different groups? Implicit may be nice, but somewhere it should be defined what is meant (i.e. in CF-2, if we choose to go this route). I'm also curious for the reasoning behind your proposed rules, as both cases of hierarchy exist, and I don't think you can cover both implicitly.

And finally: I don't see how fragility can be an issue: you will have to ensure that support variables are copied anyway and referenced correctly. With explicit referencing you'll notice errors far quicker. With implicit referencing you may accidentally refer to a variable that is present in the destination file, but is different from the intended support variable (same name, different contents, say the latitude and longitude from a different granule altogether).

czender commented 9 years ago

Hi Maarten,

My suggestion is that CF2 recommend that support variables be identified by implicit scoping where possible, and by relative or absolute paths where necessary (i.e., in groups outside the scope of the host variable). That is for dataset creation. For dataset reading, a support variable would then be searched for according to how it is specified. No slashes means implicit scoping. A non-leading slash means relative path. A leading slash means absolute path. No ambiguity there, and the support variables may be anywhere in the file. Unlike you, I expect the dominant use case across all users will be that support variables will be in scope of the variables, but that remains to be seen. Your point is well-taken that there are conceptual reasons to store some support variables beneath the host variable, though I think this is unlikely for "coordinates", and "ancillary" is anyone's guess.
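The three-way reading rule is purely syntactic, which a short sketch makes concrete. `reference_kind` and `classify_references` are hypothetical helper names, and the attribute value in the usage line is made up for illustration.

```python
def reference_kind(name):
    """Classify one entry of a space-separated reference attribute."""
    if name.startswith("/"):
        return "absolute"   # leading slash: path from the root group
    if "/" in name:
        return "relative"   # non-leading slash: path from the current group
    return "scoped"         # no slash: resolve by implicit scoping


def classify_references(attribute_value):
    """Classify every entry of an attribute such as "coordinates"."""
    return [(name, reference_kind(name)) for name in attribute_value.split()]


kinds = classify_references("lat SUPPORT_DATA/flags /PRODUCT/time")
```

A reader would then dispatch on the second element of each pair to the matching lookup: a scope search, a path join against the current group, or a direct lookup from the root.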

Now let me address your questions in reverse order: Variables with the same name in different groups are not a problem. Scoping or relative/absolute paths find the specified variable. Fragility is an issue because, based on my experience doing this with NCO, it is much easier to write a tool that moves or subsets groups from a file into a new arrangement without altering the paths in the "coordinates"/"ancillary" attributes than to write a tool that subsets and does alter those contents (because the netCDF API does nothing to keep the metadata consistent). Therefore people will create such tools. They will do the job well enough for many purposes, and lead to more broken links when support variables are identified with absolute paths.

Best, c

dblodgett-usgs commented 5 years ago

Fixed in https://github.com/cf-convention/cf-conventions/pull/145