pvanlaake commented 1 month ago

Conventions for placing CF constructs in a group hierarchy

Moderator

To be decided

Moderator Status Review [last updated: 2024-07-29]

New issue created 2024-07-29, based on a discussion in https://github.com/orgs/cf-convention/discussions/333

Requirement Summary

NetCDF-4 introduced the concept of groups: a directory-like structure that distributes the contents of a data set over multiple groups. "They can be used to organize large numbers of variables" (netCDF User's Guide). The CF Conventions allow the use of groups, elaborating on scoping rules and introducing some restrictions on placement of coordinate variables and attributes. There is no guidance on how to organize large numbers of variables. This absence of guidance creates ambiguity and leaves open the possibility of creating unnecessarily complex data sets, complicating the task of software readers to correctly interpret the data set.

A CF-compliant data set may consist of a few netCDF variables (say, a data variable and three or four coordinate variables) that do not require specific guidance beyond what is currently in the convention text. There are also data sets, however, that are composed of many more netCDF variables (say, an atmosphere product of the Arctic region with XY axes with auxiliary lat-lon coordinate variables, a parametric Z axis with two terms, and a T axis, all with their bounds variables, and grid mapping - that's 13 netCDF variables associated with 1 data variable). Such more complex data variables can be organized in groups in different ways. Currently, data producers have no guidance from the conventions in terms of a preferred, recommended or required design pattern. Likewise, data readers have no a priori intel to interpret a data set and thus have to implement comprehensive search algorithms to locate CF constructs distributed over multiple groups.

As multiple data variables are added to a data set (the raison d'être for using groups), the need for structuring the contents of the data set quickly becomes apparent for two principal reasons:

Like-things tend to have like-names. Using groups, which are effectively namespaces, it is easier to identify which netCDF variables compose a data variable and name collisions can be avoided through application of the scoping rules.
Having multiple data variables in a single data set opens up the possibility of sharing CF constructs between data variables, such as grid mapping variables, domain variables, and coordinate variables (with their associated ancillary coordinate variables).

To fully unlock the potential of groups, while avoiding a proliferation of different approaches to organizing and sharing constructs, it is suggested to add guidance to the convention document to aid data producers in using a design pattern that is easily understood by software readers and end users.

Technical Proposal Summary

The intent of this proposal is to define a (small set of) general principle(s) on the distribution of CF constructs encompassing one or more data variables over multiple groups in a single data set. Based on the general principle(s), define a handful of conventions that provide practical guidance to data producers and readers. The agreed text of this proposal, if and when that stage is reached, is then to be integrated into the current section 2.7 of the conventions document, with clarifications and cross-references added to other sections of the conventions document.

This proposal centers on two main CF constructs: the data variable (DV) and the coordinate variable (CV). The logic behind this is that other CF constructs are typically related to one or the other in a dependent manner while the use of the additional constructs is (mostly?) mutually exclusive between the DV and the CV. Notable exceptions are grid mapping variables (these could be labeled naturally general as their definition does not depend on any other CF construct) and scalar coordinate variables (which for purposes of this discussion can be grouped with CVs).

General principle

Relative to a DV or a CV, elements that are general should be placed in the group of the DV or CV or an ancestor group thereof, while elements that are specific to the DV or CV should be placed in the group of the DV or CV or a child group thereof.

"General" are those elements that could (potentially) be shared between DVs or CVs. This includes grid mapping variables, CVs (for DVs), and terms for parametric vertical coordinates (for CVs). When general elements are shared between DVs or CVs in a data set and those DVs or CVs are located in multiple groups, the general elements should be located in an ancestor group of all affected DVs or CVs.

"Specific" are those elements that are applicable to a single DV or CV within the data set.
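As a minimal sketch of the rule for shared general elements, the deepest group that is an ancestor of (or equal to) every referring group can be computed from the group paths alone. The function name and the group names below are hypothetical and only serve as illustration; nothing here is part of the proposed convention text.

```python
import posixpath


def shared_element_group(referring_groups):
    """Return the deepest group that is an ancestor of, or equal to,
    every group in referring_groups (absolute group paths)."""
    # commonpath compares whole path components, so "/a/b" and "/a/bc"
    # share only "/a", not "/a/b".
    return posixpath.commonpath(referring_groups)


# Hypothetical layout: two data variables sharing one grid mapping variable.
print(shared_element_group(["/model/run1/atmos", "/model/run2/atmos"]))  # /model
# Data variables in unrelated branches can only share via the root group.
print(shared_element_group(["/obs", "/model/run1"]))  # /
```

Any ancestor of the returned group also satisfies the principle; the returned group is simply the candidate closest to the referring variables.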

Conventions

It is recommended to locate dimensions in the root group, unless the nature of the data set calls for a different organization. Dimensions may be redefined in a group lower in the hierarchy which then becomes the local apex group for variables located in that group or below the group with the dimensions.
When making out-of-group references, relative paths from the referring group are preferred over absolute paths.
Related elements should be located as closely together in the group hierarchy as is practical, preferring ancestor - descendant references over lateral ones (sibling groups). A general pattern is then that dimensions are defined in (or close to) the root group, followed by coordinate variables, then finally data variables.
Elements that are specific to a data variable or a coordinate variable should be located in the same group as the referencing data variable or coordinate variable, or a child group thereof. As an example, a bounds variable should be located in the same group as, or a child group of the referencing coordinate variable, scalar coordinate variable or auxiliary coordinate variable. [[Note: include a table to list specific constructs where this convention applies?]]
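The preference for relative paths and for ancestor - descendant references can be illustrated with a short sketch. It assumes groups are identified by absolute UNIX-style paths; the helper names `relative_reference` and `is_lateral` are hypothetical and not part of any netCDF or CF library.

```python
import posixpath


def relative_reference(referring_group, target_group, name):
    """Build a relative reference from referring_group to the variable
    `name` located in target_group (both absolute group paths)."""
    rel = posixpath.relpath(target_group, start=referring_group)
    # A variable in the referring group itself needs no path at all.
    return name if rel == "." else posixpath.join(rel, name)


def is_lateral(reference):
    """True if a relative reference travels up and then down again,
    that is, into a sibling branch rather than an ancestor or descendant."""
    groups = reference.split("/")[:-1]  # drop the variable name
    ups = [g == ".." for g in groups]
    return any(ups) and not all(ups)


# Bounds in a child group: a short descendant reference.
print(relative_reference("/climatology", "/climatology/bounds", "climatology_bounds"))
# Reference into a sibling group: discouraged as a lateral reference.
ref = relative_reference("/climatology", "/gridded_observations", "tas")
print(ref, is_lateral(ref))
```

A reference that `is_lateral` flags, or one with many `..` steps, suggests that the referenced element might be better placed in a common ancestor group.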

Terminology

The following changes to terminology in section 1.3 of the conventions document are proposed:

(new) absolute path: The path to a group that starts from the root group '/'.
ancestor group: The group from which the referring group is descended via a direct parent-child relationship over one or more levels.
(new) child group: A group descending directly from the referring group.
local apex group: From a referring group containing a data variable, the ancestor group in which a dimension of an out-of-group coordinate variable is defined. The word "apex" refers to the position of this group at the vertex of the tree of groups formed by it, the referring group, and the group where the coordinate variable is located.
nearest item: The variable or dimension that can be reached via the shortest traversal from the referring group using search by proximity, as set forth in the Section 2.7, "Groups".
(new) parent group: The direct ancestor group of the referring group.
path: A path is a sequence of group names from a referring group to another group, separated by forward slashes '/'. Paths must follow the UNIX-style path convention and may begin with either a '/', '..', or a group name.
relative path: The path to a group that starts from the referring group. Traversal of the group hierarchy is downwards (away from the root group). Upwards traversal is indicated by starting the path with one or more '..' (travel up one group), separated by '/'.
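The path terms above map closely onto ordinary UNIX path handling. The following sketch resolves absolute and relative group paths and walks the ancestor groups of a referring group. It is an assumption-laden illustration: the `contents` mapping (group path to the names defined there) is a hypothetical stand-in for a real data set, and `nearest_upward` models only the upward part of search by proximity, not the full rule in Section 2.7.

```python
import posixpath


def resolve(referring_group, path):
    """Resolve an absolute or relative group path against referring_group."""
    # posixpath.join discards referring_group when path starts with '/'.
    return posixpath.normpath(posixpath.join(referring_group, path))


def ancestors(group):
    """Yield the group itself, then its parent, and so on up to the root."""
    while True:
        yield group
        if group == "/":
            return
        group = posixpath.dirname(group)


def nearest_upward(referring_group, name, contents):
    """Return the first group, walking towards the root, that defines `name`,
    or None if no group on that walk defines it."""
    for group in ancestors(referring_group):
        if name in contents.get(group, ()):
            return group
    return None


# Hypothetical contents, loosely modelled on a data set that redefines
# the time dimension in a subgroup.
contents = {
    "/": {"lat", "lon", "time"},
    "/climatology": {"time", "nv"},
    "/climatology/bounds": {"climatology_bounds"},
}
print(resolve("/climatology", "bounds"))                      # /climatology/bounds
print(resolve("/climatology/bounds", "../.."))                # /
print(nearest_upward("/climatology/bounds", "time", contents))  # /climatology
```

In this layout, `/climatology` acts as the local apex group for `time`, while `lat` and `lon` are still found in the root group.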

Cross-references and edits to other sections

Cross-references from various sections to this text. Appendix I.

To be completed

Examples

netcdf group-redefinedT {
  :title = "Demonstration of grouping with redefined T axis" ;
  :Conventions = "CF-1.12" ;  // that's the ambition anyway

  dimensions:
    lat = 180 ;
    lon = 360 ;
    time = UNLIMITED ;

  variables:
    double lat(lat) ;
      lat:long_name = "latitude" ;
      lat:units = "degrees_north" ;
    double lon(lon) ;
      lon:long_name = "longitude" ;
      lon:units = "degrees_east" ;
    double time(time) ;
      time:long_name = "time" ;
      time:units = "days since 1991-01-01 00:00:00" ;
      time:calendar = "standard" ;

  group: gridded_observations {
    :comment = "Data variables in this group using root group CVs" ;

    variables:
      float tas(time, lat, lon) ;
        tas:long_name = "Surface air temperature" ;
        tas:units = "K" ;

      float prec(time, lat, lon) ;
        prec:long_name = "Precipitation flux" ;
        prec:units = "kg m-2 s-1" ;
  }

  group: climatology {
    :comment = "Climatology 1991 - 2020 by month of gridded observations" ;

    dimensions:
      time = 12 ;
      nv = 2 ;

    variables:
      double time(time) ;
        time:long_name = "time" ;
        time:units = "days since 1991-01-01 00:00:00" ;
        time:calendar = "standard" ;
        time:climatology = "bounds/climatology_bounds" ;
        time:comment = "Redefined time dimension for climatology, bounds in child group" ;

      float tas(time, lat, lon) ;
        tas:long_name = "Surface air temperature" ;
        tas:comment = "Using root group CVs for lon and lat, time CV redefined locally" ;
        tas:cell_methods = "time: mean within years time: mean over years" ;
        tas:units = "K" ;

    group: bounds {
      :comment = "Dimensions from parent group" ;

      variables:
        double climatology_bounds(time, nv) ;
    }
  }
}

More to be added.

Benefits

The proposed conventions will aid data producers in defining a design pattern for placing multiple data variables in a single data set. Data readers will benefit from interpreting data sets using groups with a compact yet complete set of guidelines that are not very dissimilar from reading a "flat" data set.

Status Quo

The conventions currently provide the option of using groups in netCDF-4 files with scoping rules and a few restrictions on coordinate variables and attributes but there is no guidance on how to distribute the netCDF variables that make up a single CF data variable over groups.

Associated pull request

A pull request has not yet been created.

davidhassell commented 1 month ago

Hello Patrick,

Thanks for putting this together. It is very clear, but I'm afraid that I'm not yet convinced by some of this. I like the new terminology definitions, but would like to see some more reasons why the new recommendations are as they are, as I currently think that they may not be necessary.

A few points/questions:

"Likewise, data readers have no a priori intel to interpret a data set and thus have to implement comprehensive search algorithms to locate CF constructs distributed over multiple groups."

"To fully unlock the potential of groups, while avoiding a proliferation of different approaches to organizing and sharing constructs, it is suggested to add guidance to the convention document to aid data producers in using a design pattern that is easily understood by software readers and end users."

Even if there were some a priori intel, the full search algorithm defined by CF would still need to be applied. I suppose you could write library software that employed a search algorithm that only looked in places according to these recommendations, but that would fail on CF-compliant files that didn't adhere to them, so I doubt anyone would do that!

"As multiple data variables are added to a data set (the raison d'être for using groups), ..."

I would say that there is no de facto reason for using groups. For instance:

It is just as reasonable to use groups to provide structure to the metadata for a dataset containing a single data variable as it is to provide structure to the metadata shared between multiple data variables.
Adding structure to purely encoding variables (such as DSG count and index variables).

I don't think we should restrict the data writer when they are creating a structure for a dataset, happy in the knowledge that the well-defined search algorithm will be applied by the data reader. For instance - where should you put an orography data variable (with its own referenced variables) that is also used as a formula term to a parametric vertical coordinate variable? I don't think there is any right answer to that ...

"When making out-of-group references, relative paths from the referring group are preferred over absolute paths."

Why is this preferred?

Thanks for your patience, David

pvanlaake commented 1 month ago

Hello David @davidhassell, I see your points on the language in the Requirements Summary and that can definitely use some tidying up. None of that is intended to make it into the conventions document, though.

On the "convincing" part of your post: section 2.7 currently has no conventions whatsoever that would guide data writers on how to distribute CF constructs and their defining netCDF dimensions and variables over a data set using groups. Note that I say "guide", not "restrict" as you put it: the four conventions are stated as recommendations, using "should" rather than "shall". Those four recommendations are meant to support the development of new data collections, not to invalidate existing ones or coerce data writers to apply a specific approach.

While it is certainly possible (and not particularly difficult) to implement the scoping rules in reader software, there is also beauty in the simplicity that comes from following conventions. Your question on relative versus absolute paths is a case in point: with relative paths the structure between the CF constructs and netCDF elements in the data set becomes apparent. A long relative path is an indication of a potential logical design flaw in the data organisation, even if it is fully compliant with the conventions. Absolute paths are a "lazy" and "blind" approach from that perspective.
