cf-convention / discuss

A forum for any discussion about interpretation, clarification, and proposals for changes or extensions to the CF conventions.

Proposal for a hierarchical parent-child metadata standard #62

Open deeplycloudy opened 4 years ago

deeplycloudy commented 4 years ago

Title

CF-Tree: Hierarchical parent-child data in the Climate and Forecast Metadata framework

Authors

Eric Bruning (@deeplycloudy): Texas Tech University, Lubbock, TX
Ethan Davis (@ethanrd), Ryan May (@dopplershift), Sean Arms (@lesserwhirls): UCAR/Unidata, Boulder, CO

Requirement Summary

This is a proposal to define a formal metadata standard under the CF conventions for hierarchical parent-child relationships of arbitrary depth, for data with zero to many associated spatiotemporal or other dimensions. We propose the name CF-tree to help the user picture the data linkages implied by the metadata.

Much of the text here is repeated from the complete proposal linked below, which should be fully open for comments. I kept things brief here out of courtesy, but am not opposed to pasting the complete proposal as necessary. I assume the complete proposal text, if encouraged, will eventually need to be posted in full as an issue/PR on the cf-conventions repository.

Technical Proposal Summary

The datasets to which this proposed standard applies have in common a parent-child ID variable that links two or more tree dimensions. Other variables might be associated with these IDs by dimension, or by explicit use of the ID key on a variable that does not share the dimension.

Such data structures are like the foreign key relationships used in databases. They specify a one-to-many hierarchical relationship. This may be visualized as a directed graph down a tree. Other relevant theory includes connected-component labeling.
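To make the foreign-key analogy concrete, here is a minimal Python sketch (not part of the proposal) that checks whether a child-to-parent ID variable describes a valid one-to-many relation and, if so, lists the children of each parent. The function name `children_by_parent`, the variable names, and the sample values are hypothetical and only serve the illustration.

```python
def children_by_parent(parent_ids, child_parent_ids):
    """Map each parent ID to the positions of its children.

    Raises ValueError if parent IDs are not unique, or if a child refers
    to a parent ID that does not exist (a dangling foreign key).
    """
    known = set(parent_ids)
    if len(known) != len(parent_ids):
        raise ValueError("parent IDs must be unique")
    children = {pid: [] for pid in parent_ids}
    for pos, pid in enumerate(child_parent_ids):
        if pid not in known:
            raise ValueError(f"child {pos} refers to unknown parent {pid}")
        children[pid].append(pos)
    return children


# Hypothetical sample: three parents, five children.
group_id = [10, 11, 12]
event_parent_group_id = [10, 10, 12, 11, 12]
print(children_by_parent(group_id, event_parent_group_id))
# {10: [0, 1], 11: [3], 12: [2, 4]}
```

Because every child carries exactly one parent ID, the one-to-many constraint holds by construction; the only things left to verify are that parent IDs are unique and that no reference dangles.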

Recognizing these common features, the mind wanders to many-to-many relationships, arbitrary and possibly directed graph structures, and even the specification of unstructured grids or, better, standardized handling of ragged arrays. However, those applications are out of scope for this proposal, which is focused on the more straightforward one-to-many problem.

We also note that database-like groupby functionality and labeled coordinate indexing exist in two popular Python data science libraries, pandas and xarray. The latter is aimed at extending the ideas in pandas to multidimensional data, and seeks to implement the CF conventions. In our proof-of-concept implementation of machine-facilitated traversal of the Geostationary Lightning Mapper data tree, we recognized several fundamental operations, which we implemented in a generic way using xarray.groupby.

Benefits

Development of a concrete metadata standard would stimulate progress toward a standardized implementation of machine-automated traversal of hierarchical tree structures in CF-honoring packages (e.g., xarray) that are directly used by domain science practitioners.

This proposal arises from our work with the Geostationary Lightning Mapper (GLM) on the GOES-16 and GOES-17 meteorological satellites, which already implement the data model proposed herein; we are proposing a formalization of some implicit conventions and minor additions to flag the conformance to those conventions.

The GLM traversal code example we developed to demonstrate this is open source, and includes unit tests for data structures beyond GLM itself. The future we have in mind is sufficient metadata so that xarray or similar libraries could recognize the hierarchical structure, and perform such fundamental operations for the user without the user having to walk the tree. Right now, the traversal is manually configured from user knowledge of the hierarchy’s ID variables, and probably not as generalized as it could be.
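As a rough illustration of what "walking the tree" involves, the following library-free Python sketch moves up and down a hierarchy given one parent-index array per level. It is not the GLM traversal code mentioned above; the function names, the `event_parent` and `group_parent` arrays, and their values are assumptions made for this example. A real implementation would more likely operate on xarray or NumPy arrays.

```python
def ancestors(levels, leaf_index):
    """Walk up the tree from one element at the finest level.

    `levels` lists the parent-index arrays from the finest level upward,
    e.g. [event -> group, group -> flash]. Returns the index reached at
    each coarser level, finest first.
    """
    path = []
    idx = leaf_index
    for parent_index in levels:
        idx = parent_index[idx]
        path.append(idx)
    return path


def descendants(levels, root_index):
    """Walk down the tree from one element at the coarsest level.

    Returns the sorted indices selected at each finer level, coarsest first.
    """
    selected = {root_index}
    result = []
    for parent_index in reversed(levels):
        selected = {i for i, p in enumerate(parent_index) if p in selected}
        result.append(sorted(selected))
    return result


# Hypothetical three-level tree: 10 events, 6 groups, 2 flashes.
event_parent = [0, 0, 1, 2, 2, 3, 3, 4, 5, 5]  # event -> group index
group_parent = [0, 0, 0, 1, 1, 1]              # group -> flash index
levels = [event_parent, group_parent]

print(ancestors(levels, 7))    # [4, 1]: event 7 is in group 4, flash 1
print(descendants(levels, 1))  # [[3, 4, 5], [5, 6, 7, 8, 9]]
```

The point of the proposed metadata is that a library could assemble the `levels` list on its own, instead of relying on the user to know which variables hold the parent links.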

We foresee application to lightning datasets, thunderstorm cell tracking, and weather and climate model validation, among others as detailed in the complete proposal linked below.

Status Quo

Our proposal is related to some elements of the discrete sampling geometry standards, but extends those ideas to generalized one-to-many foreign key-type data models.

Associated pull request

None yet. Please see some proposed uses of CF-tree, including draft format specifications, as linked in the detailed proposal.

Detailed Proposal

Please see the Google Doc we prepared as part of our development of this proposal.

ethanrd commented 4 years ago

Hi all - The GOES Geostationary Lightning Mapper (GLM) data is stored as netCDF-4 files containing CF DSG point data (though running one through the CF checker gave me some errors). Each file contains three different point data features each with its own observation dimension. The three point data features (events, groups, and flashes) are related by a hierarchical parent-child relationship. It is this hierarchical parent-child relationship the CF-Tree proposal would like to standardize.

In the GLM data, each event represents a pixel where lightning was detected during a time step. Each group represents a cluster (in time and space) of events. Each flash represents a cluster (again in time and space) of groups.

Here’s an example CDL (a simplified version of the CDL in the CF-Tree Google Document that Eric, @deeplycloudy, referenced above):

  dimensions:
      event = 10 ;
      group = 6 ;
      flash = 2 ;

  variables:
      float event_lat(event) ;
      float event_lon(event) ;
      float event_time(event) ;
      float event_energy(event) ;
          event_energy:coordinates = "event_time event_lat event_lon" ;

      int event_parent_group_index(event) ;  // Contains the index of the containing group.
                                             // Connects each event to a single group
                                             // (each group contains multiple events).
          event_parent_group_index:instance_dimension = "group" ;

      float group_lat(group) ;
      float group_lon(group) ;
      float group_time(group) ;
      float group_energy(group) ;
          group_energy:coordinates = "group_time group_lat group_lon" ;
      float group_area(group) ;
          group_area:coordinates = "group_time group_lat group_lon" ;

      int group_parent_flash_index(group) ;  // Contains the index of the containing flash.
                                             // Connects each group to a single flash
                                             // (each flash contains multiple groups).
          group_parent_flash_index:instance_dimension = "flash" ;

      float flash_lat(flash) ;
      float flash_lon(flash) ;
      float flash_time(flash) ;
      float flash_energy(flash) ;
          flash_energy:coordinates = "flash_time flash_lat flash_lon" ;
      float flash_area(flash) ;
          flash_area:coordinates = "flash_time flash_lat flash_lon" ;

The above CDL is the bare bones of the CF-Tree proposal. The CF-Tree document proposes some explicit declarations of the various roles (cf_role = "tree_id") and relationships (new parent and child attributes). It also uses ID values instead of index values to connect parent to child. I've simplified and changed the CDL a bit to clarify some connections to current CF constructs. For instance, the event_parent_group_index and group_parent_flash_index variables are similar to the index variable described in CF Section 9.3.4, "Indexed ragged array representation". So I've added instance_dimension attributes above (which aren't in the CF-Tree doc) to indicate the "parent" instance index.
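To show how the two ways of linking parent and child relate, here is a small Python sketch that converts ID-based references into the index-based form used in the CDL above. It assumes 0-based indices along the parent dimension; the variable names `group_id` and `event_parent_group_id` and all values are hypothetical.

```python
def ids_to_indices(parent_ids, child_parent_ids):
    """Convert parent ID references into 0-based positions along the
    parent dimension. An unknown ID raises KeyError."""
    position = {pid: i for i, pid in enumerate(parent_ids)}
    return [position[pid] for pid in child_parent_ids]


# Hypothetical values: IDs need not be contiguous or start at zero.
group_id = [507, 508, 512]
event_parent_group_id = [508, 507, 507, 512]

event_parent_group_index = ids_to_indices(group_id, event_parent_group_id)
print(event_parent_group_index)  # [1, 0, 0, 2]
```

The index form is only valid as long as the parent dimension keeps its order, whereas the ID form survives subsetting or reordering of the parents. That trade-off is worth keeping in mind when choosing between the two.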

So, to get started, any thoughts on how/where this might fit in CF? To me, the CF-Tree concept seems related to Cell Bounds and Geometries -- in the GLM example, the events in a particular group represent that group’s cell bounds (or extent). On the other hand, while all the examples in the CF-Tree document involve clustering along (space/time) coordinates, the CF-Tree construct isn't explicitly related to the coordinates.

@davidhassell - Any other data model thoughts?

[In case it's useful, current GLM datasets are a bit different and can be found here. Just navigate down to a recent dataset; selecting the "CdmRemote" access method will return the CDL for the dataset.]

JonathanGregory commented 4 years ago

Dear @ethanrd

I think you could regard this as a DSG with two element dimensions, like timeSeriesProfile and trajectoryProfile. In terms of Table 9.1, a flash is an instance i, with data(i,p,o), where p is the group and o the event. The event is located at x(i,p,o) y(i,p,o) t(i,p,o). The coordinates have different dimensionality from others in the table, so this is a new featureType. In addition to the event coordinates, you also supply representative data(i,p) x(i,p) y(i,p) t(i,p) for the group, and data(i) x(i) y(i) t(i) for the flash, but that doesn't affect the logical structure of the DSG.
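One way to picture the data(i,p,o) view is to unpack the two index variables into nested lists. The Python sketch below does this under the assumption of 0-based parent indices; the function name `nest` and the sample values are hypothetical, and it is meant only to illustrate the logical structure, not any proposed encoding.

```python
def nest(event_values, event_parent_group_index, group_parent_flash_index, n_flash):
    """Return nested[i][p][o]: flash i, its p-th group, that group's o-th event."""
    n_group = len(group_parent_flash_index)
    events_in_group = [[] for _ in range(n_group)]
    for value, g in zip(event_values, event_parent_group_index):
        events_in_group[g].append(value)
    nested = [[] for _ in range(n_flash)]
    for g, f in enumerate(group_parent_flash_index):
        nested[f].append(events_in_group[g])
    return nested


# Hypothetical sample: 5 events, 3 groups, 2 flashes.
data = nest(
    event_values=[1.0, 2.0, 3.0, 4.0, 5.0],
    event_parent_group_index=[0, 0, 1, 2, 2],
    group_parent_flash_index=[0, 1, 1],
    n_flash=2,
)
print(data)        # [[[1.0, 2.0]], [[3.0], [4.0, 5.0]]]
print(data[1][1])  # events of the second group in the second flash: [4.0, 5.0]
```

The result is ragged in both p and o, which is why a rectangular data(i,p,o) array would need padding and why an indexed representation is attractive here.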

It could be called a flash feature, but maybe that's too restrictive since it's possibly useful in other contexts. Maybe we could call it something like pointGroupGroup - you can imagine a pointGroup too, which isn't sufficient for your application.

Best wishes

Jonathan

davidhassell commented 4 years ago

Hi @ethanrd,

Thanks for this summary - very useful for getting into it.

I'm thinking that this fits in logically with cell methods.

It seems different to the current encoding of DSG that @JonathanGregory describes (for which thanks). It feels instead more like a modified cell methods, that re-uses some of the DSG machinery.

In your example, one would need to know what the relationship between two adjacent tree elements is. For example, you have connected a subset of groups to a particular flash, but I don't know how those groups have been combined to produce each flash value.

What I'm thinking of is cell methods modified such that in the usual "name: method", the name is actually another data variable, or other dimension, rather than a dimension of the data or a standard name. In essence, this would tell us that the data values in our data variable are in fact a function of the elements of another data variable (rather than a function of elements of a higher resolution version of itself, as is usual). This also provides a link from the data variable to the next child level in the tree, i.e. the adjacent one with finer granularity, although that link will also always be explicit through other data variable attributes.

In your CDL this could look something like:

  dimensions:
      event = 10 ;
      group = 6 ;
      flash = 2 ;

  variables:
      float event_lat(event) ;
      float event_lon(event) ;
      float event_time(event) ;
      float event_energy(event) ;
          event_energy:coordinates = "event_time event_lat event_lon" ;
          event_energy:cell_method = "event: point" ;

      int event_parent_group_index(event) ;
          event_parent_group_index:instance_dimension = "group" ;

      float group_lat(group) ;
      float group_lon(group) ;
      float group_time(group) ;
      float group_energy(group) ;     // Values are a function of one other data variable
          group_energy:coordinates = "group_time group_lat group_lon" ;
          group_energy:cell_method = "event_energy: mean" ;
          group_energy:child = "event_energy" ;       // Variable
      float group_area(group) ; // Values are NOT a function of a single other data variable
          group_area:coordinates = "group_time group_lat group_lon" ;
          group_area:cell_method = "area: sum (the area spanned by events)" ;
          group_area:child = "event" ;               // Dimension

      int group_parent_flash_index(group) ;
          group_parent_flash_index:instance_dimension = "flash" ;

      float flash_lat(flash) ;
      float flash_lon(flash) ;
      float flash_time(flash) ;
      float flash_energy(flash) ;
          flash_energy:coordinates = "flash_time flash_lat flash_lon" ;
          flash_energy:cell_method = "group_energy: mean" ;
          flash_energy:child = "group_energy" ;       // Variable
      float flash_area(flash) ;
          flash_area:coordinates = "flash_time flash_lat flash_lon" ;
          flash_area:cell_method = "group_area: sum" ;
          flash_area:child = "group_area" ;       // Variable

The first thing to note is that we may, quite correctly, not want or have actual cell method attributes to attach, so we still need an explicit indicator of the child (data variable or dimension), along the lines of the child attribute suggested in the CF-Tree Google Document.
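For the cases where a method such as "event_energy: mean" is given, a reader could in principle recompute or verify the parent values from the child variable. The following Python sketch shows that reduction for the "mean" and "sum" methods only. It assumes 0-based parent indices, and the function name `aggregate_children` and the sample values are hypothetical.

```python
from statistics import fmean


def aggregate_children(child_values, child_parent_index, n_parents, method):
    """Reduce child values to one value per parent using `method`.

    Parents that have no children get None rather than a computed value.
    """
    reducers = {"mean": fmean, "sum": sum}
    reduce = reducers[method]  # KeyError for an unsupported method
    buckets = [[] for _ in range(n_parents)]
    for value, parent in zip(child_values, child_parent_index):
        buckets[parent].append(value)
    return [reduce(b) if b else None for b in buckets]


# Hypothetical sample: four events in three groups (the last group is empty).
event_energy = [1.0, 3.0, 2.0, 4.0]
event_parent_group_index = [0, 0, 1, 1]

print(aggregate_children(event_energy, event_parent_group_index, 3, "mean"))
# [2.0, 3.0, None]
print(aggregate_children(event_energy, event_parent_group_index, 3, "sum"))
# [4.0, 6.0, None]
```

A case like group_area, whose values depend on the spatial extent of the events rather than on a single child variable, cannot be reproduced this way, which is why the explicit child link is still needed.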

This example would be interpreted as follows:

Data model

The data model would not be affected if this were only fancy compression (like DSG), but this doesn't look like that to me. Based on what I've been thinking above, this looks like it is straying into the realm of connecting two field constructs in such a way that one logically depends on the other. This would be a new concept. (It is also something that linking vector components would want to do, for example.)

sadielbartholomew commented 4 years ago

Hi all,

I've read through the proposal and comments and skim-read the detailed proposal. It seems to me that there are good motivations and benefits, though I am still getting my head around the specific nature of the proposal.

I thought I'd comment because there is a lot of text, with some CDL examples, to describe the proposal and the context that motivated it, without any diagrams except the figure from the paper Bruning et al. included in the external document which is specific to your use-case.

Personally I think it is easier to understand the meat of the proposal if it is abstracted out to not refer to any particular use context, and I think others may find that too (though I appreciate it is important to demonstrate why the idea would be helpful in practice). Therefore I think it would be very helpful if you could produce some sort of schematic diagram that covers the abstract nature of the proposal (no mention of GLM, flashes, etc.). The following:

They specify a one-to-many hierarchical relationship. This may be visualized as a directed graph down a tree.

implies the general idea so I suppose it could be a tree with labelling to describe for instance what can and cannot be inherited from parent to child, and/or what must be defined on a parent and on a child.

UML or similar for the inheritance details would also be really useful, if you were able to illustrate in that way.

Would it be possible for you to create a diagram to illustrate the key ideas in the abstract? I don't think it would need to be very detailed to be useful (at the very least to me!) Thanks.