aidanheerdegen commented 10 months ago

Is your feature request related to a problem? Please describe.

The resolution of a dataset is an important piece of information. It can be critical when searching for data to know the resolution, as the representation of physical processes is typically dictated by the resolution of a model. The information on the resolution of a dataset is encoded in the underlying grid coordinates.

Also, when comparing datasets, knowing which other datasets use the same grid is very useful, as it allows comparisons to be made without any time-consuming, and often technically demanding, regridding.

Describe the feature you'd like

I want everything, but would be content with a system that extracts grid information when the netCDF files are opened and inspected during the cataloguing process, and saves that grid information in a form that can be queried.

One suggestion is to save grid information into a complementary catalog or tool that can be queried independently of the main ACCESS-NRI Intake Catalog, somewhat similar to the variable suggester tool (#26).

There are a number of increasingly convoluted thought bubbles about how to uniquely identify grids in this issue, but the gist is this:

Assuming there are md5 checksum attributes on all coordinate variables (ideally added in an independent post-processing step), something like this pseudo-code:

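coordinates = {}  # md5 -> coordinate variable
grids = {}        # tuple of coordinate md5s -> grid_id

for variable in indexed_variables:
    grid = []  # reset once per variable, not once per coordinate
    for coord in variable.coords:
        md5 = coord.attrs['md5']
        if md5 not in coordinates:
            coordinates[md5] = coord
        grid.append(md5)
    grid = tuple(grid)  # a tuple is hashable, so it can be used as a key
    if grid not in grids:
        grids[grid] = len(grids)  # next unused integer id
    grid_id = grids[grid]
    dataset.add(grid_id)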

Where coordinates and grids would be serialised to a catalog that could be queried. The dataset catalog is already serialised, but the grid_ids found in dataset would be added to the dataset metadata.

So queries could be done to retrieve which datasets contained a given grid, and it should be possible to provide a function or the logic required to say if two datasets share a common grid.
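
As a rough, self-contained sketch of how those queries might look (this is not the catalog's implementation): the `GridRegistry` class, its method names and the `expt_a`/`expt_b` dataset names are all hypothetical, and coordinates are plain lists of floats hashed directly rather than read from `md5` attributes.

```python
import hashlib
from array import array


def coord_md5(values):
    """md5 of a coordinate's values, packed as 64-bit floats."""
    return hashlib.md5(array("d", values).tobytes()).hexdigest()


class GridRegistry:
    """Hypothetical in-memory stand-in for the serialised grid catalog."""

    def __init__(self):
        self.coordinates = {}  # md5 -> coordinate values
        self.grids = {}        # tuple of coordinate md5s -> grid_id
        self.datasets = {}     # dataset name -> set of grid_ids

    def add_variable(self, dataset, coords):
        """Register one variable; coords maps coordinate name -> values."""
        hashes = []
        # Coordinate order is taken as given, as in the pseudo-code above.
        for values in coords.values():
            digest = coord_md5(values)
            self.coordinates.setdefault(digest, list(values))
            hashes.append(digest)
        # Reuse the existing id for a known grid, otherwise assign the next one.
        grid_id = self.grids.setdefault(tuple(hashes), len(self.grids))
        self.datasets.setdefault(dataset, set()).add(grid_id)
        return grid_id

    def datasets_with_grid(self, grid_id):
        """Names of all datasets that contain the given grid."""
        return sorted(d for d, ids in self.datasets.items() if grid_id in ids)

    def common_grids(self, a, b):
        """Grid ids shared by two datasets (empty set if none)."""
        return self.datasets.get(a, set()) & self.datasets.get(b, set())


registry = GridRegistry()
lat = [-89.5, -88.5, 88.5, 89.5]
lon = [0.5, 1.5, 2.5]
g1 = registry.add_variable("expt_a", {"lat": lat, "lon": lon})
g2 = registry.add_variable("expt_b", {"lat": lat, "lon": lon})
g3 = registry.add_variable("expt_b", {"lat": lat, "lon": [0.0, 1.0, 2.0]})
print(registry.datasets_with_grid(g1))
print(registry.common_grids("expt_a", "expt_b"))
```

Exact hashing like this only matches bit-identical coordinates, so masked or slightly perturbed versions of a grid would still get separate ids.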

An issue for the MOM data is that masked data (which is most of it) also has masked coordinates. In an ideal world a post-processing step would fix that, but it hasn't been done in the past so will need to be supported.

In the linked COSIMA Cookbook issue I suggested we could augment the grid information with a metadata.yaml file that could give grids useful human-readable names, but also define relationships between grids, perhaps defining the grids with missing data to be equivalent to unmasked grids.

Apologies if I've used the wrong terminology above, e.g. datasets, and so made this needlessly confusing.

aidanheerdegen commented 10 months ago

@headmetal before I forget you might be interested in this feature, as it would allow your live stats tool to discover datasets that can be compared with the live-tracked model.

headmetal commented 10 months ago

> @headmetal before I forget you might be interested in this feature, as it would allow your live stats tool to discover datasets that can be compared with the live-tracked model.

Can we use the resolution information given to us in the metadata to do this?

Example metadata available from the access-nri-intake-catalog from the live diagnostics gui:

aidanheerdegen commented 10 months ago

> Can we use the resolution information given to us in the metadata to do this?

In some cases yes: assuming that information is supplied, you're working with a related set of experiments and the encoding is consistent, e.g. '1 degree' means the same thing across unrelated experiments.

The proposal above is partly about creating some standard names we can use to identify common grids. As an example, CICE has some common grids defined:

https://cice-consortium-cice.readthedocs.io/en/cice6.1.0/user_guide/ug_case_settings.html?highlight=gx1#table-of-cice-settings

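| Setting | Value | Description |
| --- | --- | --- |
| ICE_GRID | string (see below) | grid |
| | gx3 | 3-deg displace-pole (Greenland) global grid |
| | gx1 | 1-deg displace-pole (Greenland) global grid |
| | tx1 | 1-deg tripole global grid |
| | gbox80 | 80x80 box |
| | gbox128 | 128x128 box |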

It is typical for ocean/ice models to use Arakawa grids, e.g. MOM5 is a B-grid model and MOM6 is C-grid. This means there are intersecting grids in the models. In some cases there are diagnostics that are output on either the tracer grid (T-grid) or the velocity grid (U-grid). There are even some diagnostics that have a hybrid of the grids, with one coordinate on the T-grid and the other on the U-grid.

In the case of diagnostics with mixed coordinates, if there is a reduction along one of the horizontal spatial coordinates, e.g. a mean, then it would be fine to do operations with grids that match its remaining spatial coordinate.

If there was a way of matching those coordinates then compatibility could be automatically discovered/determined.
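
A minimal sketch of what that per-coordinate matching could look like, assuming each variable is described by one coordinate identifier per horizontal dimension. The `compatible` function is hypothetical, and MOM-style coordinate names stand in for what would in practice be checksums.

```python
def compatible(a, b, reduced_dims=()):
    """True if two variables share identical coordinates on every
    dimension that has not been reduced (e.g. by a mean)."""
    def remaining(coords):
        return {dim: cid for dim, cid in coords.items() if dim not in reduced_dims}
    return remaining(a) == remaining(b)


# Illustrative coordinate identifiers per horizontal dimension.
tracer = {"x": "xt_ocean", "y": "yt_ocean"}    # T-grid diagnostic
velocity = {"x": "xu_ocean", "y": "yu_ocean"}  # U-grid diagnostic
mixed = {"x": "xt_ocean", "y": "yu_ocean"}     # hybrid diagnostic

print(compatible(mixed, tracer))                         # full 2D fields
print(compatible(mixed, velocity, reduced_dims=("x",)))  # after a zonal reduction
print(compatible(mixed, tracer, reduced_dims=("y",)))    # after a meridional reduction
```

Comparing per-dimension identifiers rather than a single whole-grid hash is what lets the hybrid diagnostic match the U-grid once `x` is reduced and the T-grid once `y` is reduced.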

dougiesquire commented 10 months ago

I like this idea. There are a few fiddly bits that come to mind that I'm noting down while they're in my head.

It could be difficult to robustly identify which coordinates on a variable are related to the grid and which aren't. As an example, consider the CICE output files, e.g.:
```
In [1]: ds = xr.open_dataset("/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output000/ice/OUTPUT/iceh.1900-01.nc")

In [2]: ds["aice_m"].coords
Out[2]: 
Coordinates:
 * time     (time) object 1900-02-01 00:00:00
   TLON     (nj, ni) float32 ...
   TLAT     (nj, ni) float32 ...
   ULON     (nj, ni) float32 ...
   ULAT     (nj, ni) float32 ...
```
Only TLON and TLAT specify the grid for this variable. Yes, in this particular case the relevant spatial coordinate names are available in the metadata, but in the general case it could be difficult to extract grid-relevant coordinates.
Grid information is often stored in a separate file. E.g., MOM output files often only include 1D spatial "pseudo"-coordinates (e.g. xt_ocean, yt_ocean etc) with the true 2D grid info stored elsewhere. This is really a problem for the post-processing tool that generates the checksums: the checksum attributes on the 1D "pseudo"-coordinates should be computed from the true 2D coordinates.
The value here is the ability to find experiments on the same grid. As noted, it would be nice to be able to search on a particular grid, in which case a standard naming convention is helpful/needed. But as noted elsewhere the "same grid" will have different hashes due to floating point differences, masking etc. I'm therefore confused about what this naming convention would look like. Something like the CICE convention would result in a one to many relationship between grid names and grid hashes? How can we validate this relationship? (I'm possibly/probably misunderstanding something here).

aidanheerdegen commented 10 months ago

> Only TLON and TLAT specify the grid for this variable. Yes, in this particular case the relevant spatial coordinate names are available in the metadata, but in the general case it could be difficult to extract grid-relevant coordinates.

cf_xarray is your friend

In [5]: ds = xr.open_dataset("/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output000/ice/OUTPUT/iceh.1900-01.nc")

In [6]: ds["aice_m"].cf
Out[6]: 
Coordinates:
             CF Axes:   X, Y, Z, T: n/a

      CF Coordinates:   longitude: ['TLON']
                        latitude: ['TLAT']
                        vertical, time: n/a

       Cell Measures:   area, volume: n/a

      Standard Names:   n/a

              Bounds:   n/a

       Grid Mappings:   n/a

I would not suggest doing this without something like cf_xarray to do the inspection (even though I failed to suggest using it above).

> Grid information is often stored in a separate file. E.g., MOM output files often only include 1D spatial "pseudo"-coordinates (e.g. xt_ocean, yt_ocean etc) with the true 2D grid info stored elsewhere. This is really a problem for the post-processing tool that generates the checksums: the checksum attributes on the 1D "pseudo"-coordinates should be computed from the true 2D coordinates.

I don't think I would recommend that approach. Instead I would say build into the grid information tool the idea of a hierarchy of grids, so you can say one grid is equivalent to another, but they have some "quality metric" to say one is superior, i.e. unmasked. Also it should be possible to map from a 1D to a 2D curvilinear grid and request the "best quality" grid with the required dimensionality.
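
To make the hierarchy idea a little more concrete, here is a small sketch under some assumptions: equivalence is expressed by grids sharing a curated name, "quality" is a plain integer, and the `GridRecord` fields, the `best_grid` function and all hashes and names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridRecord:
    grid_hash: str  # checksum-derived identifier
    name: str       # curated, human-readable grid name
    ndim: int       # 1 for 1D "pseudo"-coordinates, 2 for curvilinear
    quality: int    # higher is better, e.g. unmasked > masked


def best_grid(records, grid_hash, ndim):
    """Best-quality grid equivalent to grid_hash with the requested
    dimensionality, or None if no such grid has been recorded."""
    by_hash = {r.grid_hash: r for r in records}
    name = by_hash[grid_hash].name
    candidates = [r for r in records if r.name == name and r.ndim == ndim]
    return max(candidates, key=lambda r: r.quality, default=None)


records = [
    GridRecord("a1", "example_tripole", ndim=2, quality=2),  # unmasked 2D
    GridRecord("b2", "example_tripole", ndim=2, quality=1),  # masked 2D
    GridRecord("c3", "example_tripole", ndim=1, quality=1),  # 1D pseudo-coordinates
    GridRecord("d4", "other_grid", ndim=2, quality=2),
]

# Starting from the 1D pseudo-coordinate grid, ask for the best 2D equivalent.
print(best_grid(records, "c3", ndim=2))
```

The mapping from hashes to names would be the curated part; the lookup itself is mechanical.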

> Something like the CICE convention would result in a one to many relationship between grid names and grid hashes? How can we validate this relationship?

See above. It would require some manual intervention at some point. When a new grid is encountered it might need some inspection to see if it is just another version of an existing well-known grid, so that the mapping could be added. Off the top of my head I can think of some heuristics to check if the mappings could be done semi-automatically, e.g.

- if a new grid has missing values, cycle through possible matching grids (e.g. same dimensions), mask both grids by the missing values and calculate hashes
- calculate minimum/maximum values orthogonal to each direction (ignoring missing values) and check for the number of matching values, or do a fuzzy match within a very small tolerance and check for the number of matching values
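
A simplified sketch of those two heuristics, assuming coordinates are flattened to lists of floats with NaN marking missing values. The function names and the tolerance are hypothetical, and the second heuristic is reduced here to an element-wise fuzzy match rather than a comparison of per-direction minima/maxima.

```python
import hashlib
import math
from array import array


def matches_after_masking(candidate, known):
    """Heuristic 1: apply the candidate's missing-value mask to both
    grids and compare hashes of the values that remain."""
    if len(candidate) != len(known):
        return False
    mask = [math.isnan(v) for v in candidate]

    def digest(values):
        kept = [v for v, masked in zip(values, mask) if not masked]
        return hashlib.md5(array("d", kept).tobytes()).hexdigest()

    return digest(candidate) == digest(known)


def fuzzy_match_fraction(candidate, known, tol=1e-6):
    """Heuristic 2 (simplified): fraction of the candidate's non-missing
    values that equal the known grid's values within an absolute tolerance."""
    pairs = [(c, k) for c, k in zip(candidate, known) if not math.isnan(c)]
    if not pairs:
        return 0.0
    hits = sum(math.isclose(c, k, rel_tol=0.0, abs_tol=tol) for c, k in pairs)
    return hits / len(pairs)


known = [0.5, 1.5, 2.5, 3.5]
masked = [0.5, math.nan, 2.5, 3.5]          # same grid with one masked point
perturbed = [0.5, math.nan, 2.5000001, 3.5]  # masked plus a tiny float difference

print(matches_after_masking(masked, known))
print(matches_after_masking(perturbed, known))
print(fuzzy_match_fraction(perturbed, known))
```

A hash match after masking is strong evidence of equivalence; a high fuzzy-match fraction is weaker and would be a prompt for manual inspection rather than an automatic mapping.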

dougiesquire commented 10 months ago

> I don't think I would recommend that approach. Instead I would say build into the grid information tool the idea of a hierarchy of grids, so you can say one grid is equivalent to another, but they have some "quality metric" to say one is superior, i.e. unmasked. Also it should be possible to map from a 1D to a 2D curvilinear grid and request the "best quality" grid with the required dimensionality.

The 1D coordinates don't uniquely describe the 2D coordinates. Is it safe to assume they do?

aidanheerdegen commented 10 months ago

> The 1D coordinates don't uniquely describe the 2D coordinates. Is it safe to assume they do?

Safe enough for our purposes I think. Those connections will mostly be done in a curated way, not automatically, so we'd only be concerned about false positives, and I think they'd be fairly unlikely, and mostly harmless if they did occur.

anton-seaice commented 3 months ago

Does specifying the "coordinates" attribute make this easier? (e.g. https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#_two_dimensional_latitude_longitude_coordinate_variables)

e.g. following the cice example, aice_m in om2 has:

 ds.aice_m
<xarray.DataArray 'aice_m' (time: 1, nj: 2700, ni: 3600)>
[9720000 values with dtype=float32]
Coordinates:
  * time     (time) object 1900-02-01 00:00:00
    TLON     (nj, ni) float32 ...
    TLAT     (nj, ni) float32 ...
    ULON     (nj, ni) float32 ...
    ULAT     (nj, ni) float32 ...
Dimensions without coordinates: nj, ni
Attributes:
    units:          1
    long_name:      ice area  (aggregate)
    cell_measures:  area: tarea
    cell_methods:   time: mean
    time_rep:       averaged

But if we add to the attributes: coordinates : TLON TLAT

then the grid should be uniquely defined?
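
For illustration, a tiny sketch of how a cataloguing step might use that attribute, given that CF defines `coordinates` as a blank-separated list of variable names. The `grid_coordinate_names` helper is hypothetical and works on a plain attribute dict rather than on an opened file.

```python
def grid_coordinate_names(attrs, available):
    """Coordinate names listed in the CF 'coordinates' attribute.

    Falls back to every available coordinate when the attribute is
    absent, in which case the grid remains ambiguous."""
    listed = attrs.get("coordinates", "").split()
    return [name for name in listed if name in available] or list(available)


available = ["time", "TLON", "TLAT", "ULON", "ULAT"]

with_attr = {"cell_measures": "area: tarea", "coordinates": "TLON TLAT"}
without_attr = {"cell_measures": "area: tarea"}

print(grid_coordinate_names(with_attr, available))
print(grid_coordinate_names(without_attr, available))
```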

ACCESS-NRI / access-nri-intake-catalog

Encoding grid information #112
