Open aidanheerdegen opened 10 months ago
@headmetal before I forget you might be interested in this feature, as it would allow your live stats tool to discover datasets that can be compared with the live-tracked model.
Can we use the resolution information given to us in the metadata to do this?
Example metadata available from the access-nri-intake-catalog, and from the live diagnostics GUI: [screenshots not reproduced]
> Can we use the resolution information given to us in the metadata to do this?
In some cases, yes: assuming that information is supplied, that you're working with a related set of experiments, and that the encoding is consistent, i.e. that '1 degree' means the same thing across unrelated experiments.
The proposal above is partly about creating some standard names we can use to identify common grids. As an example, CICE has some common grids defined (the ICE_GRID variable):

| ICE_GRID | grid |
| --- | --- |
| gx3 | 3-deg displace-pole (Greenland) global grid |
| gx1 | 1-deg displace-pole (Greenland) global grid |
| tx1 | 1-deg tripole global grid |
| gbox80 | 80x80 box |
| gbox128 | 128x128 box |
It is typical that the ocean/ice models use Arakawa grids, e.g. MOM5 is a B-grid model and MOM6 is C-grid. This means there are intersecting grids within the models. In some cases there are diagnostics that are output on either the tracer grid (T-grid) or the velocity grid (U-grid). There are even some diagnostics that have a hybrid of the grids, with one coordinate on the T-grid and the other on the U-grid.
In the case of diagnostics with mixed coordinates, if there is a reduction along one of the horizontal spatial coordinates, e.g. a mean, then it would be fine to do operations with grids that match its remaining spatial coordinate.
If there was a way of matching those coordinates then compatibility could be automatically discovered/determined.
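As a toy illustration of the reduction point above (all arrays and names here are synthetic stand-ins, not real model output): after a mean along one horizontal axis, only the surviving coordinate has to match for two diagnostics to be comparable.

```python
import numpy as np

# Hypothetical 1D coordinate arrays standing in for T- and U-grid axes.
xt = np.linspace(0.5, 359.5, 360)   # T-grid longitudes
yu = np.linspace(-89.0, 89.0, 180)  # U-grid latitudes

# A mixed-coordinate diagnostic with dims (yu, xt): one U-grid axis,
# one T-grid axis.
mixed = np.random.default_rng(0).random((yu.size, xt.size))

# Reduce (mean) along the U-grid latitude axis; only xt remains.
zonal_profile = mixed.mean(axis=0)

# A diagnostic defined purely on the T-grid, reduced the same way.
t_var = np.random.default_rng(1).random((180, xt.size))
t_profile = t_var.mean(axis=0)

def share_remaining_coord(c1, c2):
    """Compatibility check: surviving coordinate arrays must match exactly."""
    return c1.shape == c2.shape and np.array_equal(c1, c2)

# Both profiles now live on xt alone, so they can be compared directly.
assert share_remaining_coord(xt, xt)
print(zonal_profile.shape, t_profile.shape)  # both (360,)
```

Matching here is by exact coordinate values; in practice the matching would presumably use the coordinate checksums discussed later in this issue.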
I like this idea. There are a few fiddly bits that come to mind that I'm noting down while they're in my head.
It could be difficult to robustly identify which coordinates on a variable are related to the grid and which aren't. As an example, consider the CICE output files, e.g.:
```python
In [1]: ds = xr.open_dataset("/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output000/ice/OUTPUT/iceh.1900-01.nc")

In [2]: ds["aice_m"].coords
Out[2]:
Coordinates:
  * time    (time) object 1900-02-01 00:00:00
    TLON    (nj, ni) float32 ...
    TLAT    (nj, ni) float32 ...
    ULON    (nj, ni) float32 ...
    ULAT    (nj, ni) float32 ...
```
Only `TLON` and `TLAT` specify the grid for this variable. Yes, in this particular case the relevant spatial coordinate names are available in the metadata, but in the general case it could be difficult to extract grid-relevant coordinates.
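A toy sketch of why this is hard (coordinate names and dims follow the CICE example above; the dicts are illustrative, not real metadata): a naive rule based on matching dimensions alone cannot separate the T-grid coordinates from the U-grid ones.

```python
# Dims of the variable and of each coordinate, as name -> tuple of dims.
var_dims = ("time", "nj", "ni")
coords = {
    "time": ("time",),
    "TLON": ("nj", "ni"), "TLAT": ("nj", "ni"),
    "ULON": ("nj", "ni"), "ULAT": ("nj", "ni"),
}

# Naive rule: a coordinate is "grid-relevant" if its dims are exactly
# the variable's horizontal dims.
naive = [name for name, dims in coords.items() if dims == ("nj", "ni")]
print(naive)  # picks up ULON/ULAT too, though only TLON/TLAT define the grid
```

So some extra metadata (CF attributes, as discussed below) is needed to disambiguate.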
Grid information is often stored in a separate file. E.g., MOM output files often only include 1D spatial "pseudo"-coordinates (e.g. `xt_ocean`, `yt_ocean` etc.) with the true 2D grid info stored elsewhere. This is really a problem for the post-processing tool that generates the checksums: the checksum attributes on the 1D "pseudo"-coordinates should be computed from the true 2D coordinates.
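A minimal sketch of that idea (the coordinate names are the MOM-style ones mentioned above; the arrays and the meshgrid stand-in for a real grid file are fabricated for illustration):

```python
import hashlib
import numpy as np

def md5_of(arr):
    """md5 of an array's raw bytes, after normalising dtype and memory order."""
    canon = np.ascontiguousarray(arr, dtype=np.float64)
    return hashlib.md5(canon.tobytes()).hexdigest()

# Stand-ins: 1D pseudo-coordinates as found in a MOM output file ...
xt_ocean = np.linspace(0.5, 359.5, 360)
yt_ocean = np.linspace(-89.5, 89.5, 180)

# ... and the true 2D curvilinear coordinates from a separate grid file
# (here just a meshgrid for illustration).
geolon_t, geolat_t = np.meshgrid(xt_ocean, yt_ocean)

# The checksum attached to each 1D pseudo-coordinate is computed from the
# corresponding true 2D coordinate, so any two datasets that share the
# same grid file end up with matching checksum attributes.
checksums = {
    "xt_ocean": md5_of(geolon_t),
    "yt_ocean": md5_of(geolat_t),
}
```

Normalising dtype and memory layout before hashing matters: otherwise identical grids stored as float32 vs float64 would hash differently.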
The value here is the ability to find experiments on the same grid. As noted, it would be nice to be able to search on a particular grid, in which case a standard naming convention is helpful/needed. But, as noted elsewhere, the "same grid" will have different hashes due to floating-point differences, masking etc. I'm therefore confused about what this naming convention would look like. Something like the CICE convention would result in a one-to-many relationship between grid names and grid hashes? How can we validate this relationship? (I'm possibly/probably misunderstanding something here.)
> Only `TLON` and `TLAT` specify the grid for this variable. Yes, in this particular case the relevant spatial coordinate names are available in the metadata, but in the general case it could be difficult to extract grid-relevant coordinates.
cf_xarray is your friend
```python
In [5]: ds = xr.open_dataset("/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output000/ice/OUTPUT/iceh.1900-01.nc")

In [6]: ds["aice_m"].cf
Out[6]:
Coordinates:
    CF Axes: X, Y, Z, T: n/a
    CF Coordinates: longitude: ['TLON']
                    latitude: ['TLAT']
                    vertical, time: n/a
    Cell Measures: area, volume: n/a
    Standard Names: n/a
    Bounds: n/a
    Grid Mappings: n/a
```
I would not suggest doing this without something like cf_xarray to do the inspection (even though I failed to suggest using it above).
> Grid information is often stored in a separate file. E.g., MOM output files often only include 1D spatial "pseudo"-coordinates (e.g. `xt_ocean`, `yt_ocean` etc.) with the true 2D grid info stored elsewhere. This is really a problem for the post-processing tool that generates the checksums: the checksum attributes on the 1D "pseudo"-coordinates should be computed from the true 2D coordinates.
I don't think I would recommend that approach. Instead I would say build into the grid information tool the idea of a hierarchy of grids, so you can say one grid is equivalent to another, but they have some "quality metric" to say one is superior, i.e. unmasked. Also it should be possible to map from a 1D to a 2D curvilinear grid and request the "best quality" grid with the required dimensionality.
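A minimal sketch of the "hierarchy of grids" idea: each grid record carries a quality score and an optional pointer to an equivalent grid, so a lookup can walk the equivalence links and return the best-quality representative. All names, fields and scores here are illustrative, not an agreed schema.

```python
# Hypothetical grid registry: name -> {quality, equivalent_to}.
grids = {
    "t_1deg":        {"quality": 2, "equivalent_to": None},      # unmasked 2D grid
    "t_1deg_masked": {"quality": 1, "equivalent_to": "t_1deg"},  # masked variant
    "t_1deg_1d":     {"quality": 0, "equivalent_to": "t_1deg"},  # 1D pseudo-coords
}

def best_equivalent(name):
    """Follow equivalence links and return the highest-quality grid."""
    seen, best = set(), name
    while name is not None and name not in seen:
        seen.add(name)
        if grids[name]["quality"] > grids[best]["quality"]:
            best = name
        name = grids[name]["equivalent_to"]
    return best

print(best_equivalent("t_1deg_1d"))  # → t_1deg (the unmasked 2D grid wins)
```

The `seen` set guards against accidental cycles in the equivalence links, which matters once the registry is hand-curated.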
> Something like the CICE convention would result in a one-to-many relationship between grid names and grid hashes? How can we validate this relationship?
See above. It would require some manual intervention at some point. When a new grid is encountered it might need some inspection to see if it is just another version of an existing well-known grid, so that mapping could be added. Off the top of my head I can think of some heuristics to check if the mappings could be done semi-automatically, e.g.
> I don't think I would recommend that approach. Instead I would say build into the grid information tool the idea of a hierarchy of grids, so you can say one grid is equivalent to another, but they have some "quality metric" to say one is superior, i.e. unmasked. Also it should be possible to map from a 1D to a 2D curvilinear grid and request the "best quality" grid with the required dimensionality.
The 1D coordinates don't uniquely describe the 2D coordinates. Is it safe to assume they do?
> The 1D coordinates don't uniquely describe the 2D coordinates. Is it safe to assume they do?
Safe enough for our purposes, I think. Those connections will mostly be made in a curated way, not automatically, so we'd only be concerned about false positives, and I think they'd be fairly unlikely, and mostly harmless if they did occur.
Does specifying the "coordinates" attribute make this easier? (e.g. https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#_two_dimensional_latitude_longitude_coordinate_variables)
E.g., following the CICE example, `aice_m` in OM2 has:
```python
ds.aice_m
<xarray.DataArray 'aice_m' (time: 1, nj: 2700, ni: 3600)>
[9720000 values with dtype=float32]
Coordinates:
  * time    (time) object 1900-02-01 00:00:00
    TLON    (nj, ni) float32 ...
    TLAT    (nj, ni) float32 ...
    ULON    (nj, ni) float32 ...
    ULAT    (nj, ni) float32 ...
Dimensions without coordinates: nj, ni
Attributes:
    units:          1
    long_name:      ice area (aggregate)
    cell_measures:  area: tarea
    cell_methods:   time: mean
    time_rep:       averaged
```
But if we add to the attributes:

```
coordinates: TLON TLAT
```

then the grid should be uniquely defined?
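If the CF `coordinates` attribute names the grid-defining coordinates, picking them out of the full coordinate set becomes trivial. A sketch (the names follow the CICE example above; the `attrs` dict is illustrative, not read from a real file):

```python
# Hypothetical variable attributes and the full set of coordinate names.
attrs = {"coordinates": "TLON TLAT", "cell_measures": "area: tarea"}
all_coords = {"time", "TLON", "TLAT", "ULON", "ULAT"}

# The grid-relevant coordinates are exactly those named in the attribute.
grid_coords = [c for c in attrs["coordinates"].split() if c in all_coords]
print(grid_coords)  # → ['TLON', 'TLAT']
```

Note that xarray normally consumes the `coordinates` attribute at decode time, so in practice this inspection would likely go through something like cf_xarray rather than raw attribute parsing.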
Is your feature request related to a problem? Please describe.
The resolution of a dataset is an important piece of information. It can be critical when searching for data to know the resolution, as the representation of physical processes is typically dictated by the resolution of a model. The information on the resolution of a dataset is encoded in the underlying grid coordinates.
Also, when comparing datasets, knowing which other datasets use the same grid is very useful information, as it allows comparisons to be made without any time-consuming, and often technically demanding, regridding.
Describe the feature you'd like
I want everything, but would be content with a system that extracts grid information when the netCDF files are opened and inspected during the cataloguing process, and saves that grid information in a form that can be queried.
One suggestion is to save grid information into a complementary catalog or tool that can be queried independently of the main ACCESS-NRI Intake Catalog, somewhat similar to the variable suggester tool (#26)
There are a number of increasingly convoluted thought bubbles about how to uniquely identify grids in this issue, but the gist is this:
Assuming there are `md5` checksum attributes for all coordinate variables (ideally an independent post-processing step), something like this pseudo-code: [pseudo-code not reproduced]

Where `coordinates` and `grids` would be serialised to a catalog that could be queried. The `dataset` catalog is already serialised, but the `grid_id`s found in `dataset` would be added to `dataset` metadata. So queries could be done to retrieve which datasets contain a given `grid`, and it should be possible to provide a function, or the logic required, to say if two datasets share a common grid.

An issue for the MOM data is that masked data (which is most of it) also has masked coordinates. In an ideal world a post-processing step would fix that, but it hasn't been done in the past, so it will need to be supported.
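The pseudo-code referenced above did not survive extraction; a hedged sketch of what it might look like follows. The names `coordinates`, `grids` and `grid_id` come from the text; everything else (the combining scheme, the placeholder checksums) is an assumption.

```python
import hashlib

def grid_id(coord_checksums):
    """Combine per-coordinate md5 checksums into one stable grid id.

    Sorting by coordinate name makes the id independent of the order in
    which coordinates were encountered.
    """
    blob = "|".join(f"{name}:{md5}" for name, md5 in sorted(coord_checksums.items()))
    return hashlib.md5(blob.encode()).hexdigest()

# Hypothetical per-coordinate checksum attributes, as produced by an
# independent post-processing step (placeholder values, not real md5s).
coordinates = {"TLON": "aaa111", "TLAT": "bbb222"}

# grids: grid_id -> the coordinate names that define that grid; this is
# what would be serialised alongside the catalog.
grids = {}
gid = grid_id(coordinates)
grids.setdefault(gid, sorted(coordinates))

# Two datasets share a grid iff their coordinate checksums yield the same id.
assert grid_id({"TLAT": "bbb222", "TLON": "aaa111"}) == gid
```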
In the linked COSIMA Cookbook issue I suggested we could augment the grid information with a `metadata.yaml` file that could give grids useful human-readable names, but also define relationships between grids, perhaps defining the grids with missing data to be equivalent to unmasked grids.

Apologies if I've used the wrong terminology above, e.g. datasets, and so made this needlessly confusing.
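As a sketch of what such a `metadata.yaml` might contain (every name and field here is hypothetical, not an agreed schema):

```yaml
grids:
  - name: om2_01deg_t
    description: ACCESS-OM2 0.1 degree tracer grid
    hashes:
      - <md5-of-unmasked-coordinates>
  - name: om2_01deg_t_masked
    description: same grid with land-masked coordinates
    equivalent_to: om2_01deg_t
    quality: lower
    hashes:
      - <md5-of-masked-coordinates>
```

The `equivalent_to`/`quality` fields encode the grid hierarchy discussed above, and the list of hashes captures the one-to-many relationship between a named grid and its concrete checksums.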