Uniquely identifying grids

It would be good to have a mechanism to uniquely identify (hash) grids for a number of reasons:

Disambiguate data that are are in most other ways the same, e.g. regional output. Usually they would differ in some other meaningful way, for example in frequency, but it seems obvious that the querying should be able to easily distinguish between output on different grids (see https://github.com/COSIMA/cosima-cookbook/issues/137).
Automatically identify data that can be compared. Outputs from the same model are an obvious example. This is being addressed with keywords (see https://github.com/COSIMA/cosima-cookbook/issues/185 and https://github.com/COSIMA/cosima-cookbook/pull/190) but it should be possible automatically by matching grids. Another case is finding otherwise dissimilar data (observations) on the same grid.
Find canonical grid information. When attempting to plot ocean and ice output data it was not straightforward to find unmasked grid coordinates for selection and plotting (see https://github.com/COSIMA/cosima-recipes/blob/master/DocumentedExamples/Spatial_selection.ipynb). Clean, unmasked, canonical grid information could be stored in one location in the database and accessed based on a grid hash.

I did some preliminary calculations to see how feasible it would be to just compare hashes of grids

https://gist.github.com/aidanheerdegen/dd427bba8b104600a1c44b5046bfdfc9

TL;DR it isn't really feasible. Small differences at floating point accuracy and masking of land cells makes it difficult to do this automatically.

HOWEVER all is not lost. It should be possible to hash all grids. The same grids in different models have consistent hash values. This means it should be possible to create a table where relationships between grids can be specified. It shouldn't be onerous to maintain, as grids change fairly infrequently, and any data with the same grids can automatically make use of the mapping information.

The hardest part is decided how to encode this information, as we're not expecting users to break out sqlite3 and start editing tables, in fact this is expressly discouraged.

Proposal:

Create a new DB table for storing coordinate information. Something like this:

CREATE TABLE nccoords (
        id INTEGER NOT NULL,
        variable_id INTEGER NOT NULL,
        size VARCHAR,
        checksum VARCHAR,
        PRIMARY KEY (id),                                      
        FOREIGN KEY(variable_id) REFERENCES variables (id)                                         
);

where checksum is an md5 checksum of the entire coordinate variable.

Also create another new table grids where each grid entry is a unique combination of coordinates from nccoords and is something like this:

CREATE TABLE grids (
        id INTEGER NOT NULL,
        grid VARCHAR NOT NULL,
        checksum VARCHAR,
        description TEXT,
        PRIMARY KEY (id),                                                                       
);

where grid is a unique name used to identify the grid. checksum I'm not sure about. It needs to be a single unique value that will be the same if the DB is regenerated, so id is not sufficient. Naively I'd just add the checksum values from all the associated coordinates together. That approach is a disaster cryptographically speaking, but that doesn't matter for this checksum.

grids would have an association table linking each grid to the nccoords for that grid. Not sure whether ordering of coordinates would be important. The grid id would replace dimensions in ncvars:

CREATE TABLE ncvars (
        id INTEGER NOT NULL, 
        ncfile_id INTEGER NOT NULL, 
        variable_id INTEGER NOT NULL, 
        grid_id INTEGER NOT NULL, 
        chunking VARCHAR, 
        PRIMARY KEY (id), 
        FOREIGN KEY(ncfile_id) REFERENCES ncfiles (id), 
        FOREIGN KEY(variable_id) REFERENCES variables (id)
        FOREIGN KEY(grid_id) REFERENCES grids (id)
);

This affects #216 as I coordinates could be identified by appearing in the grids/nccoords association table.

Populating the metadata in the grids table could be done from a yaml file in a similar way to the experiment metadata using the grid checksum to associate metadata with a grid. By default it could be the concatenation of all the coordinate long_name attributes (from the variable they reference).

It would be good to be able to create relationships between grids. The ocean data has a nominal rectilinear grid that is included in data files, but often has missing_value over land. If the full grid was available in the DB it would be good to be able to use that semi-automatically by defining them as being equivalent and preferring a grid which has no missing values, or assigned some higher quality factor.

There are issues with the ice data. The files as they come out of CICE do not have CF convention spatial coordinates. For example:

dimensions:
        ni = 3600 ;
        nj = 2700 ;
        time = UNLIMITED ; // (1 currently)
variables:
        float TLON(nj, ni) ;
                TLON:long_name = "T grid center longitude" ;
                TLON:units = "degrees_east" ;
                TLON:missing_value = 1.e+30f ;
                TLON:_FillValue = 1.e+30f ;
        float TLAT(nj, ni) ;
                TLAT:long_name = "T grid center latitude" ;
                TLAT:units = "degrees_north" ;
                TLAT:missing_value = 1.e+30f ;
                TLAT:_FillValue = 1.e+30f ;
        float ULON(nj, ni) ;
                ULON:long_name = "U grid center longitude" ;
                ULON:units = "degrees_east" ;
                ULON:missing_value = 1.e+30f ;
                ULON:_FillValue = 1.e+30f ;
        float ULAT(nj, ni) ;
                ULAT:long_name = "U grid center latitude" ;
                ULAT:units = "degrees_north" ;
                ULAT:missing_value = 1.e+30f ;
                ULAT:_FillValue = 1.e+30f ;
                ULAT:comment = "Latitude of NE corner of T grid cell" ;
        float aice(time, nj, ni) ;
                aice:units = "1" ;
                aice:long_name = "ice area  (aggregate)" ;
                aice:coordinates = "TLON TLAT time" ;
                aice:cell_measures = "area: tarea" ;
                aice:missing_value = 1.e+30f ;
                aice:_FillValue = 1.e+30f ;
                aice:cell_methods = "time: mean" ;

The aice variable just has nj and ni as the latitude and longitude dimensions, with no matching coordinate variable. It does have a coordinates attribute which could be used to infer the coordinates and therefore the grid. That attribute is removed by decode_cf=True so it is available in the encoding attribute:

> ds.aice.encoding
{'zlib': True,
 'shuffle': False,
 'complevel': 1,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (1, 675, 900),
 'source': '/g/data/ik11/outputs/access-om2-01/01deg_jra55v140_iaf/output200/ice/OUTPUT/iceh.2008-01-01.nc',
 'original_shape': (1, 2700, 3600),
 'dtype': dtype('float32'),
 'missing_value': 1e+30,
 '_FillValue': 1e+30,
 'coordinates': 'TLON TLAT time'}

It may be necessary to use this, as xarray usually lists both T and U grid coordinates in the .coords attribute, so this can't necessarily be relied upon.

The files as they come out of CICE do not have CF convention spatial coordinates

Just testing my understand of CF conventions, but is this true? ULAT and TLAT both satisfy requirements for a latitude coordinate (and same thing for ULON and TLON), by providing the units attribute. Because these coordinates depend on more than one spatiotemporal dimension, they fall under this paragraph of the coordinate systems section:

Any longitude, latitude, vertical or time coordinate which depends on more than one spatiotemporal dimension must be identified by the coordinates attribute of the data variable. The value of the coordinates attribute is a blank separated list of the names of auxiliary coordinate variables.

Indeed, the following paragraph indicates that the CICE behaviour may actually be useful:

We recommend that the name of a multidimensional coordinate variable should not match the name of any of its dimensions because that precludes supplying a coordinate variable for the dimension. This practice also avoids potential bugs in applications that determine coordinate variables by only checking for a name match between a dimension and a variable and not checking that the variable is one dimensional.

So I think it will have to be a combination of both approaches that you've suggested!

Oh right, excellent!

When referring to CF conventions I was thinking of the 1D case where dimension and coordinate are named the same, but didn't know about the convention for 2D coords.

I don't know if it is a problem that the MOM data comes with 1D coordinates (with the real 2D curvilinear coordinates in another file) and CICE defaults to 2D. The idea is to have a way to tell the DB that grids are equivalent.

The proposal above sort of assumed that a coordinate was 1D and a grid was made up of multiple 1D coordinates. If size were changed to shape and was then a python tuple then it would support 1D and 2D coordinates.

Was thinking about this after noticing the redundancy in storing dimensions/chunking in the DB (https://github.com/COSIMA/cosima-cookbook/issues/216#issuecomment-914822478).

So how about having a table storing dimensions:

CREATE TABLE dimensions (
        id INTEGER NOT NULL,
        name VARCHAR,
        size INTEGER,
        PRIMARY KEY (id),                            
);

and another for coordinates

CREATE TABLE coordinates (
        id INTEGER NOT NULL,
        name VARCHAR,
        dimension_id INTEGER NOT NULL,
        axis VARCHAR,
        checksum VARCHAR,
        PRIMARY KEY (id),                                      
        FOREIGN KEY(dimension_id) REFERENCES dimensions (id)                                         
);
CREATE UNIQUE INDEX ix_coordinates_name_dimension_id_checksum ON grids (name, dimension_id, checksum);

where axis is a spatial dimension type and use the CF Convention names and checksum is an md5 checksum of the entire coordinate variable.

The attribute axis may be attached to a coordinate variable and given one of the values X, Y, Z or T which stand for a longitude, latitude, vertical, or time axis respectively.

Note: this is 1D dimensions only for illustrative purposes. nD dimensions (2D at least) should be supported for MOM's tripolar curvilinear grid. Not sure the best way to do that.

Note: including checksum will guarantee uniqueness, where there might be dimensions with the same name and size but differing axis points. Unfortunately it will also create false positives for small machine precision and round-off errors. Not sure how to deal with that.

Defining grids should be done to be as useful to us as possible, which means we can do it in a way that is helpful by assuming/asserting some structure. So define a grid to be, fundamentally, 2D. Depth and time coordinates are then an attribute of the variable. A 3D ocean variable with a time dimension shares the same "grid" as a 2D surface ocean variable. This is useful because you immediately know they can be used in calculations without any regridding being necessary. Also important and useful because it becomes obvious when variables defined on different grids, e.g. U vs T grids in MOM5.

So with that in mind can have a grid table that is just unique combinations of X and Y dimensions with an optional descriptive name

CREATE TABLE grids (
        id INTEGER NOT NULL,
        name VARCHAR,
        x_dimension_id INTEGER NOT NULL,
        y_dimension_id INTEGER NOT NULL,
        PRIMARY KEY (id),                                      
        FOREIGN KEY(x_dimension_id) REFERENCES dimensions (id) 
        FOREIGN KEY(y_dimension_id) REFERENCES dimensions (id)                                            
);
CREATE UNIQUE INDEX ix_grids_x_dimension_id_y_dimension_id ON grids (x_dimension_id, y_dimension_id);

Once you have named grids you have the possibility to generate and store remapping weights to make regridding easier and potentially almost transparent to users.

As above, an optional grid_id would replace dimensions in ncvars, and each ncvar also has the option of a Z and T dimension:

CREATE TABLE ncvars (
        id INTEGER NOT NULL, 
        ncfile_id INTEGER NOT NULL, 
        variable_id INTEGER NOT NULL, 
        grid_id INTEGER, 
        z_axis_id INTEGER,
        t_axis_id INTEGER,
        chunk_id INTEGER,
        PRIMARY KEY (id), 
        FOREIGN KEY(ncfile_id) REFERENCES ncfiles (id), 
        FOREIGN KEY(variable_id) REFERENCES variables (id)
        FOREIGN KEY(grid_id) REFERENCES grids (id)
        FOREIGN KEY(z_axis_id) REFERENCES coordinates (id)
        FOREIGN KEY(t_axis_id) REFERENCES coordinates (id)
        FOREIGN KEY(chunk_id) REFERENCES chunking (id)
);
CREATE UNIQUE INDEX ix_ncvars_ncfile_id_variable_id ON ncvars (ncfile_id, variable_id);

To avoid duplication of chunking information it also contains chunk_id which references a table of chunking tuples

CREATE TABLE chunking (
        id INTEGER NOT NULL, 
        chunks VARCHAR,
        PRIMARY KEY (id), 
);
CREATE UNIQUE INDEX ix_chunking_chunks ON chunking (chunks);

(I have liberally sprinkled indexes through this, but don't assume they are correct!)

The proposal to rigidly define grids has the downside that some variables (hopefully few) won't necessarily fit this model. Grid IDs aren't compulsory, but there is no fallback in that case. How that is handled is more about the logic in the explorer/API routines.

As well as uniquely identifying each grid, it would also be helpful to record provenance information, i.e. how each grid was generated. For example, a URL to a commit in a github repo of the specific grid generation script that was used.

COSIMA / cosima-cookbook

Uniquely identifying grids #191