COSIMA / cosima-cookbook

Framework for indexing and querying ocean-sea ice model output.
https://cosima-recipes.readthedocs.io/en/latest/
Apache License 2.0

Identify coordinates in database #216

Open aidanheerdegen opened 4 years ago

aidanheerdegen commented 4 years ago

The explorer uses a crude heuristic to identify coordinates so they can be hidden from view when selecting variables to load.

It would be better to identify coordinates from the metadata available when scanning data files and save this as a boolean in the ncvars table.

xarray.Dataset exposes coordinates through the .coords attribute of a dataset; this is the minimum set of variables that should be flagged as coordinates.
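As a minimal sketch of that baseline (using a tiny stand-in dataset, not real model output), the split that .coords gives for free looks like:

```python
import numpy as np
import xarray as xr

# Tiny stand-in dataset; the indexer would operate on the file being scanned.
ds = xr.Dataset(
    {"salt": (("time", "yt_ocean"), np.zeros((2, 3)))},
    coords={"time": [0, 1], "yt_ocean": [0.0, 1.0, 2.0]},
)

# ds.coords holds everything xarray already classifies as a coordinate,
# which is the minimum set to record as a boolean in the ncvars table.
is_coord = {name: name in set(ds.coords) for name in ds.variables}
```

Anything not caught here (average_T1, time_bounds, etc.) needs the attribute-based logic discussed below.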

There are other variables that xarray does not classify as coordinates, but which should be. For example, in MOM output average_T1, average_T2, average_DT and time_bounds contain ancillary coordinate data and should be flagged as coordinates.

Bounds variables can be identified by appearing in the bounds attribute of another variable. This is done in splitvar:

https://github.com/coecms/splitvar/blob/master/splitvar/utils.py#L75-L90
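A stripped-down version of that bounds check, operating on a plain mapping of variable name to attribute dict (a hypothetical shape, not the cookbook's actual data structures):

```python
def find_bounds_vars(attrs_by_var):
    """Return the set of variable names named in any 'bounds' attribute."""
    bounds = set()
    for attrs in attrs_by_var.values():
        if "bounds" in attrs:
            bounds.add(attrs["bounds"])
    return bounds

# Example: 'time' declares 'time_bounds' as its bounds variable.
attrs = {
    "time": {"bounds": "time_bounds"},
    "time_bounds": {},
    "salt": {"units": "psu"},
}
```

Any variable in the returned set would get the coordinate flag when written to ncvars.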

The other variables listed above also appear in the attributes of another variable, e.g.

        float salt(time, st_ocean, yt_ocean, xt_ocean) ;
                salt:long_name = "Practical Salinity" ;
                salt:units = "psu" ;
                salt:valid_range = -10.f, 100.f ;
                salt:missing_value = -1.e+20f ;
                salt:_FillValue = -1.e+20f ;
                salt:cell_methods = "time: mean" ;
                salt:time_avg_info = "average_T1,average_T2,average_DT" ;
                salt:coordinates = "geolon_t geolat_t" ;
                salt:standard_name = "sea_water_salinity" ;

They can be identified by adapting the code from splitvar that finds all the variables another variable "depends on": https://github.com/coecms/splitvar/blob/master/splitvar/splitvar.py#L226-L248
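Adapting that idea, a scan over the attributes that hold variable names might look like the sketch below. The set of attributes and their separators is an assumption based on the MOM/CICE examples in this issue, not an exhaustive list:

```python
import re

def referenced_vars(attrs):
    """Collect the variable names that one variable's attributes 'depend on'."""
    refs = set()
    # coordinates is space-separated, e.g. "geolon_t geolat_t"
    refs.update(attrs.get("coordinates", "").split())
    # time_avg_info is comma-separated, e.g. "average_T1,average_T2,average_DT"
    refs.update(n for n in attrs.get("time_avg_info", "").split(",") if n)
    # bounds names a single variable
    if "bounds" in attrs:
        refs.add(attrs["bounds"])
    # cell_measures looks like "area: tarea"; names follow the colons
    refs.update(re.findall(r":\s*(\w+)", attrs.get("cell_measures", "")))
    return refs

# The salt variable from the ncdump output above:
salt_attrs = {
    "time_avg_info": "average_T1,average_T2,average_DT",
    "coordinates": "geolon_t geolat_t",
}
```

The union of these sets over all variables in a file gives the candidates to flag as coordinates in the ncvars table.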

For the ice data, variables like TLON should be flagged as coordinates, and the logic above would also work, as they are listed in the coordinates attribute of other variables:

        float hi(time, nj, ni) ;
                hi:units = "m" ;
                hi:long_name = "grid cell mean ice thickness" ;
                hi:coordinates = "TLON TLAT time" ;
                hi:cell_measures = "area: tarea" ;
                hi:missing_value = 1.e+30f ;
                hi:_FillValue = 1.e+30f ;
                hi:cell_methods = "time: mean" ;
                hi:time_rep = "averaged" ;

It might be that the coordinates attribute should be a special case that is specifically searched for.

aidanheerdegen commented 4 years ago

If the proposal in #191 were taken up, it would overlap with this significantly.

aidanheerdegen commented 3 years ago

There is a lot of redundancy in recording the same dimensions/chunking in every NCFile:

    $ sqlite3 /g/data/ik11/databases/cosima_master.db
    SQLite version 3.36.0 2021-06-18 18:36:39
    Enter ".help" for usage hints.
    sqlite> select count(*) from (select dimensions, chunking from ncvars) t;
    8872212
    sqlite> select count(*) from (select distinct dimensions, chunking from ncvars) t;
    307
    sqlite>

I think this plays into #191: coordinates don't generally change with time (WRF is an exception), so it makes sense to store them in separate tables, give each a unique "grid" id, and just associate that grid id with each variable.

This isn't so much schema breaking as schema exploding.
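One way the "grid" idea could look, sketched here as hypothetical tables alongside ncvars (table and column names are illustrative only, not a proposed schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each distinct (dimensions, chunking) combination is stored once.
    CREATE TABLE grids (
        id INTEGER PRIMARY KEY,
        dimensions TEXT,
        chunking TEXT,
        UNIQUE (dimensions, chunking)
    );
    -- Variables reference a grid by id instead of repeating the strings.
    CREATE TABLE ncvars (
        id INTEGER PRIMARY KEY,
        name TEXT,
        grid_id INTEGER REFERENCES grids (id)
    );
""")

def grid_id(dimensions, chunking):
    """Insert-or-fetch the grid id for a (dimensions, chunking) pair."""
    conn.execute(
        "INSERT OR IGNORE INTO grids (dimensions, chunking) VALUES (?, ?)",
        (dimensions, chunking),
    )
    (gid,) = conn.execute(
        "SELECT id FROM grids WHERE dimensions = ? AND chunking = ?",
        (dimensions, chunking),
    ).fetchone()
    return gid
```

With this shape, the ~8.9 million ncvars rows above would all point at one of only ~307 grid rows.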

Note: we're not currently storing the actual size of dimensions, so a separate dimensions table makes sense. cf-xarray could be used to add X/lon, Y/lat, Z/depth categorisation, so grids could be classified as 2D/3D based on distinct dimension-id tuples.
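cf-xarray would infer the axis from CF metadata; as a rough stdlib-only stand-in, a name-based guess over the dimension names seen in this issue (the patterns below are assumptions, not CF logic) could be:

```python
import re

# Hypothetical name patterns covering MOM (xt_ocean, st_ocean) and
# CICE (ni, nj) dimension names; cf-xarray would use CF attributes instead.
AXIS_PATTERNS = {
    "X": re.compile(r"lon|^x|^ni$", re.IGNORECASE),
    "Y": re.compile(r"lat|^y|^nj$", re.IGNORECASE),
    "Z": re.compile(r"depth|lev|^st_|^sw_", re.IGNORECASE),
    "T": re.compile(r"time", re.IGNORECASE),
}

def guess_axis(dim_name):
    """Guess an X/Y/Z/T axis label for a dimension name, or None."""
    for axis, pattern in AXIS_PATTERNS.items():
        if pattern.search(dim_name):
            return axis
    return None
```

Dimensions tagged this way would let a grid be classified as 2D (X, Y) or 3D (X, Y, Z) from its distinct dimension-id tuple.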