Open aidanheerdegen opened 4 years ago
If the proposal in #191 were taken up, it would overlap significantly with this issue.
There is a lot of redundancy in recording the same dimensions/chunking in every NCFile
$ sqlite3 /g/data/ik11/databases/cosima_master.db
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite> select count(*) from (select dimensions, chunking from ncvars) t;
8872212
sqlite> select count(*) from (select distinct dimensions, chunking from ncvars) t;
307
sqlite>
I think this plays into #191: coordinates don't (generally; WRF is an exception) change with time, so it makes sense to store them in separate tables, give each a unique "grid" id, and just associate the grid id with a variable.
This isn't so much schema breaking as schema exploding.
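A minimal sketch of what the "grid id" idea could look like, using an in-memory SQLite database. The grids table, the grid_id and varname columns, and the get_grid_id helper are all hypothetical names, and the dimension/chunking strings are made up for illustration; none of this is the existing cookbook schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical table: one row per distinct (dimensions, chunking) pair.
    CREATE TABLE grids (
        id INTEGER PRIMARY KEY,
        dimensions TEXT NOT NULL,
        chunking TEXT NOT NULL,
        UNIQUE (dimensions, chunking)
    );
    -- Simplified stand-in for ncvars: it stores only a reference to the grid.
    CREATE TABLE ncvars (
        id INTEGER PRIMARY KEY,
        varname TEXT NOT NULL,
        grid_id INTEGER NOT NULL REFERENCES grids (id)
    );
    """
)


def get_grid_id(conn, dimensions, chunking):
    """Return the id for a (dimensions, chunking) pair, inserting it if new."""
    conn.execute(
        "INSERT OR IGNORE INTO grids (dimensions, chunking) VALUES (?, ?)",
        (dimensions, chunking),
    )
    row = conn.execute(
        "SELECT id FROM grids WHERE dimensions = ? AND chunking = ?",
        (dimensions, chunking),
    ).fetchone()
    return row[0]


# Illustrative records only: (variable name, dimensions, chunking).
scanned = [
    ("var_a", "('time', 'z', 'y', 'x')", "(1, 10, 100, 100)"),
    ("var_b", "('time', 'z', 'y', 'x')", "(1, 10, 100, 100)"),
    ("var_c", "('time', 'y', 'x')", "(1, 100, 100)"),
]
for varname, dimensions, chunking in scanned:
    conn.execute(
        "INSERT INTO ncvars (varname, grid_id) VALUES (?, ?)",
        (varname, get_grid_id(conn, dimensions, chunking)),
    )

n_vars = conn.execute("SELECT count(*) FROM ncvars").fetchone()[0]
n_grids = conn.execute("SELECT count(*) FROM grids").fetchone()[0]
print(n_vars, n_grids)
```

With the three illustrative variables this stores two grid rows, which is the same kind of reduction the counts above suggest for the real database. Unchunked variables would need a sentinel value rather than NULL, since SQLite treats NULLs as distinct in a UNIQUE constraint.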
Note: we're not currently storing the actual size of dimensions, so a separate dimensions table makes sense. cf-xarray could be used to add X/lon, Y/lat, Z/depth categorisation. Grids could then be classed as 2D/3D based on distinct dimension id tuples.
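As a rough sketch of the dimension-tuple idea: each distinct (name, size, axis) triple gets a dimension id, and a grid is the tuple of dimension ids. The GridRegistry class and its methods are hypothetical, the axis labels are hard-coded here as a stand-in for whatever cf-xarray would report, and the names and sizes are made up.

```python
class GridRegistry:
    """Hand out ids for distinct dimensions and for distinct dimension tuples."""

    def __init__(self):
        self.dimension_ids = {}  # (name, size, axis) -> dimension id
        self.grid_ids = {}       # tuple of dimension ids -> grid id

    def dimension_id(self, name, size, axis=None):
        key = (name, size, axis)
        # setdefault only stores the new id if the key has not been seen.
        return self.dimension_ids.setdefault(key, len(self.dimension_ids) + 1)

    def grid_id(self, dims):
        """dims is a sequence of (name, size, axis) triples."""
        key = tuple(self.dimension_id(*dim) for dim in dims)
        return self.grid_ids.setdefault(key, len(self.grid_ids) + 1)

    @staticmethod
    def spatial_rank(dims):
        """Count the dimensions tagged as X, Y or Z (2 for 2D, 3 for 3D)."""
        return sum(1 for _, _, axis in dims if axis in ("X", "Y", "Z"))


registry = GridRegistry()

# Illustrative dimensions only; axis labels are assumed to come from metadata.
surface = [("time", 12, "T"), ("lat", 300, "Y"), ("lon", 360, "X")]
full_depth = [("time", 12, "T"), ("depth", 50, "Z"), ("lat", 300, "Y"), ("lon", 360, "X")]

surface_grid = registry.grid_id(surface)
full_depth_grid = registry.grid_id(full_depth)
surface_again = registry.grid_id(surface)
```

Including the size in the key means two files with the same dimension names but different resolutions end up on different grids, which is the information that is not being recorded at the moment.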
The explorer uses a crude heuristic to identify coordinates so they can be hidden from view when selecting variables to load.
It would be better to identify coordinates from the metadata available when scanning data files and save this as a boolean in the ncvars table.
xarray.Dataset identifies coordinates in the .coords attribute of a dataset, which is the minimum for identifying coordinate variables.
There are other variables that are not classified by xarray as coordinates, but which should be. For example, in the MOM outputs average_T1, average_T2, average_DT and time_bounds contain ancillary coordinate data and should be flagged as coordinates.
Bounds variables can be identified by appearing in the bounds attribute of another variable. This is done in splitvar: https://github.com/coecms/splitvar/blob/master/splitvar/utils.py#L75-L90
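The bounds check could be sketched roughly as below. This is not the splitvar code, just an illustration of the same idea on a plain mapping of variable name to attribute dict; find_bounds_variables is a hypothetical helper and the example attributes are assumed rather than taken from a real file.

```python
def find_bounds_variables(var_attrs):
    """Return the names of variables named in another variable's bounds attribute.

    var_attrs maps each variable name to a dict of its attributes.
    """
    bounds = set()
    for attrs in var_attrs.values():
        target = attrs.get("bounds")
        # Only accept targets that actually exist as variables in the file.
        if isinstance(target, str) and target in var_attrs:
            bounds.add(target)
    return bounds


# Illustrative attributes only.
example_attrs = {
    "time": {"bounds": "time_bounds"},
    "time_bounds": {},
    "temp": {"units": "degrees K"},
}
bounds_vars = find_bounds_variables(example_attrs)
```

With real files the mapping could be built while scanning, from the attributes of each variable in the dataset.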
The other variables listed above are also present in an attribute of another variable, so they can be identified by adapting the code from splitvar, which tries to find all variables that another variable "depends on": https://github.com/coecms/splitvar/blob/master/splitvar/splitvar.py#L226-L248
For the ice data, variables like TLON should be flagged as coordinates, and the logic above would also work, as they are listed in the coordinates attributes of other variables.
It might be that the coordinates attribute should be a special case that is specifically searched for.
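A sketch of both options, again on a plain mapping of variable name to attributes: a general scan of every string attribute for names of other variables, and a restricted scan of the coordinates attribute only. referenced_variables is a hypothetical helper, not the splitvar implementation, and hypothetical_attr and the variable aice are placeholders because the actual attribute and data variables are not shown above.

```python
def referenced_variables(var_attrs, attr_names=None):
    """Return variables whose names appear in string attributes of other variables.

    var_attrs maps variable name -> dict of attributes. If attr_names is given,
    only those attributes are searched (e.g. {"coordinates"}).
    """
    names = set(var_attrs)
    found = set()
    for var, attrs in var_attrs.items():
        for key, value in attrs.items():
            if attr_names is not None and key not in attr_names:
                continue
            if not isinstance(value, str):
                continue
            # Accept both space-separated and comma-separated lists of names.
            tokens = value.replace(",", " ").split()
            found.update(t for t in tokens if t in names and t != var)
    return found


# Illustrative attributes only; "hypothetical_attr" is a placeholder name.
example_attrs = {
    "aice": {"coordinates": "TLON"},
    "TLON": {"units": "degrees_east"},
    "temp": {"hypothetical_attr": "average_T1,average_T2"},
    "average_T1": {},
    "average_T2": {},
}

coords_only = referenced_variables(example_attrs, {"coordinates"})
all_referenced = referenced_variables(example_attrs)
```

The general scan can give false positives when a free-text attribute such as a long name happens to contain a word that is also a variable name, which is one argument for treating coordinates (and bounds) as special cases.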