Closed charles-turner-1 closed 1 month ago
I also think this might be the same issue as discussed in https://github.com/ACCESS-NRI/access-nri-intake-catalog/issues/63? @aidanheerdegen - there seem to be some concerns about coordinates being listed as variables when they shouldn't be there?
That was specifically for a project where we were using the intake catalogue as a source for an "experiment explorer", to expose the variables saved in an experiment on a timeline and assist users in understanding what variables are available at different times in an experiment. For this purpose we really only wanted diagnostic model variables that have a time-varying component.
Add separate coordinate variable fields to the ACCESS-NRI Intake Catalog
I'm confused. Does this mean
BTW this is a somewhat related issue I think about encoding grid information:
https://github.com/ACCESS-NRI/access-nri-intake-catalog/issues/112
Yup, a coordinate flag, sorry for the confusion with terms.
As currently implemented, this would just add coordinates & coordinate relevant information to the catalogue - I've tried to keep changes to a minimum. We could think about having a separate catalogue specifically for coordinates but that feels potentially a bit harder for users to me?
I've attached a screencap with a demo of how the change affects catalogue results (N.B. both are run using the same kernel, so `cat_without_coords.search(coords='st_edges_ocean')` returns an empty search - running it on main would raise an error).
Also would be interested in knowing if this would cover your use cases @marc-white @anton-seaice
@charles-turner-1 my main concern to this point hasn't been how to search for the coordinates, but how to access them. In particular:

1. When you call `to_dask`, you get back a Dask array of the requested variables, plus the coordinates that the variable relates to. Would this work in reverse? So, if I search for a coordinate, and then convert to an array, would the Dask array include the coordinate, and then the related variables, and then the other coordinates that those variables rely on?
2. `intake-esm` crashing when it tries to pull out the coordinate data by stacking it on itself (in effect). How could we modify things to know to treat variables and coordinates differently in this regard?

@marc-white In response to question 1, currently searching for a coordinate will load the entire dataset - it works very much like searching other metadata, eg.
```python
>>> ocean_files = cat_with_coords.search(start_date='2086.*', frequency='1mon', file_id='ocean').to_dask()
>>> print(ocean_files)
<xarray.Dataset> Size: 847GB
Dimensions:             (time: 12, st_ocean: 75, yt_ocean: 2700,
                         xt_ocean: 3600, yu_ocean: 2700, xu_ocean: 3600,
                         sw_ocean: 75, potrho: 80, grid_yt_ocean: 2700,
                         grid_xu_ocean: 3600, grid_yu_ocean: 2700,
                         grid_xt_ocean: 3600, neutral: 80, nv: 2,
                         st_edges_ocean: 76, sw_edges_ocean: 76,
                         potrho_edges: 81, neutralrho_edges: 81)
Coordinates: (12/18)
  * xt_ocean            (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * yt_ocean            (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
  * st_ocean            (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
  * st_edges_ocean      (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
  * time                (time) object 96B 2086-01-16 12:00:00 ... 2086-12-...
  * nv                  (nv) float64 16B 1.0 2.0
    ...                  ...
  * potrho              (potrho) float64 640B 1.028e+03 ... 1.038e+03
  * potrho_edges        (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
  * grid_xt_ocean       (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
  * grid_yu_ocean       (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 90.0
  * neutral             (neutral) float64 640B 1.028e+03 ... 1.038e+03
  * neutralrho_edges    (neutralrho_edges) float64 648B 1.028e+03 ... 1.03...
Data variables: (12/28)
    temp                (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    pot_temp            (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    salt                (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    age_global          (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    u                   (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    v                   (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    ...                  ...
    bih_fric_v          (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    u_dot_grad_vert_pv  (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    average_T1          (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    average_T2          (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    average_DT          (time) timedelta64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    time_bounds         (time, nv) timedelta64[ns] 192B dask.array<chunksize=(1, 2), meta=np.ndarray>
Attributes: (12/22)
    filename:                               ocean.nc
    title:                                  ACCESS-OM2-01
    grid_type:                              mosaic
    grid_tile:                              1
    intake_esm_vars:                        ['temp', 'pot_temp', 'salt', 'a...
    intake_esm_attrs:filename:              ocean.nc
    ...                                     ...
    intake_esm_attrs:coord_calendar_types:  ['', '', '', '', 'NOLEAP', '', ...
    intake_esm_attrs:coord_bounds:          ['', '', '', '', 'time_bounds',...
    intake_esm_attrs:coord_units:           ['degrees_E', 'degrees_N', 'met...
    intake_esm_attrs:realm:                 ocean
    intake_esm_attrs:_data_format_:         netcdf
    intake_esm_dataset_key:                 ocean.1mon
```
I like the suggestion that we only load the coordinate, its related data variables, and the other coordinate variables those data variables depend on, though. I'll have a look into how straightforward that might be to implement.
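For what it's worth, xarray already tracks which coordinates a given data variable depends on, which is the information such a filter would need. A minimal sketch (the toy dataset and the `coords_for` helper are mine, not the catalogue's):

```python
import numpy as np
import xarray as xr

# Toy stand-in for a catalogue result: one data variable plus a
# coordinate (st_edges_ocean) that no data variable depends on.
ds = xr.Dataset(
    {"temp": (("time", "st_ocean"), np.zeros((2, 3)))},
    coords={
        "time": [0, 1],
        "st_ocean": [0.5, 1.7, 2.9],
        "st_edges_ocean": ("st_edges", [0.0, 1.1, 2.3, 3.5]),
    },
)

def coords_for(ds, var):
    """Names of the coordinates a single data variable actually depends on."""
    return set(ds[var].coords)

print(coords_for(ds, "temp"))  # {'time', 'st_ocean'} (set order may vary)
```

Note that `st_edges_ocean` is excluded automatically, because its dimension is not one of `temp`'s dimensions.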
```python
>>> st_edges_search = cat_with_coords.search(coords='st_edges_ocean', frequency='1mon', file_id='ocean', start_date='2086.*')
>>> print(st_edges_search.df.head(3))
   filename file_id                                               path  \
0  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...
1  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...
2  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...

  filename_timestamp frequency            start_date              end_date  \
0                NaN      1mon  2086-01-01, 00:00:00  2086-04-01, 00:00:00
1                NaN      1mon  2086-04-01, 00:00:00  2086-07-01, 00:00:00
2                NaN      1mon  2086-07-01, 00:00:00  2086-10-01, 00:00:00

>>> st_edges = st_edges_search.to_dask()
>>> print(st_edges)
<xarray.Dataset> Size: 847GB
Dimensions:             (time: 12, st_ocean: 75, yt_ocean: 2700,
                         xt_ocean: 3600, yu_ocean: 2700, xu_ocean: 3600,
                         sw_ocean: 75, potrho: 80, grid_yt_ocean: 2700,
                         grid_xu_ocean: 3600, grid_yu_ocean: 2700,
                         grid_xt_ocean: 3600, neutral: 80, nv: 2,
                         st_edges_ocean: 76, sw_edges_ocean: 76,
                         potrho_edges: 81, neutralrho_edges: 81)
Coordinates: (12/18)
  * xt_ocean            (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * yt_ocean            (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
  * st_ocean            (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
  * st_edges_ocean      (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
  * time                (time) object 96B 2086-01-16 12:00:00 ... 2086-12-...
  * nv                  (nv) float64 16B 1.0 2.0
    ...                  ...
  * potrho              (potrho) float64 640B 1.028e+03 ... 1.038e+03
  * potrho_edges        (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
  * grid_xt_ocean       (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
  * grid_yu_ocean       (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 90.0
  * neutral             (neutral) float64 640B 1.028e+03 ... 1.038e+03
  * neutralrho_edges    (neutralrho_edges) float64 648B 1.028e+03 ... 1.03...
Data variables: (12/28)
    temp                (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
    pot_temp            (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
    salt                (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
    age_global          (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
    u                   (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array
    v                   (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array
    ...                  ...
    bih_fric_v          (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array
    u_dot_grad_vert_pv  (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
    average_T1          (time) datetime64[ns] 96B dask.array
    average_T2          (time) datetime64[ns] 96B dask.array
    average_DT          (time) timedelta64[ns] 96B dask.array
    time_bounds         (time, nv) timedelta64[ns] 192B dask.array
Attributes: (12/22)
    filename:                               ocean.nc
    title:                                  ACCESS-OM2-01
    grid_type:                              mosaic
    grid_tile:                              1
    intake_esm_vars:                        ['temp', 'pot_temp', 'salt', 'a...
    intake_esm_attrs:filename:              ocean.nc
    ...                                     ...
    intake_esm_attrs:coord_calendar_types:  ['', '', '', '', 'NOLEAP', '', ...
    intake_esm_attrs:coord_bounds:          ['', '', '', '', 'time_bounds',...
    intake_esm_attrs:coord_units:           ['degrees_E', 'degrees_N', 'met...
    intake_esm_attrs:realm:                 ocean
    intake_esm_attrs:_data_format_:         netcdf
    intake_esm_dataset_key:                 ocean.1mon
```

As before, searching for a variable will only load relevant data, eg:

```python
>>> pot_temp_search = cat_with_coords.search(variable='pot_temp', frequency='1mon', file_id='ocean', start_date='2086.*')
>>> print(pot_temp_search.df.head(3))
   filename file_id                                               path  \
0  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...
1  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...
2  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...

  filename_timestamp frequency            start_date              end_date  \
0                NaN      1mon  2086-01-01, 00:00:00  2086-04-01, 00:00:00
1                NaN      1mon  2086-04-01, 00:00:00  2086-07-01, 00:00:00
2                NaN      1mon  2086-07-01, 00:00:00  2086-10-01, 00:00:00

...

>>> pot_temp = pot_temp_search.to_dask()
>>> print(pot_temp)
<xarray.Dataset> Size: 35GB
Dimensions:   (time: 12, st_ocean: 75, yt_ocean: 2700, xt_ocean: 3600)
Coordinates:
  * xt_ocean  (xt_ocean) float64 29kB -279.9 -279.8 -279.7 ... 79.75 79.85 79.95
  * yt_ocean  (yt_ocean) float64 22kB -81.11 -81.07 -81.02 ... 89.89 89.94 89.98
  * st_ocean  (st_ocean) float64 600B 0.5413 1.681 2.94 ... 5.511e+03 5.709e+03
  * time      (time) object 96B 2086-01-16 12:00:00 ... 2086-12-16 12:00:00
Data variables:
    pot_temp  (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
Attributes: (12/22)
    filename:                               ocean.nc
    title:                                  ACCESS-OM2-01
    grid_type:                              mosaic
    grid_tile:                              1
    intake_esm_vars:                        ['pot_temp']
    intake_esm_attrs:filename:              ocean.nc
    ...                                     ...
    intake_esm_attrs:coord_calendar_types:  ['', '', '', '', 'NOLEAP', '', ...
    intake_esm_attrs:coord_bounds:          ['', '', '', '', 'time_bounds',...
    intake_esm_attrs:coord_units:           ['degrees_E', 'degrees_N', 'met...
    intake_esm_attrs:realm:                 ocean
    intake_esm_attrs:_data_format_:         netcdf
    intake_esm_dataset_key:                 ocean.1mon
```
Instead of this:
```python
st_edges_search = cat_with_coords.search(coords='st_edges_ocean', frequency='1mon', file_id='ocean', start_date='2086.*')
st_edges = st_edges_search.to_dask()
```
We should just be able to do:
```python
cat_with_coords.search(coords='st_edges_ocean').to_dask()
```
(possibly the `file_id` part is still needed)
`st_edges_ocean` doesn't change over time, so we should check that all files have the same values, but having to supply the `start_date='2086.*'` is messy. And then calling `to_dask` would ideally only return the coordinates (without the data_vars), e.g.
or
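The "check that all files have the same values" step mentioned above could look something like this sketch (illustrative only; `coord_consistent` is a hypothetical helper built on xarray's `DataArray.identical`):

```python
import xarray as xr

def coord_consistent(datasets, name):
    """True if every dataset carries an identical copy of the named coordinate."""
    first = datasets[0][name]
    return all(first.identical(ds[name]) for ds in datasets[1:])

# Two toy "files" sharing the same depth-edge coordinate.
edges = [0.0, 1.1, 2.3]
a = xr.Dataset(coords={"st_edges_ocean": edges})
b = xr.Dataset(coords={"st_edges_ocean": edges})
print(coord_consistent([a, b], "st_edges_ocean"))  # True
```

If the check passes, any one file's copy of the coordinate could be returned without needing a `start_date` filter.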
I think the most straightforward solution here is going to be to get searches on coordinates to drop all data variables by default.
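As a rough illustration of what "drop all data variables" could mean at the xarray level (a sketch, not the actual branch implementation):

```python
import numpy as np
import xarray as xr

def coords_only(ds):
    """Return the dataset with every data variable dropped."""
    return ds.drop_vars(list(ds.data_vars))

# Toy dataset with one data variable and two coordinates.
ds = xr.Dataset(
    {"temp": (("time",), np.zeros(3))},
    coords={"time": [0, 1, 2], "nv": [1.0, 2.0]},
)
print(list(coords_only(ds).data_vars))  # []
print(sorted(coords_only(ds).coords))   # ['nv', 'time']
```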
This makes sense to me; generally you are searching for a coordinate when trying to load it from a different file than where the variable is anyway, so you'll need a different search & `.to_dask` operation. E.g., you might want `SST` and `area_t`, and this is where I'm not sure they should have their own field:
e.g. `cat.search(variable='SST', coords='area_t')` would return 0 datasets?
And we would have to do `cat.search(variable='SST')` and `cat.search(coords='area_t')` to return two datasets that then need to be combined? But a different implementation would allow `cat.search(variable=['area_t', 'SST'])` to return the two datasets we want?
I spent a good chunk of time thinking about this yesterday afternoon. I've got a couple of thoughts:
1. `cat.search(variable='SST', coords='area_t')` returns nothing.
2. In a search like this:

> And we would have to do `cat.search(variable='SST')` and `cat.search(coords='area_t')` to return two datasets that then need to be combined? But a different implementation would allow `cat.search(variable=['area_t', 'SST'])` to return the two datasets we want?
we're going to obtain two datasets from the search anyway, right? And so downstream of the search we're necessarily going to have something like:

```python
sst, area = datasets.get('SST'), datasets.get('area_t')
```

which the user will have to merge anyway. So perhaps the file structure enforces a bit of unavoidable clunkiness in these situations?
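Assuming the two searches each yield a dataset on the same grid, the user-side combination is a one-liner with `xr.merge` (toy data below; `SST` and `area_t` as in the example above):

```python
import numpy as np
import xarray as xr

# Toy stand-ins for the two search results (grids deliberately shared).
sst = xr.Dataset({"SST": (("yt", "xt"), np.ones((2, 2)))},
                 coords={"yt": [0, 1], "xt": [0, 1]})
area = xr.Dataset({"area_t": (("yt", "xt"), np.full((2, 2), 4.0))},
                  coords={"yt": [0, 1], "xt": [0, 1]})

combined = xr.merge([sst, area])
print(sorted(combined.data_vars))  # ['SST', 'area_t']
```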
The major difficulty (that I'm thinking about right now) is that adding coordinate variables to the variables we search leads to intake-esm telling xarray to request that variable when it loads the dataset. I'm looking into whether we can add a step to check whether variables are data or coordinate variables before intake-esm actually opens and begins to concatenate the datasets.
If not, I'm not sure that there's a straightforward solution - bundling coordinate variables in with data variables effectively loses the information that the two are different and need to be treated differently, until we actually open the files.
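A hedged sketch of such a pre-open check: peek at one dataset and partition the requested names, so that only genuine data variables get passed on as variables to request (names here are illustrative, not intake-esm's API):

```python
import numpy as np
import xarray as xr

def split_names(ds, names):
    """Partition requested names into (data variables, coordinate variables)."""
    data = [n for n in names if n in ds.data_vars]
    coords = [n for n in names if n in ds.coords]
    return data, coords

# Toy dataset: SST is a data variable, area_t a coordinate.
ds = xr.Dataset({"SST": (("xt",), np.zeros(2))},
                coords={"xt": [0, 1], "area_t": ("xt", [4.0, 4.0])})
print(split_names(ds, ["SST", "area_t"]))  # (['SST'], ['area_t'])
```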
TL;DR: I'm gonna keep prodding - I agree that it would be nice to be able to search for coordinates *or* data variables, rather than coordinates *and* data variables.
> we're going to obtain two datasets from the search anyway, right? And so downstream of the search we're necessarily going to have something like:
Ah yes - I think my point was just ... do we need an extra field, can we just put the coordinates in the variable field?
> I'm looking into whether we can add a step to check whether variables are data or coordinate variables before intake-esm actually opens and begins to concatenate the datasets.
I think we need this for #117 anyway, right?
I see your point - I think if we can put the coordinates into the variable field then that would be preferable - it doesn't add any user complexity.
And yeah, looks like #117 is the same issue. In fact, it looks like the cosima_cookbook `n=1` argument @dougiesquire mentions there is the workaround that the cosima_cookbook used - which makes me suspect this problem is hiding in a few other places too.
**Is your feature request related to a problem? Please describe.**
Currently, the ACCESS-NRI Intake Catalog doesn't allow for searching of coordinate variables: for example, searching for `st_edges_ocean` will return 0 datasets. This can make searching for coordinate variables difficult, with 2 main pain points:

1. If the coordinate variable is known to be stored in a netCDF file with specific data variables:

   Although this doesn't require the user to (semi) manually work out what file to open, it's still messy as it requires passing round file names.
2. In other instances, coordinate variables are stored completely separately. For example, `ocean_grid.nc` files only contain coordinate variables, and so cannot be found using the catalogue. The only way to currently access these files is to search the catalogue to get a handle on the directory structure - and then construct a file path and load it, eg:

   This requires the user to start poking round in directory structures to try to work out where to load their data - which is the problem intake is trying to solve.
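For illustration, that manual workaround looks something like the following sketch (the path and directory layout here are hypothetical):

```python
import os

# Hypothetical: a path obtained from a catalogue search on a variable that
# *is* findable; the coordinate-only grid file lives alongside it.
found_path = "/g/data/ik11/outputs/access-om2-01/some_experiment/output000/ocean/ocean.nc"
grid_path = os.path.join(os.path.dirname(found_path), "ocean_grid.nc")
# xr.open_dataset(grid_path) would then load the coordinate-only file.
print(grid_path)
```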
This has caused some pain points migrating COSIMA recipes from cosima_cookbook => intake.
**Describe the feature you'd like**

Searchable coordinates: in the same way that the catalog currently lets you perform searches over variables, it would be useful to be able to do the same on coordinates:

Doing this is subject to a couple of constraints:

- `xr.combine_by_coords` will fail if passed a coordinate variable.

**Proposed Solution**
(`data_vars` => `variables`), as this would then confuse coordinates & variables in the ACCESS-NRI Intake Catalog as well as causing concatenation issues. This is implemented on branch `660-coordinate-variables`.

**Additional Info**
`datastore.csv.gz` files written by `builder.save()` are typically approximately doubled in size.