ACCESS-NRI / access-nri-intake-catalog

Tools and configuration info used to manage ACCESS-NRI's intake catalogue
https://access-nri-intake-catalog.rtfd.io
Apache License 2.0

Making coordinate variables searchable #201

Closed charles-turner-1 closed 1 month ago

charles-turner-1 commented 1 month ago

Is your feature request related to a problem? Please describe.

Currently, the ACCESS-NRI Intake Catalog doesn't allow for searching of coordinate variables: for example, searching for st_edges_ocean will return 0 datasets. This can make searching for coordinate variables difficult, with 2 main pain points:

  1. If the coordinate variable is known to be stored in a netCDF file with specific data variables:

    • The user needs to know that these data & coordinate variables are found in the same files, and then search for the data variable in order to access the coordinate variables.
    • In some instances, the user can directly access the coordinate variables by searching the data variables. In others, they need to perform something like

      import xarray as xr

      VARNAME = ...  # a data variable known to live in the same file as the coordinate
      COORD = ...    # the coordinate variable actually wanted
      fname = cat.search(variable=VARNAME).df.loc[0,'file']
      ds = xr.open_dataset(fname)
      coord = ds[COORD]

      Although this doesn't require the user to (semi) manually work out what file to open, it's still messy as it requires passing around file names.

  2. In other instances, coordinate variables are stored completely separately. For example, ocean_grid.nc files only contain coordinate variables, and so cannot be found using the catalogue. The only way to currently access these files is to search the catalogue to get a handle on the directory structure - and then construct a file path and load it: eg:

    
        import os
        from pathlib import Path
        import xarray as xr

        fname = cat.search(variable=VARNAME).df.loc[0,'file']
        dirname = Path(fname).parent
        grid_file, = [file for file in os.listdir(dirname) if file.endswith('_grid.nc')]
        ds = xr.open_dataset(dirname / grid_file)  # os.listdir returns bare names, so join back onto the directory
        coord = ds[COORD]

    This requires the user to start poking around in directory structures to try to work out where to load their data from - which is exactly the problem intake is trying to solve.

This has caused some pain points when migrating COSIMA recipes from cosima_cookbook => intake.

I also think this might be the same issue as discussed in #63? @aidanheerdegen - there seem to be some concerns about coordinates being listed as variables when they shouldn't be there?

Describe the feature you'd like

Searchable coordinates: in the same way that the catalog currently lets you perform searches over variables, it would be useful to be able to do the same on coordinates:

var = cat.search(variable=VARN).to_dask()
coord = cat.search(coord=COORD).to_dask()

Doing this is subject to a couple of constraints:

  1. The catalog needs to know that coordinates & data variables aren't the same & need to be treated differently - xr.combine_by_coords will fail if passed a coordinate variable.
  2. This requires an upstream patch of intake-esm - see issue 660.
  3. Cannot cause serious performance regressions - see above issue.

Proposed Solution

  1. Update intake-esm with @dougiesquire's proposed solution from issue 660.
  2. Add separate coordinate variable fields to the ACCESS-NRI Intake Catalog, rather than just making the same change as in Intake-ESM (data_vars => variables), as the latter would confuse coordinates & variables in the ACCESS-NRI Intake Catalog as well as causing concatenation issues. This is implemented on branch 660-coordinate-variables; a rough sketch of the idea is given below.
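
To make the intent concrete, here is a rough sketch of what recording coordinates separately at catalogue-build time could look like. It is illustrative only - parse_coords and the coords column name are assumptions, not the actual implementation on the branch:

import xarray as xr

def parse_coords(path: str) -> dict:
    # Hypothetical builder step: record data variables and coordinates in
    # separate columns so the catalogue can tell them apart later, rather
    # than folding coordinates into the existing variable column.
    with xr.open_dataset(path, chunks={}, decode_times=False) as ds:
        return {
            "path": path,
            "variable": list(ds.data_vars),  # searchable, as now
            "coords": list(ds.coords),       # new, separately searchable
        }

A row built this way keeps coordinates out of the variable list that intake-esm passes to xarray, which is what avoids the concatenation problem mentioned above.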

Additional Info

aidanheerdegen commented 1 month ago

I also think this might be the same issue as discussed in https://github.com/ACCESS-NRI/access-nri-intake-catalog/issues/63? @aidanheerdegen - there seem to be some concerns about coordinates being listed as variables when they shouldn't be there?

That was specifically for a project where we were using the intake catalogue as a source for an "experiment explorer", to expose the variables saved in an experiment on a timeline and assist users in understanding what variables are available at different times in an experiment. For this purpose we really only wanted diagnostic model variables that have a time-varying component.

Add separate coordinate variable fields to the ACCESS-NRI Intake Catalog

I'm confused. Does this mean

  1. Have a "coordinate" flag (field)?
  2. Move coordinates into a separate catalogue? or neither?

BTW this is a somewhat related issue I think about encoding grid information:

https://github.com/ACCESS-NRI/access-nri-intake-catalog/issues/112

charles-turner-1 commented 1 month ago

Yup, a coordinate flag, sorry for the confusion with terms.

As currently implemented, this would just add coordinates & coordinate-relevant information to the catalogue - I've tried to keep changes to a minimum. We could think about having a separate catalogue specifically for coordinates, but to me that feels like it would be a bit harder for users?

I've attached a screencap with a demo of how the change affects catalogue results (N.B. both are run using the same kernel, so cat_without_coords.search(coords='st_edges_ocean') returns an empty search - running it on main would raise an error).

[Screenshot: 2024-09-30 at 12:54 PM]
charles-turner-1 commented 1 month ago

Also would be interested in knowing if this would cover your use cases @marc-white @anton-seaice

marc-white commented 1 month ago

@charles-turner-1 my main concern to this point hasn't been how to search for the coordinates, but how to access them. In particular:

charles-turner-1 commented 1 month ago

@marc-white In response to question 1, currently searching for a coordinate will load the entire dataset - it works very much like searching other metadata, eg.

>>> ocean_files = cat_with_coords.search(start_date='2086.*',frequency='1mon',file_id='ocean',).to_dask()
>>> print(ocean_files)
<xarray.Dataset> Size: 847GB
Dimensions:                (time: 12, st_ocean: 75, yt_ocean: 2700,
                            xt_ocean: 3600, yu_ocean: 2700, xu_ocean: 3600,
                            sw_ocean: 75, potrho: 80, grid_yt_ocean: 2700,
                            grid_xu_ocean: 3600, grid_yu_ocean: 2700,
                            grid_xt_ocean: 3600, neutral: 80, nv: 2,
                            st_edges_ocean: 76, sw_edges_ocean: 76,
                            potrho_edges: 81, neutralrho_edges: 81)
Coordinates: (12/18)
  * xt_ocean               (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * yt_ocean               (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
  * st_ocean               (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
  * st_edges_ocean         (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
  * time                   (time) object 96B 2086-01-16 12:00:00 ... 2086-12-...
  * nv                     (nv) float64 16B 1.0 2.0
    ...                     ...
  * potrho                 (potrho) float64 640B 1.028e+03 ... 1.038e+03
  * potrho_edges           (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
  * grid_xt_ocean          (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
  * grid_yu_ocean          (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 90.0
  * neutral                (neutral) float64 640B 1.028e+03 ... 1.038e+03
  * neutralrho_edges       (neutralrho_edges) float64 648B 1.028e+03 ... 1.03...
Data variables: (12/28)
    temp                   (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    pot_temp               (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    salt                   (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    age_global             (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    u                      (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    v                      (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    ...                     ...
    bih_fric_v             (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    u_dot_grad_vert_pv     (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array<chunksize=(1, 7, 300, 400), meta=np.ndarray>
    average_T1             (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    average_T2             (time) datetime64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    average_DT             (time) timedelta64[ns] 96B dask.array<chunksize=(3,), meta=np.ndarray>
    time_bounds            (time, nv) timedelta64[ns] 192B dask.array<chunksize=(1, 2), meta=np.ndarray>
Attributes: (12/22)
    filename:                                 ocean.nc
    title:                                    ACCESS-OM2-01
    grid_type:                                mosaic
    grid_tile:                                1
    intake_esm_vars:                          ['temp', 'pot_temp', 'salt', 'a...
    intake_esm_attrs:filename:                ocean.nc
    ...                                       ...
    intake_esm_attrs:coord_calendar_types:    ['', '', '', '', 'NOLEAP', '', ...
    intake_esm_attrs:coord_bounds:            ['', '', '', '', 'time_bounds',...
    intake_esm_attrs:coord_units:             ['degrees_E', 'degrees_N', 'met...
    intake_esm_attrs:realm:                   ocean
    intake_esm_attrs:_data_format_:           netcdf
    intake_esm_dataset_key:                   ocean.1mon

I like the suggestion that we only load:

  1. data variables that depend on the coordinate variable searched for, and
  2. other coordinate variables those data variables depend on.

I'll have a look into how straightforward that might be to implement - a quick xarray sketch of the subsetting itself is below.
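
A minimal sketch, assuming ds is the opened dataset and we searched for st_edges_ocean (the hard part is doing this at the catalogue level, before the files are opened):

searched = 'st_edges_ocean'
# Keep only data variables that use the searched coordinate as a dimension;
# xarray then carries along just the coordinates those variables depend on.
deps = [name for name in ds.data_vars if searched in ds[name].dims]
subset = ds[deps]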

    • With regards to the second question, searching for a coordinate & obtaining a multi-file dataset works fine right now, so I think this approach has sidestepped the crashing issue:
      
      >>> st_edges_search = cat_with_coords.search(coords='st_edges_ocean',frequency='1mon',file_id='ocean',start_date = '2086.*')
      >>> print(st_edges_search.df.head(3))
      filename file_id                                               path  \
      0  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...   
      1  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...   
      2  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...   

         filename_timestamp frequency            start_date              end_date  \
      0                 NaN      1mon  2086-01-01, 00:00:00  2086-04-01, 00:00:00
      1                 NaN      1mon  2086-04-01, 00:00:00  2086-07-01, 00:00:00
      2                 NaN      1mon  2086-07-01, 00:00:00  2086-10-01, 00:00:00

st_edges = st_edges_search.to_dask()
print(st_edges)

<xarray.Dataset> Size: 847GB
Dimensions:                (time: 12, st_ocean: 75, yt_ocean: 2700,
                            xt_ocean: 3600, yu_ocean: 2700, xu_ocean: 3600,
                            sw_ocean: 75, potrho: 80, grid_yt_ocean: 2700,
                            grid_xu_ocean: 3600, grid_yu_ocean: 2700,
                            grid_xt_ocean: 3600, neutral: 80, nv: 2,
                            st_edges_ocean: 76, sw_edges_ocean: 76,
                            potrho_edges: 81, neutralrho_edges: 81)
Coordinates: (12/18)
  * xt_ocean               (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * yt_ocean               (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
  * st_ocean               (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
  * st_edges_ocean         (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
  * time                   (time) object 96B 2086-01-16 12:00:00 ... 2086-12-...
  * nv                     (nv) float64 16B 1.0 2.0
    ...                     ...
  * potrho                 (potrho) float64 640B 1.028e+03 ... 1.038e+03
  * potrho_edges           (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
  * grid_xt_ocean          (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.95
  * grid_yu_ocean          (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 90.0
  * neutral                (neutral) float64 640B 1.028e+03 ... 1.038e+03
  * neutralrho_edges       (neutralrho_edges) float64 648B 1.028e+03 ... 1.03...
Data variables: (12/28)
    temp                   (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
    pot_temp               (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
    salt                   (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
    age_global             (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
    u                      (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array
    v                      (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array
    ...                     ...
    bih_fric_v             (time, st_ocean, yu_ocean, xu_ocean) float32 35GB dask.array
    u_dot_grad_vert_pv     (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
    average_T1             (time) datetime64[ns] 96B dask.array
    average_T2             (time) datetime64[ns] 96B dask.array
    average_DT             (time) timedelta64[ns] 96B dask.array
    time_bounds            (time, nv) timedelta64[ns] 192B dask.array
Attributes: (12/22)
    filename:                                 ocean.nc
    title:                                    ACCESS-OM2-01
    grid_type:                                mosaic
    grid_tile:                                1
    intake_esm_vars:                          ['temp', 'pot_temp', 'salt', 'a...
    intake_esm_attrs:filename:                ocean.nc
    ...                                       ...
    intake_esm_attrs:coord_calendar_types:    ['', '', '', '', 'NOLEAP', '', ...
    intake_esm_attrs:coord_bounds:            ['', '', '', '', 'time_bounds',...
    intake_esm_attrs:coord_units:             ['degrees_E', 'degrees_N', 'met...
    intake_esm_attrs:realm:                   ocean
    intake_esm_attrs:_data_format_:           netcdf
    intake_esm_dataset_key:                   ocean.1mon

As before, searching for a variable will only load relevant data, eg:

pot_temp_search = cat_with_coords.search(variable='pot_temp',frequency='1mon',file_id='ocean',start_date = '2086.*')
print(pot_temp_search.df.head(3))

   filename file_id                                               path  \
0  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...
1  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...
2  ocean.nc   ocean  /g/data/ik11/outputs/access-om2-01/01deg_jra55...

   filename_timestamp frequency            start_date              end_date  \
0                 NaN      1mon  2086-01-01, 00:00:00  2086-04-01, 00:00:00
1                 NaN      1mon  2086-04-01, 00:00:00  2086-07-01, 00:00:00
2                 NaN      1mon  2086-07-01, 00:00:00  2086-10-01, 00:00:00
...

pot_temp = pot_temp_search.to_dask()
print(pot_temp)

<xarray.Dataset> Size: 35GB
Dimensions:   (time: 12, st_ocean: 75, yt_ocean: 2700, xt_ocean: 3600)
Coordinates:
  * xt_ocean  (xt_ocean) float64 29kB -279.9 -279.8 -279.7 ... 79.75 79.85 79.95
  * yt_ocean  (yt_ocean) float64 22kB -81.11 -81.07 -81.02 ... 89.89 89.94 89.98
  * st_ocean  (st_ocean) float64 600B 0.5413 1.681 2.94 ... 5.511e+03 5.709e+03
  * time      (time) object 96B 2086-01-16 12:00:00 ... 2086-12-16 12:00:00
Data variables:
    pot_temp  (time, st_ocean, yt_ocean, xt_ocean) float32 35GB dask.array
Attributes: (12/22)
    filename:                                 ocean.nc
    title:                                    ACCESS-OM2-01
    grid_type:                                mosaic
    grid_tile:                                1
    intake_esm_vars:                          ['pot_temp']
    intake_esm_attrs:filename:                ocean.nc
    ...                                       ...
    intake_esm_attrs:coord_calendar_types:    ['', '', '', '', 'NOLEAP', '', ...
    intake_esm_attrs:coord_bounds:            ['', '', '', '', 'time_bounds',...
    intake_esm_attrs:coord_units:             ['degrees_E', 'degrees_N', 'met...
    intake_esm_attrs:realm:                   ocean
    intake_esm_attrs:_data_format_:           netcdf
    intake_esm_dataset_key:                   ocean.1mon
anton-seaice commented 1 month ago

Instead of this:

st_edges_search = cat_with_coords.search(coords='st_edges_ocean',frequency='1mon',file_id='ocean',start_date = '2086.*')
st_edges = st_edges_search.to_dask()

We should just be able to do:

cat_with_coords.search(coords='st_edges_ocean').to_dask()

(possibly the file_id part is still needed)

'st_edges_ocean' doesn't change over time, so we should check that all files have the same values, but having to supply start_date = '2086.*' is messy. And then calling to_dask would ideally return only the coordinates (without the data_vars), e.g.:

[Screenshot: 2024-10-01 at 11:11 AM]

or

[Screenshot: 2024-10-01 at 11:12 AM]
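
In plain xarray terms, returning only the coordinates presumably just means dropping the data variables after opening - a one-line sketch, assuming st_edges is the dataset opened from the earlier search:

# Drop every data variable, leaving just the coordinates (and their attributes).
coords_only = st_edges.drop_vars(list(st_edges.data_vars))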
charles-turner-1 commented 1 month ago

I've looked into it, & I think adding data_vars that depend on the coordinate vars we've searched is going to:

  1. Probably require modifications to ecgtools (https://github.com/ncar-xdev/ecgtools/).
  2. Potentially cause memory usage blowups.

Subsetting down to just a few files by providing a year etc, eg.

st_edges_search = cat_with_coords.search(coords='st_edges_ocean',frequency='1mon',file_id='ocean',start_date = '2086.*')
st_edges = st_edges_search.to_dask()

is messy, I agree - in the search above a large ARE instance will crash with default chunking - seems to be the result of Dask trying to set one chunk per file. Subsetting down to just a few files (start_date = '2086.*') resolves this. I'd rather not change default chunking when we perform coordinate searches - it seems messy and error prone - depending on the coordinate we might load most of the data variables in the file.

With the current implementation, we can easily search & load the dataset as follows:

>>> st_edges_search = cat_with_coords.search(coords='st_edges_ocean',frequency='1mon',file_id='ocean')
>>> varnames = st_edges_search.df.loc[0,'variable']
>>> print(st_edges_search.to_dask(xarray_open_kwargs = {'drop_variables' : varnames}))
<xarray.Dataset> Size: 229kB
Dimensions:           (xt_ocean: 3600, yt_ocean: 2700, st_ocean: 75,
                       st_edges_ocean: 76, time: 2760, nv: 2, xu_ocean: 3600,
                       yu_ocean: 2700, sw_ocean: 75, sw_edges_ocean: 76,
                       grid_xu_ocean: 3600, grid_yt_ocean: 2700, potrho: 80,
                       potrho_edges: 81, grid_xt_ocean: 3600,
                       grid_yu_ocean: 2700, neutral: 80, neutralrho_edges: 81)
Coordinates: (12/18)
  * xt_ocean          (xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * yt_ocean          (yt_ocean) float64 22kB -81.11 -81.07 ... 89.94 89.98
  * st_ocean          (st_ocean) float64 600B 0.5413 1.681 ... 5.709e+03
  * st_edges_ocean    (st_edges_ocean) float64 608B 0.0 1.083 ... 5.809e+03
  * time              (time) object 22kB 1950-01-16 12:00:00 ... 2179-12-16 1...
  * nv                (nv) float64 16B 1.0 2.0
    ...                ...
  * potrho            (potrho) float64 640B 1.028e+03 1.028e+03 ... 1.038e+03
  * potrho_edges      (potrho_edges) float64 648B 1.028e+03 ... 1.038e+03
  * grid_xt_ocean     (grid_xt_ocean) float64 29kB -279.9 -279.8 ... 79.85 79.95
  * grid_yu_ocean     (grid_yu_ocean) float64 22kB -81.09 -81.05 ... 89.96 90.0
  * neutral           (neutral) float64 640B 1.028e+03 1.028e+03 ... 1.038e+03
  * neutralrho_edges  (neutralrho_edges) float64 648B 1.028e+03 ... 1.038e+03
Data variables:
    *empty*
Attributes: (12/16)
    filename:                                 ocean.nc
    title:                                    ACCESS-OM2-01
    grid_type:                                mosaic
    grid_tile:                                1
    intake_esm_attrs:filename:                ocean.nc
    intake_esm_attrs:file_id:                 ocean
    ...                                       ...
    intake_esm_attrs:coord_calendar_types:    ['', '', '', '', 'NOLEAP', '', ''...
    intake_esm_attrs:coord_bounds:            ['', '', '', '', 'time_bounds', '...
    intake_esm_attrs:coord_units:             ['degrees_E', 'degrees_N', 'meter...
    intake_esm_attrs:realm:                   ocean
    intake_esm_attrs:_data_format_:           netcdf
    intake_esm_dataset_key:                   ocean.1mon

I think the most straightforward solution here is going to be to get searches on coordinates to drop all data variables by default.
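
Wrapping that workaround up, a hypothetical convenience helper could look like the following (search_coords is illustrative and assumes every matched row lists the same data variables - it is not part of the branch):

def search_coords(cat, coord, **search_kwargs):
    # Hypothetical helper: search on a coordinate, then open a coords-only
    # dataset by asking xarray to drop the data variables recorded for the
    # matching files.
    sub = cat.search(coords=coord, **search_kwargs)
    varnames = sub.df.loc[0, 'variable']  # data variables present in the matched files
    return sub.to_dask(xarray_open_kwargs={'drop_variables': varnames})

# e.g.
# st_edges = search_coords(cat_with_coords, 'st_edges_ocean', frequency='1mon', file_id='ocean')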
anton-seaice commented 1 month ago

I think the most straightforward solution here is going to be to get searches on coordinates to drop all data variables by default.

This makes sense to me - generally you are searching for a coordinate when trying to load it from a different file than the one the variable is in anyway, so you'll need a separate search & .to_dask operation. E.g., you might want SST and area_t.

and this is where I'm not sure they should have their own field:

e.g. cat.search(variable='SST', coords='area_t')

would return 0 datasets?

And we would have to do this

cat.search(variable='SST')
cat.search(coords='area_t')

to return two datasets that then need to be combined?

but a different implementation would allow cat.search(variable=['area_t','SST']) to return the two datasets we want?
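
For reference, combining the results of those two separate searches would presumably just be an xarray merge - a sketch, assuming the SST and area_t datasets sit on compatible grids:

import xarray as xr

# One search for the data variable, one for the grid coordinate,
# merged into a single dataset for analysis.
sst = cat.search(variable='SST').to_dask()
area = cat.search(coords='area_t').to_dask()
ds = xr.merge([sst, area])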

charles-turner-1 commented 1 month ago

I spent a good chunk of time thinking about this yesterday afternoon. I've got a couple of thoughts:

The major difficulty (that I'm thinking about right now) is that adding coordinate variables to the variables we search leads to intake-esm telling xarray to request that variable when it loads the dataset. I'm looking into whether we can add a step to check whether variables are data or coordinate variables before intake-esm actually opens and begins to concatenate the datasets.
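
One way such a check might look (a sketch only - the function name and where it would hook into intake-esm are assumptions, not existing API):

import xarray as xr

def partition_requested(path, requested):
    # Hypothetically split the requested names into data variables and
    # coordinates by peeking at one file's metadata, before intake-esm
    # subsets and concatenates the datasets.
    with xr.open_dataset(path, chunks={}, decode_times=False) as ds:
        data_vars = [name for name in requested if name in ds.data_vars]
        coords = [name for name in requested if name in ds.coords]
    return data_vars, coords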

If not, I'm not sure that there's a straightforward solution - bundling coordinate variables in with data variables effectively loses the information that the two are different (and need to be treated differently) until we actually open the files.

TLDR; I'm gonna keep prodding - I agree that it would be nice to be able to search for coordinates or data variables, rather than coordinates and data variables.

anton-seaice commented 1 month ago
  • we're going to obtain two datasets from the search anyway, right? And so downstream of the search we're necessarily going to have something like:

Ah yes - I think my point was just ... do we need an extra field, can we just put the coordinates in the variable field?

I'm looking into whether we can add a step to check whether variables are data or coordinate variables before intake-esm actually opens and begins to concatenate the datasets.

I think we need this for #117 anyway, right?

charles-turner-1 commented 1 month ago

I see your point - I think if we can put the coordinates into the variable field then that would be preferable - it doesn't add any user complexity.

And yeah, looks like #117 is the same issue. In fact, it looks like the n=1 argument @dougiesquire mentions there is the workaround cosima_cookbook used - which makes me suspect this problem is hiding in a few other places too.