ACCESS-NRI / access-nri-intake-catalog

Tools and configuration info used to manage ACCESS-NRI's intake catalogue
https://access-nri-intake-catalog.rtfd.io
Apache License 2.0

Exclude (or flag) coordinates #63

Closed aidanheerdegen closed 2 weeks ago

aidanheerdegen commented 1 year ago

It would be good to make it possible to only show diagnostic variables in the catalogue, by excluding all coordinates.

This could be achieved by not including coordinate variables in the catalogue at all: they are accessible when variables are loaded that contain those coordinates.

Another option would be to include a column with a flag that indicates whether or not the variable is a coordinate. That way, coordinates could be filtered out by the user (I assume).

aidanheerdegen commented 1 year ago

It seems that there are variables that are listed as coordinates that still make it into the catalogue, and others that aren't classified as coordinates but we would also want to exclude.

e.g.

/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output002/ice/OUTPUT/iceh.1900-07.nc

In this catalogue:

/g/data/xp65/public/apps/access-nri-intake-catalog/v0.0.9/source/01deg_jra55v13_ryf9091.csv.gz   

that datafile contains the following variables:

['time_bounds', 'TLON', 'TLAT', 'ULON', 'ULAT', 'NCAT', 'tmask', 'blkmask', 'tarea', 'uarea', 'dxt', 'dyt', 'dxu', 'dyu', 'HTN', 'HTE', 'ANGLE', 'ANGLET']

Opening with xarray:

In [2]: ds = xr.open_dataset('/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output002/ice/OUTPUT/iceh.1900-07.nc')

In [3]: ds                                  
Out[3]:                                     
<xarray.Dataset>                            
Dimensions:       (time: 1, d2: 2, nj: 2700, ni: 3600, nc: 5)
Coordinates:                                    
  * time          (time) object 1900-08-01 00:00:00
    TLON          (nj, ni) float32 ...          
    TLAT          (nj, ni) float32 ...
    ULON          (nj, ni) float32 ...
    ULAT          (nj, ni) float32 ...
    NCAT          (nc) float32 ...
Dimensions without coordinates: d2, nj, ni, nc
Data variables: (12/49)
    time_bounds   (time, d2) object ...
    tmask         (nj, ni) float32 ...
    blkmask       (nj, ni) float32 ...
    tarea         (nj, ni) float32 ...
    uarea         (nj, ni) float32 ...
    dxt           (nj, ni) float32 ...
    ...            ...
    fmeltt_ai_m   (time, nj, ni) float32 ...
    opening_m     (time, nj, ni) float32 ...
    aicen_m       (time, nc, nj, ni) float32 ...
    vicen_m       (time, nc, nj, ni) float32 ...
    fmelttn_ai_m  (time, nc, nj, ni) float32 ...
    flatn_ai_m    (time, nc, nj, ni) float32 ...
Attributes:
    title:        sea ice model output for CICE
    contents:     Diagnostic and Prognostic Variables
    source:       Los Alamos Sea Ice Model (CICE) Version 5
    comment:      This Year Has 365 days
    comment2:     File written on model date 19000801
    comment3:     seconds elapsed into model date:      0
    conventions:  CF-1.0
    history:      This dataset was created on 2019-07-01 at 19:21:08.7
    io_flavor:    io_netcdf

TLON etc are correctly listed as coordinates by xarray:

In [4]: ds.coords                                                                                                    
Out[4]:                                                                                                            
Coordinates:                                                                                                         
  * time     (time) object 1900-08-01 00:00:00                                                                     
    TLON     (nj, ni) float32 ...                                                                                    
    TLAT     (nj, ni) float32 ...                                                                                  
    ULON     (nj, ni) float32 ...                                                             
    ULAT     (nj, ni) float32 ...                                                                                                  
    NCAT     (nc) float32 ...                                                                 

and not in data_vars

In [5]: ds.data_vars                                                                              
Out[5]:                                                                                           
Data variables:                                                                                   
    time_bounds   (time, d2) object ...                                                                               
    tmask         (nj, ni) float32 ...                                                            
    blkmask       (nj, ni) float32 ...                                                                                                                 
    tarea         (nj, ni) float32 ...              
    uarea         (nj, ni) float32 ...                                 
    dxt           (nj, ni) float32 ...                       
    dyt           (nj, ni) float32 ...                              
    dxu           (nj, ni) float32 ...      
    dyu           (nj, ni) float32 ...      
    HTN           (nj, ni) float32 ...      
    HTE           (nj, ni) float32 ...                                                                                        
    ANGLE         (nj, ni) float32 ...      
    ANGLET        (nj, ni) float32 ...      
    hi_m          (time, nj, ni) float32 ...
    hs_m          (time, nj, ni) float32 ...
    Tsfc_m        (time, nj, ni) float32 ...                 
    aice_m        (time, nj, ni) float32 ...    
    uvel_m        (time, nj, ni) float32 ...       
    vvel_m        (time, nj, ni) float32 ...    
    uatm_m        (time, nj, ni) float32 ...
    vatm_m        (time, nj, ni) float32 ...
    fswup_m       (time, nj, ni) float32 ...
    sst_m         (time, nj, ni) float32 ...
    sss_m         (time, nj, ni) float32 ...  
    uocn_m        (time, nj, ni) float32 ...
    vocn_m        (time, nj, ni) float32 ...
    alvdr_ai_m    (time, nj, ni) float32 ...
    alidr_ai_m    (time, nj, ni) float32 ...
    alvdf_ai_m    (time, nj, ni) float32 ...
    alidf_ai_m    (time, nj, ni) float32 ...
    congel_m      (time, nj, ni) float32 ...
    frazil_m      (time, nj, ni) float32 ...
    fsalt_m       (time, nj, ni) float32 ...
    fsalt_ai_m    (time, nj, ni) float32 ...
    strairx_m     (time, nj, ni) float32 ...    
    strairy_m     (time, nj, ni) float32 ...    
    strength_m    (time, nj, ni) float32 ...    
    divu_m        (time, nj, ni) float32 ...    
    shear_m       (time, nj, ni) float32 ...
    sig1_m        (time, nj, ni) float32 ...   
    sig2_m        (time, nj, ni) float32 ...         
    mlt_onset_m   (time, nj, ni) float32 ...               
    frz_onset_m   (time, nj, ni) float32 ...
    fmeltt_ai_m   (time, nj, ni) float32 ...         
    opening_m     (time, nj, ni) float32 ...             
    aicen_m       (time, nc, nj, ni) float32 ...
    vicen_m       (time, nc, nj, ni) float32 ...                      
    fmelttn_ai_m  (time, nc, nj, ni) float32 ...
    flatn_ai_m    (time, nc, nj, ni) float32 ...

So I'm not sure why TLON etc. are in the catalogue. This is version 0.0.9: should that version already exclude coordinates?

It might be worth using cf_xarray to identify variables

In [10]: ds.cf
Out[10]: 
Coordinates:
             CF Axes:   X, Y, Z, T: n/a

      CF Coordinates:   longitude: ['TLON', 'ULON']
                        latitude: ['TLAT', 'ULAT']
                        vertical, time: n/a

       Cell Measures:   area, volume: n/a

      Standard Names:   n/a

              Bounds:   n/a

       Grid Mappings:   n/a

Data Variables:
       Cell Measures:   area: ['tarea', 'uarea']
                        volume: n/a

      Standard Names:   n/a

              Bounds:   time: ['time_bounds']

       Grid Mappings:   n/a

to exclude bounds variables

In [11]: ds.cf.bounds
Out[11]: {'time': ['time_bounds']}

or even extract information to augment grid identification (#112)

In [12]: ds.cf.cell_measures
Out[12]: {'area': ['tarea', 'uarea']}
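
For illustration, the same grouping can be sketched without cf_xarray by reading the CF attributes directly. This is a minimal sketch, not how cf_xarray or the catalogue builder works internally: the function name `cf_roles` is hypothetical, and the attribute values in `example_attrs` are made up to resemble the file above rather than copied from it.

```python
def cf_roles(variables):
    """Group variable names by the CF attribute that references them.

    `variables` maps each variable name to its attribute dict.
    """
    roles = {"coordinates": set(), "bounds": set(), "cell_measures": set()}
    for attrs in variables.values():
        # "coordinates" is a space-separated list of auxiliary coordinate names
        roles["coordinates"].update(attrs.get("coordinates", "").split())
        # "bounds" names a single boundary variable
        if "bounds" in attrs:
            roles["bounds"].add(attrs["bounds"])
        # "cell_measures" holds "measure: name" pairs, e.g. "area: tarea"
        tokens = attrs.get("cell_measures", "").split()
        roles["cell_measures"].update(t for t in tokens if not t.endswith(":"))
    # Only report names that actually exist in the file
    return {role: sorted(names.intersection(variables)) for role, names in roles.items()}


# Illustrative attributes only (assumed, not read from the real file)
example_attrs = {
    "time": {"bounds": "time_bounds"},
    "time_bounds": {},
    "TLON": {"units": "degrees_east"},
    "TLAT": {"units": "degrees_north"},
    "tarea": {"units": "m^2"},
    "aice_m": {"coordinates": "TLON TLAT time", "cell_measures": "area: tarea"},
}

roles = cf_roles(example_attrs)
auxiliary = set().union(*roles.values())
# Whatever is not referenced by another variable is treated as a diagnostic
diagnostics = [name for name in example_attrs if name not in auxiliary]
```

Under these assumptions only `aice_m` is left in `diagnostics`; the other variables are each referenced by a `coordinates`, `bounds` or `cell_measures` attribute.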

dougiesquire commented 1 year ago

Thanks for reporting @aidanheerdegen. At the moment, only 1D coordinates are excluded since files are opened with decode_cf=False and decode_coords=False. This is why TLON etc are still in the Intake-ESM datastores.

We can obviously easily switch these flags to True, which in the case of your example file would then also exclude TLON, TLAT, ULON, ULAT and NCAT. This will add some overhead, but may need to be done anyway for extracting grid information.

Thinking more about this, I'm not sure we want to exclude coordinates, since I think it is useful to be able to search on these. In ACCESS-OM2 output, for example, grid information is included in a separate ocean_grid.nc file that users might want to search for based on coordinate names. So flagging may be a better approach after all.

"and others that aren't classified as coordinates but we would also want to exclude."

I don't think I understand how you're suggesting we flag these in a robust way. In fact, it's not even clear to me what's in this list. I think you're referring to properties of the grid (e.g. dxt, dyt etc)? Does this also include areas? Is it things that aren't dependent on time?

dougiesquire commented 1 year ago

"So flagging may be a better approach after all."

It just occurred to me that flagging coordinates is going to be difficult with the way Intake-ESM is currently set up to handle multi-variable assets (files). Multi-variable assets are included as a single row in an Intake-ESM datastore, with a column containing a list of the variables available in that file - see here. There's thus no easy way to include per-variable attributes in the table.
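
To make the limitation concrete, here is a rough sketch in plain Python rather than Intake-ESM itself. The column names `path`, `variable` and `is_coordinate`, the shortened file names, the grid variable names and the flag values are all assumptions chosen for illustration.

```python
# One datastore row per file; the "variable" column holds a list of names.
rows = [
    {"path": "iceh.1900-07.nc", "variable": ["TLON", "TLAT", "tarea", "aice_m"]},
    {"path": "ocean_grid.nc", "variable": ["geolon_t", "geolat_t"]},
]


def search(rows, name):
    """Return the paths of all rows whose variable list contains `name`."""
    return [row["path"] for row in rows if name in row["variable"]]


# A per-variable flag has no single cell to live in: the row-level value
# would have to be a second list kept in step with "variable".
flagged = dict(rows[0], is_coordinate=[True, True, False, False])
diagnostics = [
    name
    for name, is_coord in zip(flagged["variable"], flagged["is_coordinate"])
    if not is_coord
]
```

Searching on a coordinate name still works at the row level, but any per-variable attribute needs either a parallel list like `is_coordinate` or one row per variable.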

aidanheerdegen commented 1 year ago

Ok, thanks for the informative response @dougiesquire.

I agree, I think I was being a bit ... cavalier ... with my wish to filter out a large number of variables as "griddy". There are indeed good use cases for many of these.

For my use case I think we'll do what the COSIMA Explorer does: use some heuristics to "hide" variables that are potentially of little use to expose at a high level, but allow them to be shown if a user wishes.
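
As a rough sketch of what such a heuristic could look like (the rule here is an assumption for illustration, not the COSIMA Explorer's actual logic): hide anything that has no time dimension or whose name ends in `_bounds`, with a switch to show everything. The function name `visible_variables` is hypothetical; the dimensions are taken from the xarray output earlier in this thread.

```python
def visible_variables(var_dims, show_all=False):
    """Return variable names to display, hiding likely grid/auxiliary fields.

    `var_dims` maps each variable name to its tuple of dimension names.
    """
    if show_all:
        return list(var_dims)
    return [
        name
        for name, dims in var_dims.items()
        # Assumed heuristic: keep time-dependent fields that are not bounds
        if "time" in dims and not name.endswith("_bounds")
    ]


var_dims = {
    "time_bounds": ("time", "d2"),
    "tmask": ("nj", "ni"),
    "tarea": ("nj", "ni"),
    "HTN": ("nj", "ni"),
    "hi_m": ("time", "nj", "ni"),
    "aicen_m": ("time", "nc", "nj", "ni"),
}

default_view = visible_variables(var_dims)
full_view = visible_variables(var_dims, show_all=True)
```

With this rule the default view keeps `hi_m` and `aicen_m`, while `show_all=True` returns every variable, so nothing is removed from the underlying list.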

Regarding extracting grid information: that actually has the potential to be a useful way to exclude variables. If they exist in a higher-quality grid file, then use that file and "hide" the same variables where they are present in the data file. I am thinking specifically of un-masked grid coordinates. That may require more contemplation, so I'll just leave that thought bubble there and back away slowly ...