csiro-coasts / emsarray

xarray extension that supports EMS model formats
BSD 3-Clause "New" or "Revised" License
13 stars 2 forks source link

Unconfigured coordinate variables for SHOC Standard Datasets #132

Closed sharon-tickell closed 4 months ago

sharon-tickell commented 5 months ago

If you open a SHOC standard NetCDF file with emsarray, the xarray.Dataset has properties like:

<xarray.Dataset>
Dimensions:              (k_grid: 48, k_centre: 47, j_grid: 181, i_grid: 601,
                          j_centre: 180, i_centre: 600, j_left: 180,
                          i_left: 601, j_back: 181, i_back: 600, record: 8)
Coordinates:
    z_centre             (k_centre) float64 ...
    y_grid               (j_grid, i_grid) float64 ...
    y_centre             (j_centre, i_centre) float64 ...
    y_left               (j_left, i_left) float64 ...
    y_back               (j_back, i_back) float64 ...
Dimensions without coordinates: k_grid, k_centre, j_grid, i_grid, j_centre,
                                i_centre, j_left, i_left, j_back, i_back, record
Data variables: (12/95)
    z_grid               (k_grid) float64 -4e+03 -3.78e+03 ... 1.0 1e+20
    x_grid               (j_grid, i_grid) float64 142.1 142.2 ... 156.9 156.9
    x_centre             (j_centre, i_centre) float64 142.2 142.2 ... 156.9
    x_left               (j_left, i_left) float64 142.2 142.2 ... 156.9 156.9
    x_back               (j_back, i_back) float64 142.2 142.2 ... 156.9 156.9
    botz                 (j_centre, i_centre) float64 99.0 99.0 ... -4e+03
    ...                   ...
    dens                 (record, k_centre, j_centre, i_centre) float32 0.0 ....
    omega_ar_expose      (record, k_centre, j_centre, i_centre) float32 3.182...
    omega_ar_ex_time     (record, k_centre, j_centre, i_centre) timedelta64[ns] ...
    omega_ar_expose_3p5  (record, k_centre, j_centre, i_centre) float32 1.478...
    source               (record, k_centre, j_centre, i_centre) float32 0.0 ....
    Age                  (record, k_centre, j_centre, i_centre) float32 0.257...
Attributes: (12/18)
    title:                    GBR4_H2p0_B3p1_Cfur_Dnrt
    codehead:                 CSIRO Environmental Modelling Suite
    paramhead:                eReefs 4 km grid. Mile Furnas Catchment constan...
    description:              eReefs GBR 4k grid with rivers. Uses JCU bathy ...
    paramfile:                /datasets/work/ereefs/work/work/cem-worker/nrt_...
    ems_version:              v1.4.0 rev(6949)
    ...                       ...
    nce2:                     180
    nfe1:                     601
    nfe2:                     181
    ns3:                      2370250
    ns2:                      76327
    gridtype:                 NUMERICAL

Note that this includes quite a few dimensions which are not linked to coordinate variables, including the t variable which is actually the time-coordinate variable for the record dimension. Because that coordinate/dim combination is not configured, it is not possible to use xarray time coordinate slicing on a SHOC standard dataset.

It would be great if more (or all!) of the SHOC standard coordinate variables could be configured at https://github.com/csiro-coasts/emsarray/blob/7c1baf8c45380e6a2c8e85ab2ee7fe4650c8d9e2/src/emsarray/conventions/shoc.py#L44

sharon-tickell commented 5 months ago

Experiment 1 trying to find a workaround for the time dimension index:

time_coord = ds.ems.time_coordinate
time_dim = time_coord.dims[0]
if len(time_coord.indexes) == 0:
    ds = ds.set_coords(time_coord.name)
    ds = ds.set_index(indexes={time_dim: time_coord.name}, append=True)

this does make the record dimension have a coordinate index from t and allows time-coordinate slicing:

<xarray.Dataset>
Dimensions:              (k_grid: 48, k_centre: 47, j_grid: 181, i_grid: 601,
                          j_centre: 180, i_centre: 600, j_left: 180,
                          i_left: 601, j_back: 181, i_back: 600, record: 8)
Coordinates:
    z_centre             (k_centre) float64 ...
    y_grid               (j_grid, i_grid) float64 ...
    y_centre             (j_centre, i_centre) float64 ...
    y_left               (j_left, i_left) float64 ...
    y_back               (j_back, i_back) float64 ...
  * record               (record) datetime64[ns] 2022-07-30T14:00:00 ... 2022...
Dimensions without coordinates: k_grid, k_centre, j_grid, i_grid, j_centre,
                                i_centre, j_left, i_left, j_back, i_back
Data variables: (12/94)
    z_grid               (k_grid) float64 ...
    x_grid               (j_grid, i_grid) float64 ...
    x_centre             (j_centre, i_centre) float64 ...
    x_left               (j_left, i_left) float64 ...
    x_back               (j_back, i_back) float64 ...
    botz                 (j_centre, i_centre) float64 ...
    ...                   ...
    dens                 (record, k_centre, j_centre, i_centre) float32 ...
    omega_ar_expose      (record, k_centre, j_centre, i_centre) float32 ...
    omega_ar_ex_time     (record, k_centre, j_centre, i_centre) timedelta64[ns] ...
    omega_ar_expose_3p5  (record, k_centre, j_centre, i_centre) float32 ...
    source               (record, k_centre, j_centre, i_centre) float32 ...
    Age                  (record, k_centre, j_centre, i_centre) float32 ...
Attributes: (12/18)
    title:                    GBR4_H2p0_B3p1_Cfur_Dnrt
    codehead:                 CSIRO Environmental Modelling Suite
    paramhead:                eReefs 4 km grid. Mile Furnas Catchment constan...
    description:              eReefs GBR 4k grid with rivers. Uses JCU bathy ...
    paramfile:                /datasets/work/ereefs/work/work/cem-worker/nrt_...
    ems_version:              v1.4.0 rev(6949)
    ...                       ...
    nce2:                     180
    nfe1:                     601
    nfe2:                     181
    ns3:                      2370250
    ns2:                      76327
    gridtype:                 NUMERICAL

But it also uses the dimension name as the coordinate name (they should be different) and removes t from the list of data variables. This subsequently makes calls to ds.ems.time_coordinate fail with an error like SHOC dataset did not have expected time coordinate 't'

sharon-tickell commented 5 months ago

Experiment 2:

time_coord = ds.ems.time_coordinate
time_dim = time_coord.dims[0]
if len(time_coord.indexes) == 0:
    ds = ds.assign_coords(coords={time_dim: ds[time_coord.name]})

This also sets up an indexed coordinate variable that uses the dimension name:

<xarray.Dataset>
Dimensions:              (k_grid: 48, k_centre: 47, j_grid: 181, i_grid: 601,
                          j_centre: 180, i_centre: 600, j_left: 180,
                          i_left: 601, j_back: 181, i_back: 600, record: 8)
Coordinates:
    z_centre             (k_centre) float64 ...
    y_grid               (j_grid, i_grid) float64 ...
    y_centre             (j_centre, i_centre) float64 ...
    y_left               (j_left, i_left) float64 ...
    y_back               (j_back, i_back) float64 ...
  * record               (record) datetime64[ns] 2022-07-30T14:00:00 ... 2022...
Dimensions without coordinates: k_grid, k_centre, j_grid, i_grid, j_centre,
                                i_centre, j_left, i_left, j_back, i_back
Data variables: (12/95)
    z_grid               (k_grid) float64 -4e+03 -3.78e+03 ... 1.0 1e+20
    x_grid               (j_grid, i_grid) float64 142.1 142.2 ... 156.9 156.9
    x_centre             (j_centre, i_centre) float64 142.2 142.2 ... 156.9
    x_left               (j_left, i_left) float64 142.2 142.2 ... 156.9 156.9
    x_back               (j_back, i_back) float64 142.2 142.2 ... 156.9 156.9
    botz                 (j_centre, i_centre) float64 99.0 99.0 ... -4e+03
    ...                   ...
    dens                 (record, k_centre, j_centre, i_centre) float32 0.0 ....
    omega_ar_expose      (record, k_centre, j_centre, i_centre) float32 3.182...
    omega_ar_ex_time     (record, k_centre, j_centre, i_centre) timedelta64[ns] ...
    omega_ar_expose_3p5  (record, k_centre, j_centre, i_centre) float32 1.478...
    source               (record, k_centre, j_centre, i_centre) float32 0.0 ....
    Age                  (record, k_centre, j_centre, i_centre) float32 0.257...
Attributes: (12/18)
    title:                    GBR4_H2p0_B3p1_Cfur_Dnrt
    codehead:                 CSIRO Environmental Modelling Suite
    paramhead:                eReefs 4 km grid. Mile Furnas Catchment constan...
    description:              eReefs GBR 4k grid with rivers. Uses JCU bathy ...
    paramfile:                /datasets/work/ereefs/work/work/cem-worker/nrt_...
    ems_version:              v1.4.0 rev(6949)
    ...                       ...
    nce2:                     180
    nfe1:                     601
    nfe2:                     181
    ns3:                      2370250
    ns2:                      76327
    gridtype:                 NUMERICAL

BUT unlike the previous experiment, it doesn't remove the t data variable, so calls to ds.ems.time_coordinate still work. It's not quite right (I'm getting the impression that xarray can't cope very well with coordinate variables that have different names to their associated dimensions!), but it allows my use case to function well enough: they just have an extra 'record' coordinate variable floating about.

sharon-tickell commented 5 months ago

Possibly related? https://github.com/pydata/xarray/issues/4417

mx-moth commented 5 months ago

xarray handles indexing in a way that is not useful for SHOC standard datasets. Specifically it will only create indexes for coordinates whos variable name exactly matches the single dimension name it uses. Fixing this is (I think) part of https://github.com/pydata/xarray/projects/1, also see the discussion on the flexible indexing design document

As it stands, emsarray does not modify any datasets when they are opened. Adding an index on the time coordinate is useful, but doing so automatically poses some difficulties. As emsarray is exposed as an xarray accessor it isn't always involved in opening datasets. For instances where the dataset is opened through other means, emsarray is invoked lazily. This means there is no reliable way for emsarray to modify a dataset automatically in a way that will work consistently.

A new method such as dataset.ems.ensure_indexes() could be added that modifies a dataset in place to add all expected indexes. Users could call this method when they first open a dataset or receive the dataset from some other function, before calling any xarray methods that need indexing. I have written a similar function for another project:

def make_indexable(dataset: xarray.Dataset, name: Hashable) -> xarray.Dataset:
    """
    Ensure that `name` is an indexable coordinate in `dataset`.
    This is useful for adding indexes for coordinates whos name does not match the dimension name,
    e.g. a 'time' coordinate defined on the 'record' dimension.
    """
    if name not in dataset.coords:
        dataset = dataset.set_coords([name])
    if name not in dataset.indexes:
        dataset = dataset.set_xindex([name])
    return dataset

dataset = emsarray.open_dataset(...)
dataset = make_indexable(dataset, dataset.ems.get_time_name())
mx-moth commented 4 months ago

I am closing this issue, as the issue has little to do with emsarray itself. This has more to do with how xarray chooses what variables to automatically add indexes to. For the cases where xarray does not choose to add an index to a variable, the above comment has an applicable code example. The function uses no emsarray-specific functionality, and it is applicable to all dataset conventions where xarray may not add the desired indexes.