Closed sharon-tickell closed 4 months ago
Experiment 1 trying to find a workaround for the time dimension index:
time_coord = ds.ems.time_coordinate
time_dim = time_coord.dims[0]
if len(time_coord.indexes) == 0:
ds = ds.set_coords(time_coord.name)
ds = ds.set_index(indexes={time_dim: time_coord.name}, append=True)
this does make the record
dimension have a coordinate index from t
and allows time-coordinate slicing:
<xarray.Dataset>
Dimensions: (k_grid: 48, k_centre: 47, j_grid: 181, i_grid: 601,
j_centre: 180, i_centre: 600, j_left: 180,
i_left: 601, j_back: 181, i_back: 600, record: 8)
Coordinates:
z_centre (k_centre) float64 ...
y_grid (j_grid, i_grid) float64 ...
y_centre (j_centre, i_centre) float64 ...
y_left (j_left, i_left) float64 ...
y_back (j_back, i_back) float64 ...
* record (record) datetime64[ns] 2022-07-30T14:00:00 ... 2022...
Dimensions without coordinates: k_grid, k_centre, j_grid, i_grid, j_centre,
i_centre, j_left, i_left, j_back, i_back
Data variables: (12/94)
z_grid (k_grid) float64 ...
x_grid (j_grid, i_grid) float64 ...
x_centre (j_centre, i_centre) float64 ...
x_left (j_left, i_left) float64 ...
x_back (j_back, i_back) float64 ...
botz (j_centre, i_centre) float64 ...
... ...
dens (record, k_centre, j_centre, i_centre) float32 ...
omega_ar_expose (record, k_centre, j_centre, i_centre) float32 ...
omega_ar_ex_time (record, k_centre, j_centre, i_centre) timedelta64[ns] ...
omega_ar_expose_3p5 (record, k_centre, j_centre, i_centre) float32 ...
source (record, k_centre, j_centre, i_centre) float32 ...
Age (record, k_centre, j_centre, i_centre) float32 ...
Attributes: (12/18)
title: GBR4_H2p0_B3p1_Cfur_Dnrt
codehead: CSIRO Environmental Modelling Suite
paramhead: eReefs 4 km grid. Mile Furnas Catchment constan...
description: eReefs GBR 4k grid with rivers. Uses JCU bathy ...
paramfile: /datasets/work/ereefs/work/work/cem-worker/nrt_...
ems_version: v1.4.0 rev(6949)
... ...
nce2: 180
nfe1: 601
nfe2: 181
ns3: 2370250
ns2: 76327
gridtype: NUMERICAL
But it also uses the dimension name as the coordinate name (they should be different) and removes t
from the list of data variables. This subsequently makes calls to ds.ems.time_coordinate
fail with an error like SHOC dataset did not have expected time coordinate 't'
Experiment 2:
time_coord = ds.ems.time_coordinate
time_dim = time_coord.dims[0]
if len(time_coord.indexes) == 0:
ds = ds.assign_coords(coords={time_dim: ds[time_coord.name]})
This also sets up an indexed coordinate variable that uses the dimension name:
<xarray.Dataset>
Dimensions: (k_grid: 48, k_centre: 47, j_grid: 181, i_grid: 601,
j_centre: 180, i_centre: 600, j_left: 180,
i_left: 601, j_back: 181, i_back: 600, record: 8)
Coordinates:
z_centre (k_centre) float64 ...
y_grid (j_grid, i_grid) float64 ...
y_centre (j_centre, i_centre) float64 ...
y_left (j_left, i_left) float64 ...
y_back (j_back, i_back) float64 ...
* record (record) datetime64[ns] 2022-07-30T14:00:00 ... 2022...
Dimensions without coordinates: k_grid, k_centre, j_grid, i_grid, j_centre,
i_centre, j_left, i_left, j_back, i_back
Data variables: (12/95)
z_grid (k_grid) float64 -4e+03 -3.78e+03 ... 1.0 1e+20
x_grid (j_grid, i_grid) float64 142.1 142.2 ... 156.9 156.9
x_centre (j_centre, i_centre) float64 142.2 142.2 ... 156.9
x_left (j_left, i_left) float64 142.2 142.2 ... 156.9 156.9
x_back (j_back, i_back) float64 142.2 142.2 ... 156.9 156.9
botz (j_centre, i_centre) float64 99.0 99.0 ... -4e+03
... ...
dens (record, k_centre, j_centre, i_centre) float32 0.0 ....
omega_ar_expose (record, k_centre, j_centre, i_centre) float32 3.182...
omega_ar_ex_time (record, k_centre, j_centre, i_centre) timedelta64[ns] ...
omega_ar_expose_3p5 (record, k_centre, j_centre, i_centre) float32 1.478...
source (record, k_centre, j_centre, i_centre) float32 0.0 ....
Age (record, k_centre, j_centre, i_centre) float32 0.257...
Attributes: (12/18)
title: GBR4_H2p0_B3p1_Cfur_Dnrt
codehead: CSIRO Environmental Modelling Suite
paramhead: eReefs 4 km grid. Mile Furnas Catchment constan...
description: eReefs GBR 4k grid with rivers. Uses JCU bathy ...
paramfile: /datasets/work/ereefs/work/work/cem-worker/nrt_...
ems_version: v1.4.0 rev(6949)
... ...
nce2: 180
nfe1: 601
nfe2: 181
ns3: 2370250
ns2: 76327
gridtype: NUMERICAL
BUT unlike the previous experiment, it doesn't remove the t
data variable, so calls to ds.ems.time_coordinate
still work. It's not quite right (I'm getting the impression that xarray can't cope very well with coordinate variables that have different names to their associated dimensions!), but it allows my use case to function well enough: they just have an extra 'record' coordinate variable floating about.
Possibly related? https://github.com/pydata/xarray/issues/4417
xarray handles indexing in a way that is not useful for SHOC standard datasets. Specifically it will only create indexes for coordinates whos variable name exactly matches the single dimension name it uses. Fixing this is (I think) part of https://github.com/pydata/xarray/projects/1, also see the discussion on the flexible indexing design document
As it stands, emsarray does not modify any datasets when they are opened. Adding an index on the time coordinate is useful, but doing so automatically poses some difficulties. As emsarray is exposed as an xarray accessor it isn't always involved in opening datasets. For instances where the dataset is opened through other means, emsarray is invoked lazily. This means there is no reliable way for emsarray to modify a dataset automatically in a way that will work consistently.
A new method such as dataset.ems.ensure_indexes()
could be added that modifies a dataset in place to add all expected indexes. Users could call this method when they first open a dataset or receive the dataset from some other function, before calling any xarray methods that need indexing. I have written a similar function for another project:
def make_indexable(dataset: xarray.Dataset, name: Hashable) -> xarray.Dataset:
"""
Ensure that `name` is an indexable coordinate in `dataset`.
This is useful for adding indexes for coordinates whos name does not match the dimension name,
e.g. a 'time' coordinate defined on the 'record' dimension.
"""
if name not in dataset.coords:
dataset = dataset.set_coords([name])
if name not in dataset.indexes:
dataset = dataset.set_xindex([name])
return dataset
dataset = emsarray.open_dataset(...)
dataset = make_indexable(dataset, dataset.ems.get_time_name())
I am closing this issue, as the issue has little to do with emsarray itself. This has more to do with how xarray chooses what variables to automatically add indexes to. For the cases where xarray does not choose to add an index to a variable, the above comment has an applicable code example. The function uses no emsarray-specific functionality, and it is applicable to all dataset conventions where xarray may not add the desired indexes.
If you open a SHOC standard NetCDF file with emsarray, the
xarray.Dataset
has properties like:Note that this includes quite a few dimensions which are not linked to coordinate variables, including the
t
variable which is actually the time-coordinate variable for therecord
dimension. Because that coordinate/dim combination is not configured, it is not possible to use xarray time coordinate slicing on a SHOC standard dataset.It would be great if more (or all!) of the SHOC standard coordinate variables could be configured at https://github.com/csiro-coasts/emsarray/blob/7c1baf8c45380e6a2c8e85ab2ee7fe4650c8d9e2/src/emsarray/conventions/shoc.py#L44