COSIMA / cosima-cookbook

Framework for indexing and querying ocean-sea ice model output.
https://cosima-recipes.readthedocs.io/en/latest/
Apache License 2.0

Add time_bounds breaks serialisation #284

Closed aidanheerdegen closed 2 years ago

aidanheerdegen commented 2 years ago

The change in #111 adds the time_bounds variable as an attribute of the xarray.DataArray returned by cc.querying.getvar.

This breaks serialisation with a TypeError exception, as the attribute cannot be serialised:

```
TypeError: Invalid value for attr 'time_bounds': <xarray.DataArray 'time_bounds' (time: 3, nv: 2)>
dask.array<open_dataset-87e8951a30db46c718a161604d62f168time_bounds, shape=(3, 2), dtype=timedelta64[ns], chunksize=(1, 2), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object 1950-01-16 12:00:00 ... 1950-03-16 12:00:00
  * nv       (nv) float64 1.0 2.0
Attributes:
    long_name:  time axis boundaries
    calendar:   NOLEAP. For serialization to netCDF files, its value must be of one of the following types: str, Number, ndarray, number, list, tuple
```

Example:

```python
import cosima_cookbook as cc
import xarray as xr

session = cc.database.create_session()
control = '01deg_jra55v13_ryf9091'

temp = cc.querying.getvar(control, 'temp', session, n=1, frequency='1 monthly')

temp.to_netcdf('/tmp/tmp.nc')
```
Full stack trace:

```python
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [18], in
----> 1 temp.to_netcdf('/tmp/tmp.nc')

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.01/lib/python3.9/site-packages/xarray/core/dataarray.py:2846, in DataArray.to_netcdf(self, *args, **kwargs)
   2842 else:
   2843     # No problems with the name - so we're fine!
   2844     dataset = self.to_dataset()
-> 2846 return dataset.to_netcdf(*args, **kwargs)

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.01/lib/python3.9/site-packages/xarray/core/dataset.py:1900, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1897     encoding = {}
   1898 from ..backends.api import to_netcdf
-> 1900 return to_netcdf(
   1901     self,
   1902     path,
   1903     mode,
   1904     format=format,
   1905     group=group,
   1906     engine=engine,
   1907     encoding=encoding,
   1908     unlimited_dims=unlimited_dims,
   1909     compute=compute,
   1910     invalid_netcdf=invalid_netcdf,
   1911 )

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.01/lib/python3.9/site-packages/xarray/backends/api.py:1025, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1023 # validate Dataset keys, DataArray names, and attr keys/values
   1024 _validate_dataset_names(dataset)
-> 1025 _validate_attrs(dataset, invalid_netcdf=invalid_netcdf and engine == "h5netcdf")
   1027 try:
   1028     store_open = WRITEABLE_STORES[engine]

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.01/lib/python3.9/site-packages/xarray/backends/api.py:175, in _validate_attrs(dataset, invalid_netcdf)
   173 for variable in dataset.variables.values():
   174     for k, v in variable.attrs.items():
--> 175         check_attr(k, v, valid_types)

File /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.01/lib/python3.9/site-packages/xarray/backends/api.py:162, in _validate_attrs.<locals>.check_attr(name, value, valid_types)
   156     raise TypeError(
   157         f"Invalid name for attr: {name!r} must be a string for "
   158         "serialization to netCDF files"
   159     )
   161 if not isinstance(value, valid_types):
--> 162     raise TypeError(
   163         f"Invalid value for attr {name!r}: {value!r}. For serialization to "
   164         "netCDF files, its value must be of one of the following types: "
   165         f"{', '.join([vtype.__name__ for vtype in valid_types])}"
   166     )

TypeError: Invalid value for attr 'time_bounds': <xarray.DataArray 'time_bounds' (time: 3, nv: 2)>
dask.array<open_dataset-87e8951a30db46c718a161604d62f168time_bounds, shape=(3, 2), dtype=timedelta64[ns], chunksize=(1, 2), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object 1950-01-16 12:00:00 ... 1950-03-16 12:00:00
  * nv       (nv) float64 1.0 2.0
Attributes:
    long_name:  time axis boundaries
    calendar:   NOLEAP. For serialization to netCDF files, its value must be of one of the following types: str, Number, ndarray, number, list, tuple
```
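Until that's resolved, a user-side workaround is to strip attribute values that xarray can't serialise before calling to_netcdf. A minimal sketch (the helper name is mine, and the valid-types tuple is taken from the error message above, not from cookbook code):

```python
import numbers
import numpy as np
import xarray as xr

# The attribute value types to_netcdf accepts, per the error message
# above: str, Number, ndarray, number, list, tuple.
VALID_ATTR_TYPES = (str, numbers.Number, np.ndarray, np.number, list, tuple)

def drop_unserialisable_attrs(da):
    """Return a copy of `da` whose attrs keep only values that
    to_netcdf can serialise; nested DataArrays (like the injected
    time_bounds) are dropped."""
    out = da.copy()
    out.attrs = {k: v for k, v in da.attrs.items()
                 if isinstance(v, VALID_ATTR_TYPES)}
    return out
```

Calling e.g. `drop_unserialisable_attrs(temp).to_netcdf('/tmp/tmp.nc')` should then get past the attribute validation.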

Ping @wghuneke

aidanheerdegen commented 2 years ago

@angus-g Seems this was added at the request of @aekiss (#111).

I can see why time_bounds was added as an attribute: getvar returns an xarray.DataArray, so it can't simply be added as another variable.

Some possible fixes:

- Return an xarray.Dataset instead of an xarray.DataArray, so time_bounds can be included as a proper variable.
- Pre-process the attributes before serialisation, stripping or converting anything to_netcdf can't handle.

Related to the first option of returning an xarray.Dataset: I wrote some routines that pull a single variable out of a dataset while also keeping all related variables, i.e. any that are referenced as bounds or coordinates in the variable's attributes, or in the attributes of related variables:

https://github.com/coecms/splitvar/blob/master/splitvar/splitvar.py#L180-L257

If this were the preferred option, these functions could be repurposed.
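To picture the idea, here is a rough standalone sketch (illustrative only, not the actual splitvar code): pull one variable out as a Dataset and keep anything referenced via bounds or coordinates attributes, including bounds of the variable's coordinates such as time_bounds:

```python
import numpy as np
import xarray as xr

def extract_with_related(ds, name):
    """Sketch: return `name` from `ds` as a Dataset that also carries
    any variables referenced in `bounds`/`coordinates` attributes, so
    the result round-trips cleanly through to_netcdf."""
    keep = {name}
    attrs = ds[name].attrs
    for key in ("bounds", "coordinates"):
        for ref in str(attrs.get(key, "")).split():
            if ref in ds:
                keep.add(ref)
    # Also keep bounds referenced by the variable's coordinates
    # (this is how time_bounds hangs off the time coordinate).
    for coord in ds[name].coords.values():
        ref = coord.attrs.get("bounds")
        if ref in ds:
            keep.add(ref)
    return ds[list(keep)]
```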

angus-g commented 2 years ago

There's merit to all the options, or maybe even some combination of them. It seems easiest to just return an xarray.Dataset, but I'm not sure how much that would break things for people who aren't expecting it. Otherwise, direct serialisation of an xarray.DataArray has always struck me as a bit weirder than that of an xarray.Dataset, so in that case some kind of preprocessing seems possible.

Another possibility would be to de-xarray the time bounds, since xarray seems happy to serialise a bare ndarray attribute. I'm not sure if or how people are actually using the time bounds; maybe they would be inconvenient to work with without the extra metadata.
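For what it's worth, the de-xarray conversion would be a one-liner; a sketch with synthetic stand-in data showing what survives and what's lost:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for what getvar attaches: a DataArray of bounds.
bounds = xr.DataArray(np.zeros((3, 2)), dims=("time", "nv"),
                      attrs={"long_name": "time axis boundaries"})
temp = xr.DataArray(np.ones(3), dims=("time",), name="temp")
temp.attrs["time_bounds"] = bounds

# "De-xarray" the attribute: keep only the raw values. An ndarray is
# on xarray's list of serialisable attr types, but the time coordinate
# and the long_name/calendar metadata are lost in the process.
temp.attrs["time_bounds"] = temp.attrs["time_bounds"].values
```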

aidanheerdegen commented 2 years ago

I agree that returning an xarray.Dataset seems the most technically appropriate solution. I also don't know how much it would affect users, but thinking about it, it could be quite "breaky": most workflows work directly with the returned variable, and under this scenario they would have to select the variable out of the dataset. Not onerous, but so ubiquitous a change that it would be a real PITA.

I'm not a fan of turning the bounds into an ndarray: it "degrades" the utility of the data, stripping it of metadata and such for not much gain.

OK, how about this for a proposal: remove the automatic insertion of bounds into attributes, but add an option that lets users request that getvar return an xarray.Dataset with all related variables included. That round-trips fine through serialisation/deserialisation and gives users a way to get all the related data for a variable, which is arguably quite useful in its own right. At some point in the future this could become the default if it proved popular enough.
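To illustrate why the Dataset form round-trips: once the bounds are a sibling variable instead of an attribute, every attribute value is a plain type and to_netcdf's validation has nothing to reject. A standalone sketch with synthetic data (no cookbook involved; this is the shape of output the proposal describes, not actual getvar behaviour):

```python
import numpy as np
import xarray as xr

time = xr.DataArray([0.5, 1.5, 2.5], dims="time",
                    attrs={"bounds": "time_bounds"})
time_bounds = xr.DataArray(np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]),
                           dims=("time", "nv"))

# What a Dataset-returning getvar might hand back: the variable plus
# its bounds as a sibling variable, linked by the CF "bounds" attribute
# rather than stored as an attribute value.
ds = xr.Dataset({"temp": xr.DataArray(np.ones(3), dims="time"),
                 "time_bounds": time_bounds},
                coords={"time": time})

# No variable carries a DataArray-valued attribute, so the attr
# validation in to_netcdf passes.
assert all(not isinstance(v, xr.DataArray)
           for var in ds.variables.values()
           for v in var.attrs.values())
```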

aekiss commented 2 years ago

I like your last suggestion @aidanheerdegen. I use time bounds to fix the sea ice time axis but AFAIK it isn't widely used by others, so I don't think we should break everyone's code to support it. It seems much better to me to provide it as an option for anyone needing it.