AI4S2S / s2spy

A high-level python package integrating expert knowledge and artificial intelligence to boost (sub) seasonal forecasting
https://ai4s2s.readthedocs.io/
Apache License 2.0

Storing resampled data issue due to pd.Interval dtype #79

Closed semvijverberg closed 2 years ago

semvijverberg commented 2 years ago

When I try to store a resampled xr.DataArray:

image

I get the following error:

image

For completeness, this is the whole dataset: image

geek-yang commented 2 years ago

Found some similar issues after a bit of googling, e.g. https://stackoverflow.com/questions/54848671/using-xarray-groupby-bins-results-in-coordinates-that-are-an-object-and-cant-be (apparently we are not the only ones). There is already an xarray issue about supporting interval indexes: https://github.com/pydata/xarray/issues/2847

Currently in our implementation, the interval datatype is object, which is not supported. But from the discussions in those issues, it seems xarray does support certain interval types. Maybe we can try to play a bit with different interval types generated by pandas? (e.g. https://pandas.pydata.org/docs/reference/api/pandas.interval_range.html)
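To make the dtype point concrete, here is a minimal sketch. It assumes the intervals look roughly like what `pd.interval_range` generates; the dates and the 7-day frequency are made up for illustration and are not the actual calendar output. pandas keeps a dedicated interval dtype on the index, but once the values are turned into a plain NumPy array they become generic Python objects.

```python
import pandas as pd

# Hypothetical intervals for illustration; not the actual s2spy calendar output.
intervals = pd.interval_range(
    start=pd.Timestamp("2020-01-01"), periods=3, freq="7D", closed="left"
)

# On the pandas side the index carries a dedicated interval dtype.
print(intervals.dtype)

# Converted to a plain NumPy array, each element is a pd.Interval object,
# so the array dtype falls back to `object`.
as_numpy = intervals.to_numpy()
print(as_numpy.dtype)
```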

BSchilperoort commented 2 years ago

Seeing as we do not really use the intervals ourselves after applying the calendar, we can just decide to not include them in the DataArray.

If we do want to keep them, we can split the intervals up into two coordinates: interval_left and interval_right (or interval_start and interval_end). These will just be timestamps and can therefore be stored in a netCDF file.
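A rough sketch of the two-coordinate idea, assuming the intervals are available as a `pd.IntervalIndex`. The variable name `tp`, the dimension name `index`, the dates and the `interval_closed` attribute are placeholders for illustration, not names taken from the package.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical intervals and data, for illustration only.
intervals = pd.interval_range(
    start=pd.Timestamp("2020-01-01"), periods=3, freq="7D", closed="left"
)
da = xr.DataArray(
    np.arange(3.0), dims="index", coords={"index": np.arange(3)}, name="tp"
)

# Store the edges as two plain datetime coordinates instead of pd.Interval objects.
da = da.assign_coords(
    interval_left=("index", intervals.left.to_numpy()),
    interval_right=("index", intervals.right.to_numpy()),
)

# The closed side is lost in the split, so keep it as an attribute (an assumption).
da.attrs["interval_closed"] = intervals.closed
```

Both new coordinates have a datetime64 dtype rather than object, which is the property that matters for writing the array to netCDF.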

Peter9192 commented 2 years ago

Perhaps we could convert them to normal times (either 'left', 'center', or 'right'), and include a bounds array of shape (n, 2), where n is the length of the calendar. This is how it's done in the CF/CMOR standards.

Peter9192 commented 2 years ago

See https://cfconventions.org/cf-conventions/cf-conventions.html#cell-boundaries and e.g. https://cf-xarray.readthedocs.io/en/latest/bounds.html?highlight=bounds
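A minimal sketch of what that could look like, assuming the left edge is used as the time label. The names `tp`, `time_bounds` and `bnds` and the dates are placeholders chosen for illustration; the `bounds` attribute on the coordinate is what CF uses to point at the boundary variable.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical intervals and data, for illustration only.
intervals = pd.interval_range(
    start=pd.Timestamp("2020-01-01"), periods=3, freq="7D", closed="left"
)
left = intervals.left.to_numpy()
right = intervals.right.to_numpy()

# One row per interval, two columns (start, end): shape (n, 2).
bounds = np.stack([left, right], axis=1)

ds = xr.Dataset(
    data_vars={
        "tp": ("time", np.arange(3.0)),
        "time_bounds": (("time", "bnds"), bounds),
    },
    # Label each interval by its left edge and point to the bounds variable.
    coords={"time": ("time", left, {"bounds": "time_bounds"})},
)
```

Everything in `ds` is either float or datetime64, so no pd.Interval objects are left for the netCDF writer to choke on.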

geek-yang commented 2 years ago

Just some interesting things to share after playing with it. When exporting to netCDF format, we can choose the encoding for a given coordinate, e.g. tp_aggr.to_netcdf("./tp_aggr.nc", encoding={"interval":{"dtype":'<U8'}}). Given that xarray automatically changes the saved pandas intervals to object type after resampling, I tried to set the dtype manually with tp_aggr.to_netcdf("./tp_aggr.nc", encoding={"time":{"dtype":pd.IntervalDtype(subtype='datetime64[ns]', closed='both')}}), but xarray still complains; apparently it doesn't support pandas intervals.

I think what @Peter9192 suggests is the best solution. It is also similar to the solution given by an xarray developer in their issue: https://github.com/pydata/xarray/issues/2847#issuecomment-475918645