cubed-dev / cubed

Bounded-memory serverless distributed N-dimensional array processing
https://cubed-dev.github.io/cubed/
Apache License 2.0

Pangeo TEM example #145

Closed: tomwhite closed this issue 1 month ago

tomwhite commented 1 year ago

This issue is to explore running the "Transformed Eulerian Mean diagnostic" example in https://github.com/dcherian/ncar-challenge-suite/blob/main/tem.ipynb using Cubed.

It uses Xarray, so needs https://github.com/pydata/xarray/pull/7019

tomwhite commented 1 year ago

Here's a jupyter notebook: https://github.com/tomwhite/cubed/blob/f5ece5b068db014f828bf6f3afcf6b05280af52a/examples/pangeo-tem.ipynb

There are a couple of pieces missing. The first is fairly minor: mean doesn't work yet with NaNs, so I've added skipna=False just to make progress. It's tracked in https://github.com/pydata/xarray/issues/7243.
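To make the skipna=False workaround concrete, here is a minimal NumPy sketch of the difference in semantics. It assumes that skipna=False corresponds to a plain mean that propagates NaNs, while the default NaN-skipping behaviour corresponds to a nanmean-style reduction; the array below is made up for illustration and is not taken from the notebook.

```python
import numpy as np

# Made-up data with one missing value.
data = np.array([1.0, np.nan, 3.0])

# Plain mean: any NaN propagates to the result (analogous to skipna=False).
propagating = np.mean(data)

# NaN-skipping mean: missing values are ignored (analogous to skipna=True).
skipping = np.nanmean(data)
```

So with skipna=False the notebook runs end to end, but any NaNs in the input would show up as NaNs in the output rather than being skipped.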

The main thing that's missing is a Cubed-compatible path for xarray groupby. I think it's probably worth using the Flox path in xarray here, but then that will require a Cubed implementation for xarray apply_ufunc, which is being tracked in #67, and has been started by @TomNicholas in #119. Does this sound right @dcherian, or is there a reason to start by supporting the default internal xarray groupby?

dcherian commented 1 year ago

Yes you'll want apply_ufunc anyway.

We'll also want to support Cubed in Flox, which should be mostly easy since I'm using dask primitives. The main issue is how to deal with the intermediates. At the moment, Flox uses dictionaries like this

{
    "groups": np.array([1, 2, 3, 4]),  # shape (ngroups,)
    "intermediates": (np.array([4, 5, 6, 7]), np.array([1, 1, 1, 1])),  # tuple (sum, counts), each with shape (..., ngroups)
}

for the mean reduction.
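As a rough illustration of how such dictionaries could flow through a chunked mean, here is a NumPy-only sketch. The function names (chunk_intermediates, combine, finalize) are hypothetical and are not Flox's API; the sketch also assumes that `groups` is sorted and that every label appears in it.

```python
import numpy as np

def chunk_intermediates(values, labels, groups):
    """Per-chunk partial result for a grouped mean: (sums, counts)."""
    # Position of each label within `groups` (assumes sorted, complete groups).
    idx = np.searchsorted(groups, labels)
    sums = np.bincount(idx, weights=values, minlength=len(groups))
    counts = np.bincount(idx, minlength=len(groups))
    return {"groups": groups, "intermediates": (sums, counts)}

def combine(a, b):
    """Merge two partial results that share the same `groups`."""
    (s1, c1), (s2, c2) = a["intermediates"], b["intermediates"]
    return {"groups": a["groups"], "intermediates": (s1 + s2, c1 + c2)}

def finalize(result):
    """Turn accumulated (sums, counts) into per-group means."""
    sums, counts = result["intermediates"]
    return sums / counts

groups = np.array([1, 2, 3, 4])
chunk1 = chunk_intermediates(
    np.array([4.0, 5.0, 6.0, 7.0]), np.array([1, 2, 3, 4]), groups
)
chunk2 = chunk_intermediates(
    np.array([2.0, 3.0, 4.0, 5.0]), np.array([1, 2, 3, 4]), groups
)
group_means = finalize(combine(chunk1, chunk2))
```

Here chunk1 reproduces the sums and counts shown in the dictionary above, and only the small (sums, counts) pairs need to travel between tasks.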

That said, for this particular problem I could write the groupby as an efficient indexing + reduction. That's probably better since we'll have a "pure array" version of the problem, which will be easy for experimentation and comparison across libraries.
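One way a groupby can be phrased as indexing plus a reduction is sketched below in plain NumPy. This is only an illustration under a strong assumption, namely that all groups have the same number of members; the helper name groupby_mean_by_indexing is hypothetical and the data is made up rather than taken from the TEM notebook.

```python
import numpy as np

def groupby_mean_by_indexing(values, labels):
    """Grouped mean via integer array indexing, assuming equal-sized groups."""
    # A stable sort makes the members of each group contiguous.
    order = np.argsort(labels, kind="stable")
    groups, counts = np.unique(labels, return_counts=True)
    if not np.all(counts == counts[0]):
        raise ValueError("this sketch assumes equal-sized groups")
    # Index array of shape (ngroups, group_size): row i lists members of group i.
    index = order.reshape(len(groups), counts[0])
    # Gather with integer array indexing, then reduce over the member axis.
    return groups, values[index].mean(axis=1)

labels = np.array([0, 1, 2, 0, 1, 2])
values = np.array([1.0, 2.0, 3.0, 5.0, 6.0, 7.0])
groups, means = groupby_mean_by_indexing(values, labels)
```

The gather and the reduction are both ordinary array operations, which is what makes this form easy to port across array libraries.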

tomwhite commented 1 year ago

Thanks @dcherian.

For intermediates, Cubed uses structured arrays (e.g. mean), which seems similar to Flox.
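For comparison with the Flox dictionaries, here is a minimal NumPy sketch of carrying a mean's intermediate state in a structured array. The field names ("n", "total") and the helper functions are assumptions made for illustration, not a copy of Cubed's implementation.

```python
import numpy as np

# Hypothetical intermediate dtype: each record holds a count and a running total.
mean_dtype = np.dtype([("n", np.int64), ("total", np.float64)])

def chunk_partial(chunk):
    """Reduce one chunk to a single (n, total) record."""
    out = np.zeros(1, dtype=mean_dtype)
    out["n"] = chunk.size
    out["total"] = chunk.sum()
    return out

def combine_partials(partials):
    """Combine several (n, total) records into one."""
    stacked = np.concatenate(partials)
    out = np.zeros(1, dtype=mean_dtype)
    out["n"] = stacked["n"].sum()
    out["total"] = stacked["total"].sum()
    return out

def finalize_mean(partial):
    """Compute the mean from the accumulated record."""
    return float(partial["total"][0] / partial["n"][0])

chunks = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]
combined = combine_partials([chunk_partial(c) for c in chunks])
overall_mean = finalize_mean(combined)
```

Both approaches carry the same information (a sum and a count); the structured array just keeps it in a single array-typed value.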

> That said, for this particular problem I could write the groupby as an efficient indexing + reduction.

Agreed, that would be good, although we'll probably need to add integer array indexing to Cubed. See https://github.com/data-apis/array-api/issues/177 for a bit of discussion about that from a standardisation point of view.

tomwhite commented 1 year ago

I've added integer array indexing (and take) to Cubed now (0b59b3241d06b2e57ba4011c538def5e245d7a5d), which should help for group by.
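For readers unfamiliar with the two operations, this small sketch shows how integer array indexing along one axis relates to take. It uses NumPy as a stand-in, and it is an assumption that Cubed's versions behave the same way on these inputs.

```python
import numpy as np

x = np.arange(12.0).reshape(3, 4)
indices = np.array([3, 0, 0])  # repeated and unordered indices are allowed

# Integer array indexing along the second axis...
by_indexing = x[:, indices]

# ...selects the same elements as take along that axis.
by_take = np.take(x, indices, axis=1)
```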

I've updated the example notebook at https://github.com/tomwhite/cubed/blob/5eb5f23c25c37ec8634eb35a71711f4eaaffd643/examples/pangeo-tem.ipynb (this needed a few xarray changes). It doesn't use Flox group by yet though.