cubed-dev / cubed

Bounded-memory serverless distributed N-dimensional array processing
https://cubed-dev.github.io/cubed/
Apache License 2.0
116 stars 14 forks source link

Basic rechunking example #539

Open norlandrhagen opened 1 month ago

norlandrhagen commented 1 month ago

Working my way through understanding cubed / cubed-xarray.

I'm trying to get an example working of modifying the chunking of an Xarray dataset and writing it to Zarr. When I roundtrip the Zarr to and from Xarray, it seems like the chunking structure hasn't changed. Is using the .chunk method on an Xarray dataset with cubed viable or should I be using rechunk primitive?

Roundtrip example using Xarray + dask chunks


import xarray as xr 
from zarr.storage import TempStore

ts = TempStore('air_temp_dask.zarr')

ds = xr.tutorial.open_dataset('air_temperature', chunks={})
rds = ds.chunk({'time':1})
rds.to_zarr(ts, consolidated=True)

rtds = xr.open_zarr(ts, chunks={})
rtds

assert rtds.chunks == rds.chunks

Roundtrip example using Xarray + cubed

from cubed import Spec
import xarray as xr 
from zarr.storage import TempStore

ts = TempStore('air_temp_cubed.zarr')

spec = Spec(work_dir='tmp', allowed_mem='2GB')
ds = xr.tutorial.open_dataset('air_temperature', chunked_array_type='cubed',
     from_array_kwargs={'spec': spec},chunks={})

rds = ds.chunk({'time':1}, chunked_array_type="cubed")

# does compute need to be called?
# rds.compute()

rds.to_zarr(ts, consolidated=True, chunkmanager_store_kwargs={'from_array_kwargs': {'spec': spec} })

rtds = xr.open_zarr(ts, chunked_array_type='cubed',
     from_array_kwargs={'spec': spec},chunks={})

# This fails
assert rtds.chunks == rds.chunks

chunked dataset (rds):

image

roundtripped dataset (rtds):

image

🤞 this is an end-of-day brain implementation issue on my end.

TomNicholas commented 1 month ago

What you're doing is correct, you should be able to use all the Xarray syntax you normally would without called anything from Cubed directly.

Using the .chunk method on Xarray is supposed to be viable but there is at least one bug in cubed-xarray that was found earlier today (see PR on cubed-xarray). There may be another bug here!

Cubed-xarray is currently under-tested relative to Cubed alone.

TomNicholas commented 1 month ago

Okay so

> rds = ds.chunk({'time':1}, chunked_array_type="cubed")

You should not need to add chunked_array_type="cubed" here, it's supposed to automatically see that you're using cubed and assume you want to keep using cubed. I don't know why that's broken on xarray main but I was in the process of refactoring xarray's .chunk method anyway in https://github.com/pydata/xarray/pull/9286 and that seems to fix it 🤷‍♂️

I'm also able to reproduce your the bug with writing out the wrong chunks. However when I instead try writing out just one array using cubed.to_zarr I see the expected chunks, i.e.

cubed.to_zarr(rds['air'].data, ts)

Which at least means the bug is in xarray / cubed-xarray rather than in cubed.

norlandrhagen commented 1 month ago

Ah good to know it's probably a cubed-xarray bug. Would it be helpful to repost/move the issue there or cross ref it?

TomNicholas commented 1 month ago

Thanks for your patience! Cross-referencing it in a new issue on cubed-xarray could be helpful.

tomwhite commented 1 month ago

I tracked down the problem - this seems to fix it: https://github.com/pydata/xarray/pull/9326

tomwhite commented 1 month ago

The Xarray issue has been merged now so it might be worth seeing if it fixes your original issue @norlandrhagen.

norlandrhagen commented 1 month ago

Just tried with the main Xarray branch and it worked! 🎈 Thanks for the fix. Is it worth pinning xarray-cubed to and above the next release of Xarray?

tomwhite commented 1 month ago

Great!

Is it worth pinning xarray-cubed to and above the next release of Xarray?

That might be a good idea.

TomNicholas commented 1 month ago

Is it worth pinning xarray-cubed to and above the next release of Xarray?

Yes definitely. I'm also about to suggest we rename the ChunkManager to ComputeManger to better reflect it's updated responsibilities in light of https://github.com/pydata/xarray/pull/9286, which would be a breaking change for cubed-xarray.