cubed-dev / cubed-xarray

Interface for using cubed with xarray
Apache License 2.0
26 stars 3 forks source link

Failure to inline #26

Open rabernat opened 1 month ago

rabernat commented 1 month ago

This is as minimal as I could make this. I can't reproduce it by just creating arrays from scratch, has to be loaded from real data afaict.

Just cubed, without xarray

import cubed
spec = cubed.Spec(
    allowed_mem=2000000000,
)

url1 = "s3://cmip6-pds/CMIP6/ScenarioMIP/IPSL/IPSL-CM6A-LR/ssp585/r1i1p1f1/6hrLev/ua/gr/v20190119/"

raw_array = cubed.from_zarr(url1 + "ua", spec=spec)
subset = raw_array[:, 10, 20, 30]
subset.visualize()

image

As you can see, the "from_zarr" and "getitem" get inlined into a single task. Good

Now with Xarray

ds = xr.open_dataset(url1, engine="zarr", chunked_array_type="cubed", from_array_kwargs={"spec": spec}, chunks={})
ds_subset = ds.isel(lat=20, lon=30, presnivs=10)
ds_subset.ua.data.visualize()

image

No inline, twice the number of tasks.

TomNicholas commented 1 month ago

Thanks for reporting this.

Not totally sure, but I suspect the behavior difference is because currently xr.open_dataset(engine='zarr') ultimately calls cubed.from_array on the lazily indexed xarray variable to create the cubed array, not cubed.from_zarr (the .from_array() call is inside Variable.chunk()). So these two code snippets don't end up calling the same cubed functions.

Interestingly dask.array.from_zarr also exists, but like cubed.from_zarr is also never called anywhere inside xarray! Maybe we could improve xarray's zarr backend by using this? You might know better than me about that though @rabernat .

tomwhite commented 1 month ago

Thanks for the reproducible example - I managed to run it fine to see what is going on.

As Tom says, Cubed is not seeing the raw Zarr file. If Cubed's from_array was given a zarr.Array instance then it would take a fast path, but on this line

https://github.com/cubed-dev/cubed/blob/d24f83bbab7032e5c54072fdb7af4430d933dd99/cubed/core/ops.py#L57

x is Xarray's ImplicitToExplicitIndexingAdapter, so it falls through to using map_blocks.

This in itself wouldn't be a problem, but the selection (indexing) is implemented uses a Cubed map_direct primitive, which can't be fused (see https://github.com/cubed-dev/cubed/issues/464), which is why you aren't seeing the inlining that you would expect.

For simple slices like these we could implement indexing using a simple map_blocks (the general case is harder since slices can cross chunk boundaries). This would be worth doing as it's such a common thing to do.

rabernat commented 1 month ago

Thanks Toms! Thoughts on the best place in the stack to fix this?

For context, this is a killer application for cubed: fast, serverless subsetting of big Zarr datasets.

TomNicholas commented 1 month ago

For simple slices like these we could implement indexing using a simple map_blocks

Sounds like this would fix your use case Ryan, and it's an optimization in cubed that should be done anyway.

If Cubed's from_array was given a zarr.Array instance

x is Xarray's ImplicitToExplicitIndexingAdapter

Presumably we can't just bypass the ImplicitToExplicitIndexingAdapter wrapping in xarray else some other lazy indexing feature will be lost? This would be a divergence from how it currently works for dask.

tomwhite commented 1 month ago

I've created a fix in https://github.com/cubed-dev/cubed/pull/586, which I ran against your example @rabernat and it fuses as expected now:

cubed2

I'll merge it later, and I could do a release too if that's helpful.