Open rabernat opened 1 month ago
Thanks for reporting this.
Not totally sure, but I suspect the behavior difference is because currently xr.open_dataset(engine='zarr')
ultimately calls cubed.from_array
on the lazily indexed xarray variable to create the cubed array, not cubed.from_zarr
(the .from_array()
call is inside Variable.chunk()
). So these two code snippets don't end up calling the same cubed functions.
Interestingly dask.array.from_zarr
also exists, but like cubed.from_zarr
is also never called anywhere inside xarray! Maybe we could improve xarray's zarr backend by using this? You might know better than me about that though @rabernat .
Thanks for the reproducible example - I managed to run it fine to see what is going on.
As Tom says, Cubed is not seeing the raw Zarr file. If Cubed's from_array
was given a zarr.Array
instance then it would take a fast path, but on this line
x
is Xarray's ImplicitToExplicitIndexingAdapter
, so it falls through to using map_blocks
.
This in itself wouldn't be a problem, but the selection (indexing) is implemented uses a Cubed map_direct
primitive, which can't be fused (see https://github.com/cubed-dev/cubed/issues/464), which is why you aren't seeing the inlining that you would expect.
For simple slices like these we could implement indexing using a simple map_blocks
(the general case is harder since slices can cross chunk boundaries). This would be worth doing as it's such a common thing to do.
Thanks Toms! Thoughts on the best place in the stack to fix this?
For context, this is a killer application for cubed: fast, serverless subsetting of big Zarr datasets.
For simple slices like these we could implement indexing using a simple
map_blocks
Sounds like this would fix your use case Ryan, and it's an optimization in cubed that should be done anyway.
If Cubed's
from_array
was given azarr.Array
instance
x
is Xarray'sImplicitToExplicitIndexingAdapter
Presumably we can't just bypass the ImplicitToExplicitIndexingAdapter
wrapping in xarray else some other lazy indexing feature will be lost? This would be a divergence from how it currently works for dask.
I've created a fix in https://github.com/cubed-dev/cubed/pull/586, which I ran against your example @rabernat and it fuses as expected now:
I'll merge it later, and I could do a release too if that's helpful.
This is as minimal as I could make this. I can't reproduce it by just creating arrays from scratch, has to be loaded from real data afaict.
Just cubed, without xarray
As you can see, the "from_zarr" and "getitem" get inlined into a single task. Good
Now with Xarray
No inline, twice the number of tasks.