dask / dask-expr

BSD 3-Clause "New" or "Revised" License
83 stars 22 forks source link

Integrate Dask Arrays properly #446

Open phofl opened 10 months ago

phofl commented 10 months ago

We can't currently go from DataFrames to arrays, #445 adds to_dask_array but this is only a bandaid for now.

I think in an ideal world we have an Array Collection that captures something like .values, then you can do array computations and go back to a DataFrame collection. Downside is that an Array collection without methods is not very helpful

I am not planning on working on this immediately, but wanted to collect thoughts on this topic

cc @mrocklin @rjzamora

fjetter commented 10 months ago

Mid term I would also like to have arrays be implemented with the expression system, if only to get rid of HLGs entirely

mrocklin commented 10 months ago

I'm curious how much benefit a symbolic Dask array helps xarray. Probably less than an expression system for xarray itself, but how much less? Cc @dcherian who can maybe help is think through this.

For example, probably the biggest benefit is more intelligent rechunking. Are chunks stored in the xarray data model or are they just given to the underlying duck arrays?

dcherian commented 10 months ago

how much benefit a symbolic Dask array helps xarray. Probably less than an expression system for xarray itself, but how much less?

A question I have struggled with. I don't think we really know without actually trying it out. To me, it seems like GroupBy is where an xarray level system makes a lot of sense.

probably the biggest benefit is more intelligent rechunking.

Absolutely, specifically we'd want to be setting read-time chunk sizes that are adapted to the workload.

Are chunks stored in the xarray data model or are they just given to the underlying duck arrays?

Just given to and taken from the underlying duck array. so a expression based dask array would be easy to slot in.

mrocklin commented 10 months ago

For groupby stuff my recollection is that those optimizations are pretty local, right? You can maybe do that today with a little bit of logic in the current xarray code base? Maybe like how pandas does groupby?

mrocklin commented 10 months ago

Mostly, my thought is that if we target the expression layer at Dask array it's maybe more likely to be done (y'all seem busy) but I'm nervous about not capturing enough of the value.

If mostly all we care about is rechunking and if xarray can use systems like flox intelligently without a fancy expression system then we're good. If there are cases when xarray specific knowledge is valuable then it might not make as much sense to prioritize Dask array in expression form.

fjetter commented 10 months ago

not make as much sense to prioritize Dask array in expression

For code complexity alone it makes sense to implement this. I don't think we have to be very sophisticated w.r.t optimizations but getting rid of HLGs would be a huge benefit for maintainability. For HLG replacement I believe we don't need much more than a blockwise expr and I suspect the Array class itself will be simpler since we don't have to deal with meta and divisions. I would really like to avoid ending up with three different systems, low level, hlg and expressions. Arrays are still lagging behind in HLG adoption which is already hurting us and I believe we should not make the same mistake again. I'd like to nuke HLGs in the next couple of months

mrocklin commented 10 months ago

Fine by me

On Mon, Dec 4, 2023 at 12:50 AM Florian Jetter @.***> wrote:

not make as much sense to prioritize Dask array in expression

For code complexity alone it makes sense to implement this. I don't think we have to be very sophisticated w.r.t optimizations but getting rid of HLGs would be a huge benefit for maintainability. For HLG replacement I believe we don't need much more than a blockwise expr and I suspect the Array class itself will be simpler since we don't have to deal with meta and divisions. I would really like to avoid ending up with three different systems, low level, hlg and expressions. Arrays are still lagging behind in HLG adoption which is already hurting us and I believe we should not make the same mistake again. I'd like to nuke HLGs in the next couple of months

— Reply to this email directly, view it on GitHub https://github.com/dask-contrib/dask-expr/issues/446#issuecomment-1838090095, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACKZTHPGWXEDAG3YRY2E2LYHWFHFAVCNFSM6AAAAABABTCOMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMZYGA4TAMBZGU . You are receiving this because you were mentioned.Message ID: @.***>

dcherian commented 10 months ago

If there are cases when xarray specific knowledge is valuable then it might not make as much sense to prioritize Dask array in expression form.

To me, It feels like most of the value is in automatically chunking for the user, and removing that knob from most workloads. And today Xarray is engineered so that chunks are user-specified and provided to dask.array.

How easy would it be to prototype an expression system for Xarray? a few hours at AGU?

mrocklin commented 10 months ago

To me, It feels like most of the value is in automatically chunking for the user, and removing that knob from most workloads. And today Xarray is engineered so that chunks are user-specified and provided to dask.array.

How easy would it be to prototype an expression system for Xarray? a few hours at AGU?

That sounds possible-but-optimistic. I'd have more confidence in prototyping a numpy/dask.array system and dropping it into xarray.

Given your first answer above I'm inclined to focus more on dask.array anyway rather than xarray. Especially if the following is true:

If mostly all we care about is rechunking and if xarray can use systems like flox intelligently without a fancy expression system then we're good

Do you need a full expression system to do groupbys well? If the answer is "no" and it's easy to do a little light rewriting then my sense is that we don't pursue an xarray expression system, and just do arrays. Although maybe there are other reasons to have high level expressions for xarray, like if Earthmover wanted to record what queries were common within a customerbase.

dcherian commented 10 months ago

Do you need a full expression system to do groupbys well? If the answer is "no" and it's easy to do a little light rewriting then my sense is that we don't pursue an xarray expression system, and just do arrays.

I think the answer is "probably not". I made a proposal for these higher level objects that can return "preferred chunks". We would still need to propagate these chunking heuristics down to read-time, but that seems like it could be done at the array level.

Sounds like you'd want to migrate dask.array to an expression system anyway, so let's prototype that and see where we get?

mrocklin commented 10 months ago

Looking through the API, I'm guessing that our hierarchy is mostly based on the following:

There are some other outliers like the following:

I'll bet that an initial implementation focuses on from_array, blockwise, reductions, and slicing. I think that we can deliver a lot of functionality with those.

Of course, blockwise and slicing are both pretty hairy and have, I suspect, bus factors of zero today. Someone (maybe me) probably has to go and do those first. Then it's probably easy for other people to come on afterwards.

dcherian commented 10 months ago

I wonder if @shoyer has thoughts on the advantages of building an expression system on xarray rather than dask.array. Such a thing would seem to resemble xarray-beam in some ways.

It seems like the classic issues

are solvable at the dask.array level.

tomwhite commented 10 months ago

Very interested to see this work! A couple of questions and thoughts (based on my work in Cubed):

  1. Is it a goal (or possible) for the expression system to be usable on alternative execution engines (in the way Cubed does), like Beam, Lithops, or Modal?
  2. Could the expression system model projected memory requirements? This seems like the right level to do that, and could be very valuable for improving the overall user experience.
mrocklin commented 10 months ago

Is it a goal (or possible) for the expression system to be usable on alternative execution engines (in the way Cubed does), like Beam, Lithops, or Modal?

It is not a goal of mine (I mostly just work on Dask and Coiled) but I don't think it would be hard to do. I recommend looking at the current dataframe implementation. Much of the code is about modeling the pandas API, then there are some methods/protocols like _layer which add Dask logic. One could easily add in a _pandas_apply method everywhere for example to run things on pandas operations.

To be clear, I have no intention of spending energy to support other systems, but if other people want to come in and collaborate I'm sure we could make space for them.

Could the expression system model projected memory requirements? This seems like the right level to do that, and could be very valuable for improving the overall user experience

We know the size of every chunk and the dtype, so presumably yes, on a per-chunk basis.