dask / dask-expr

BSD 3-Clause "New" or "Revised" License
82 stars 21 forks source link

Add another optimization step to accommodate ops that require expensive meta information (e.g. ``divisions``) #181

Open phofl opened 1 year ago

phofl commented 1 year ago

Matt and I chatted offline about #166 and potential traps when implementing set_index.

Generally, we shouldn't use divisions while simplifying the Expression tree and pushing stuff up and down (I'll open another issue about how we can avoid this partially with align-like stuff). Computing divisions might be expensive, accessing them multiple times for a set_index/sort_values/... call will trigger multiple computations. So it's better to avoid them whenever possible.

Some methods will require information about the divisions to select the best algorithm (merge as an example). We might pass a _simplify_down step multiple times depending on the structure of our Expression tree, which would potentially require re-computing the divisions each time that merge is hit by this.

We can avoid this conflict through adding another optimization step that comes after simplify, lower for example. This step can accommodate optimizations that require information that are relatively expensive to obtain (like divisions). Ideally we avoid multiple passes in this case.

We should keep in mind that we need a more complex structure in the future, which would require us to generalize the different optimization steps.

cc @rjzamora

rjzamora commented 1 year ago

I started digesting the discussion in https://github.com/dask-contrib/dask-expr/pull/166, and fell down a bit of a rabbit hole myself :)

This problem is reminding me of the fact that I was originally hoping to free Expr objects from needing to track both npartitions and divisions information. I eventually gave up on this idea as “too extreme”, but it may not be too late to reconsider.

For example, if we are splitting the optimization pass into different kinds of Expr changes, then it may make sense to roughly split into the following separate passes:

(can stop here if you don't need to generate a graph)

I will probably change my mind about this, just sharing my current thoughts :)