dask / dask-expr

BSD 3-Clause "New" or "Revised" License
86 stars 24 forks source link

Expression rewriting options #3

Closed wence- closed 1 year ago

wence- commented 1 year ago

@rjzamora pointed me at this, and I had some thoughts around the term rewriting aspects.

peephole-based rewrite systems like matchpy are quite powerful, but do suffer from some flaws if local rewrites aren't always sending you downhill in the cost function landscape. To enumerate all possible expressions and pick the best one with backtracking is exponential in the number of rewrites and size of the expression. One also needs to be a little bit careful applying all rules that one doesn't have pairs of rules that undo each other's work.

An alternate approach, that may be too heavy-weight in this instance, is equality saturation using e-graphs. These are a compact datastructure for representing all combinations of rewritten terms at once (polynomial rather than exponential). I like the egg library which comes with a (rather underdocumented) python interface via snake-egg. Probably there is no need to consider it at the moment (since exploring the landscape of what the expression system looks like is more difficult than mechanically translating from one rewriting system to another).

mrocklin commented 1 year ago

Thanks for engaging @wence-

Some responses to your thoughts above:

  1. I don't really care about matchpy vs something else at this point. The main obstacle to getting something working, I think, is creating some abstract expression system (without any optimizations even) and constructing a doable migration path
  2. I think that rewrite rules, even applied greedily, will be fine for the optimizations that are most impactful (bubbling down column projections and filtering). If someone is going to build a huge/great optimization system around dask dataframes then sure, we should think about other systems. Before then though I think that we're in a premature architecture state. If y'all are planning to commit to something something big here then let me know and I'll engage.
  3. I've also heard good things about the egg library

Probably there is no need to consider it at the moment (since exploring the landscape of what the expression system looks like is more difficult than mechanically translating from one rewriting system to another)

Yeah, mostly I think I'm saying this. Also, I don't think that we're closing any doors at this point. Most of the work in this repository is orthogonal to the underlying expression system.

wence- commented 1 year ago

2. I think that rewrite rules, even applied greedily, will be fine for the optimizations that are most impactful (bubbling down column projections and filtering). If someone is going to build a huge/great optimization system around dask dataframes then sure, we should think about other systems. Before then though I think that we're in a premature architecture state. If y'all are planning to commit to something something big here then let me know and I'll engage.

I don't think we have anything planned right now. I was just noting some experience from expression rewriting in other contexts. Depending on the number and/or complexity of the rule set, debugging why things don't simplify as you might want can be difficult: peephole rewrites after very order-dependent. In any case, pretty sure this kind of thing can be bolted on afterwards.

mrocklin commented 1 year ago

Depending on the number and/or complexity of the rule set, debugging why things don't simplify as you might want can be difficult: peephole rewrites after very order-dependent

Yeah, again, agreeing in general, but in this specific case I don't think it'll matter much.

In any case, pretty sure this kind of thing can be bolted on afterwards

+1. I think that if you look through the code you'll find that almost none of this is matchpy specific. This isn't really the primary issue right now.

Closing for now to keep things tidy. Let's reopen in the future when this is more of an issue.

mrocklin commented 1 year ago

@wence- I should also say that if you want to explore this and try swapping out matchpy for egg I'd be excited to see that effort. I often close things unnecessarily because I interpret issues as "you should consider doing this work" rather than the "I'd be down to do this work if it's in scope". If you were in the latter mode and I shut you down then I apologize.