feat: run expression simplifier in a loop until a fixedpoint or 3 cycles

erratic-pattern commented 2 weeks ago

Which issue does this PR close?

Closes #1160.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

yes

Are there any user-facing changes?

no

erratic-pattern commented 2 weeks ago

The test failure here is caused by a bug in the log UDF simplifier.

I've filled an issue https://github.com/apache/datafusion/issues/10359 and submitted a PR https://github.com/apache/datafusion/pull/10360 to fix it.

alamb commented 2 weeks ago

My benchmark runs show this ma help q19 a bit. I'll rerun to be sure

++ critcmp main loop-expr-simplifier
group                                         loop-expr-simplifier                   main
-----                                         --------------------                   ----
logical_aggregate_with_join                   1.00  1211.9±11.64µs        ? ?/sec    1.00  1210.9±15.89µs        ? ?/sec
logical_plan_tpcds_all                        1.00    157.6±1.80ms        ? ?/sec    1.01    159.2±2.01ms        ? ?/sec
logical_plan_tpch_all                         1.00     16.9±0.21ms        ? ?/sec    1.00     17.0±0.20ms        ? ?/sec
logical_select_all_from_1000                  1.00     18.8±0.09ms        ? ?/sec    1.00     18.9±0.08ms        ? ?/sec
logical_select_one_from_700                   1.00   808.4±10.14µs        ? ?/sec    1.02    823.6±9.81µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00   765.1±11.51µs        ? ?/sec    1.00   763.0±11.30µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00   751.1±20.24µs        ? ?/sec    1.00   753.6±26.51µs        ? ?/sec
physical_plan_tpcds_all                       1.00   1348.1±9.21ms        ? ?/sec    1.01   1357.7±7.90ms        ? ?/sec
physical_plan_tpch_all                        1.01     93.8±0.79ms        ? ?/sec    1.00     92.6±1.47ms        ? ?/sec
physical_plan_tpch_q1                         1.00      5.0±0.10ms        ? ?/sec    1.03      5.2±0.06ms        ? ?/sec
physical_plan_tpch_q10                        1.00      4.4±0.04ms        ? ?/sec    1.01      4.4±0.06ms        ? ?/sec
physical_plan_tpch_q11                        1.00      3.9±0.05ms        ? ?/sec    1.04      4.1±0.07ms        ? ?/sec
physical_plan_tpch_q12                        1.06      3.3±0.04ms        ? ?/sec    1.00      3.1±0.07ms        ? ?/sec
physical_plan_tpch_q13                        1.00      2.1±0.03ms        ? ?/sec    1.02      2.2±0.04ms        ? ?/sec
physical_plan_tpch_q14                        1.00      2.8±0.05ms        ? ?/sec    1.02      2.8±0.06ms        ? ?/sec
physical_plan_tpch_q16                        1.00      3.6±0.07ms        ? ?/sec    1.05      3.8±0.07ms        ? ?/sec
physical_plan_tpch_q17                        1.00      3.5±0.05ms        ? ?/sec    1.02      3.6±0.06ms        ? ?/sec
physical_plan_tpch_q18                        1.00      4.0±0.06ms        ? ?/sec    1.00      4.0±0.05ms        ? ?/sec
physical_plan_tpch_q19                        1.11      7.0±0.07ms        ? ?/sec    1.00      6.3±0.11ms        ? ?/sec
physical_plan_tpch_q2                         1.00      7.8±0.06ms        ? ?/sec    1.02      7.9±0.12ms        ? ?/sec
physical_plan_tpch_q20                        1.00      4.5±0.06ms        ? ?/sec    1.03      4.7±0.07ms        ? ?/sec
physical_plan_tpch_q21                        1.00      6.2±0.07ms        ? ?/sec    1.01      6.2±0.08ms        ? ?/sec
physical_plan_tpch_q22                        1.00      3.4±0.06ms        ? ?/sec    1.01      3.5±0.07ms        ? ?/sec
physical_plan_tpch_q3                         1.00      3.2±0.06ms        ? ?/sec    1.03      3.3±0.07ms        ? ?/sec
physical_plan_tpch_q4                         1.00      2.3±0.03ms        ? ?/sec    1.02      2.3±0.05ms        ? ?/sec
physical_plan_tpch_q5                         1.00      4.5±0.06ms        ? ?/sec    1.02      4.6±0.09ms        ? ?/sec
physical_plan_tpch_q6                         1.00  1570.2±23.85µs        ? ?/sec    1.02  1596.3±26.21µs        ? ?/sec
physical_plan_tpch_q7                         1.00      5.8±0.06ms        ? ?/sec    1.02      5.9±0.07ms        ? ?/sec
physical_plan_tpch_q8                         1.00      7.3±0.07ms        ? ?/sec    1.02      7.4±0.12ms        ? ?/sec
physical_plan_tpch_q9                         1.00      5.4±0.07ms        ? ?/sec    1.05      5.7±0.07ms        ? ?/sec
physical_select_all_from_1000                 1.00     61.2±0.25ms        ? ?/sec    1.01     61.7±0.41ms        ? ?/sec
physical_select_one_from_700                  1.00      3.7±0.04ms        ? ?/sec    1.02      3.7±0.05ms        ? ?/sec

alamb commented 2 weeks ago

I reran benchmarks and this PR certatainly doesn't slow things down and maybe makes it marginally faster

++ critcmp main loop-expr-simplifier
group                                         loop-expr-simplifier                   main
-----                                         --------------------                   ----
logical_aggregate_with_join                   1.02  1222.1±13.04µs        ? ?/sec    1.00  1203.3±14.34µs        ? ?/sec
logical_plan_tpcds_all                        1.00    158.8±1.61ms        ? ?/sec    1.00    159.2±1.57ms        ? ?/sec
logical_plan_tpch_all                         1.00     17.0±0.17ms        ? ?/sec    1.00     17.0±0.22ms        ? ?/sec
logical_select_all_from_1000                  1.00     18.8±0.15ms        ? ?/sec    1.00     18.8±0.14ms        ? ?/sec
logical_select_one_from_700                   1.00   815.8±10.12µs        ? ?/sec    1.00    818.8±8.69µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00    762.9±8.56µs        ? ?/sec    1.00   759.7±26.05µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00   747.9±10.40µs        ? ?/sec    1.00   748.6±22.28µs        ? ?/sec
physical_plan_tpcds_all                       1.00   1337.1±9.07ms        ? ?/sec    1.02   1364.5±7.75ms        ? ?/sec
physical_plan_tpch_all                        1.00     91.0±1.13ms        ? ?/sec    1.03     93.8±1.39ms        ? ?/sec
physical_plan_tpch_q1                         1.00      5.0±0.08ms        ? ?/sec    1.03      5.1±0.08ms        ? ?/sec
physical_plan_tpch_q10                        1.00      4.4±0.06ms        ? ?/sec    1.00      4.4±0.07ms        ? ?/sec
physical_plan_tpch_q11                        1.01      4.0±0.07ms        ? ?/sec    1.00      3.9±0.06ms        ? ?/sec
physical_plan_tpch_q12                        1.00      3.1±0.05ms        ? ?/sec    1.02      3.2±0.08ms        ? ?/sec
physical_plan_tpch_q13                        1.00      2.1±0.03ms        ? ?/sec    1.00      2.1±0.03ms        ? ?/sec
physical_plan_tpch_q14                        1.00      2.7±0.06ms        ? ?/sec    1.02      2.7±0.05ms        ? ?/sec
physical_plan_tpch_q16                        1.00      3.7±0.06ms        ? ?/sec    1.04      3.8±0.06ms        ? ?/sec
physical_plan_tpch_q17                        1.00      3.6±0.07ms        ? ?/sec    1.02      3.6±0.09ms        ? ?/sec
physical_plan_tpch_q18                        1.00      3.9±0.06ms        ? ?/sec    1.02      4.0±0.06ms        ? ?/sec
physical_plan_tpch_q19                        1.00      6.1±0.06ms        ? ?/sec    1.03      6.3±0.08ms        ? ?/sec
physical_plan_tpch_q2                         1.00      7.7±0.07ms        ? ?/sec    1.02      7.9±0.07ms        ? ?/sec
physical_plan_tpch_q20                        1.00      4.5±0.08ms        ? ?/sec    1.04      4.7±0.07ms        ? ?/sec
physical_plan_tpch_q21                        1.00      6.1±0.09ms        ? ?/sec    1.02      6.2±0.09ms        ? ?/sec
physical_plan_tpch_q22                        1.00      3.4±0.06ms        ? ?/sec    1.03      3.5±0.07ms        ? ?/sec
physical_plan_tpch_q3                         1.00      3.1±0.05ms        ? ?/sec    1.02      3.2±0.06ms        ? ?/sec
physical_plan_tpch_q4                         1.00      2.3±0.04ms        ? ?/sec    1.00      2.3±0.04ms        ? ?/sec
physical_plan_tpch_q5                         1.00      4.4±0.07ms        ? ?/sec    1.02      4.5±0.06ms        ? ?/sec
physical_plan_tpch_q6                         1.00  1569.9±20.55µs        ? ?/sec    1.03  1612.7±22.24µs        ? ?/sec
physical_plan_tpch_q7                         1.00      5.7±0.08ms        ? ?/sec    1.02      5.8±0.06ms        ? ?/sec
physical_plan_tpch_q8                         1.00      7.4±0.17ms        ? ?/sec    1.01      7.4±0.08ms        ? ?/sec
physical_plan_tpch_q9                         1.00      5.5±0.07ms        ? ?/sec    1.01      5.6±0.06ms        ? ?/sec
physical_select_all_from_1000                 1.00     60.9±0.40ms        ? ?/sec    1.01     61.3±0.40ms        ? ?/sec
physical_select_one_from_700                  1.00      3.7±0.04ms        ? ?/sec    1.00      3.7±0.04ms        ? ?/sec

alamb commented 2 weeks ago

Marking as draft so it doesn't show up on the "ready to review list" as I think @erratic-pattern has some additional plans

erratic-pattern commented 2 weeks ago

I future proofed the naming a bit by renaming "iterations" to "cycles", because I can improve the algorithm a bit further to short-circuit mid-cycle and so we might later want to capture the actual "iteration" or "step" count

erratic-pattern commented 1 week ago

I've made a new algorithm for this that should in theory reduce the amount of work needed to be done by short-circuiting earlier once there is a consecutive sequence of unchanged expression trees equal to the number of rewriters. However, it ended up being harder than I thought to get actual performance improvements when comparing with local benchmarks. I tried several different approaches, but most were actually slower than the code that I have here in this PR.

I did eventually come up with something that could possibly compete with the simpler algorithm in this PR. You can compare the branch here

My guess is that the additional overhead of checking each iteration is actual significant when we are only running 3 rewriters. I think we would see bigger improvements in cases where the number of optimization rules is larger. Many of the approaches I tried required dynamic dispatch on the TreeNodeRewriters which was surprisingly a larger cost that I expected. The approach I ended up with avoids the dynamic dispatch which seems to be the main reason it's faster.

I would be interested in scaling this up to run across all of the OptimizationRules where we should see a bigger improvement. However, I don't think everything has fully migrated from try_optimize so we would have to wait for that, I think. Once that happens, it should be possible to generalize the new branch code to work with both OptimizationRules as well as TreeNodeRewriters .

erratic-pattern commented 1 week ago

Benchmark results https://github.com/apache/datafusion/pull/10386#issuecomment-2095082079

jayzhan211 commented 1 week ago

when we are only running 3 rewriters.

After reading the code in gurantee_rewriter, I think we can even move it into simplifier. Not sure why it is an independent rule 🤔

alamb commented 1 week ago

🚀

apache / datafusion