Closed erratic-pattern closed 1 week ago
The test failure here is caused by a bug in the log
UDF simplifier.
I've filled an issue https://github.com/apache/datafusion/issues/10359 and submitted a PR https://github.com/apache/datafusion/pull/10360 to fix it.
My benchmark runs show this ma help q19 a bit. I'll rerun to be sure
++ critcmp main loop-expr-simplifier
group loop-expr-simplifier main
----- -------------------- ----
logical_aggregate_with_join 1.00 1211.9±11.64µs ? ?/sec 1.00 1210.9±15.89µs ? ?/sec
logical_plan_tpcds_all 1.00 157.6±1.80ms ? ?/sec 1.01 159.2±2.01ms ? ?/sec
logical_plan_tpch_all 1.00 16.9±0.21ms ? ?/sec 1.00 17.0±0.20ms ? ?/sec
logical_select_all_from_1000 1.00 18.8±0.09ms ? ?/sec 1.00 18.9±0.08ms ? ?/sec
logical_select_one_from_700 1.00 808.4±10.14µs ? ?/sec 1.02 823.6±9.81µs ? ?/sec
logical_trivial_join_high_numbered_columns 1.00 765.1±11.51µs ? ?/sec 1.00 763.0±11.30µs ? ?/sec
logical_trivial_join_low_numbered_columns 1.00 751.1±20.24µs ? ?/sec 1.00 753.6±26.51µs ? ?/sec
physical_plan_tpcds_all 1.00 1348.1±9.21ms ? ?/sec 1.01 1357.7±7.90ms ? ?/sec
physical_plan_tpch_all 1.01 93.8±0.79ms ? ?/sec 1.00 92.6±1.47ms ? ?/sec
physical_plan_tpch_q1 1.00 5.0±0.10ms ? ?/sec 1.03 5.2±0.06ms ? ?/sec
physical_plan_tpch_q10 1.00 4.4±0.04ms ? ?/sec 1.01 4.4±0.06ms ? ?/sec
physical_plan_tpch_q11 1.00 3.9±0.05ms ? ?/sec 1.04 4.1±0.07ms ? ?/sec
physical_plan_tpch_q12 1.06 3.3±0.04ms ? ?/sec 1.00 3.1±0.07ms ? ?/sec
physical_plan_tpch_q13 1.00 2.1±0.03ms ? ?/sec 1.02 2.2±0.04ms ? ?/sec
physical_plan_tpch_q14 1.00 2.8±0.05ms ? ?/sec 1.02 2.8±0.06ms ? ?/sec
physical_plan_tpch_q16 1.00 3.6±0.07ms ? ?/sec 1.05 3.8±0.07ms ? ?/sec
physical_plan_tpch_q17 1.00 3.5±0.05ms ? ?/sec 1.02 3.6±0.06ms ? ?/sec
physical_plan_tpch_q18 1.00 4.0±0.06ms ? ?/sec 1.00 4.0±0.05ms ? ?/sec
physical_plan_tpch_q19 1.11 7.0±0.07ms ? ?/sec 1.00 6.3±0.11ms ? ?/sec
physical_plan_tpch_q2 1.00 7.8±0.06ms ? ?/sec 1.02 7.9±0.12ms ? ?/sec
physical_plan_tpch_q20 1.00 4.5±0.06ms ? ?/sec 1.03 4.7±0.07ms ? ?/sec
physical_plan_tpch_q21 1.00 6.2±0.07ms ? ?/sec 1.01 6.2±0.08ms ? ?/sec
physical_plan_tpch_q22 1.00 3.4±0.06ms ? ?/sec 1.01 3.5±0.07ms ? ?/sec
physical_plan_tpch_q3 1.00 3.2±0.06ms ? ?/sec 1.03 3.3±0.07ms ? ?/sec
physical_plan_tpch_q4 1.00 2.3±0.03ms ? ?/sec 1.02 2.3±0.05ms ? ?/sec
physical_plan_tpch_q5 1.00 4.5±0.06ms ? ?/sec 1.02 4.6±0.09ms ? ?/sec
physical_plan_tpch_q6 1.00 1570.2±23.85µs ? ?/sec 1.02 1596.3±26.21µs ? ?/sec
physical_plan_tpch_q7 1.00 5.8±0.06ms ? ?/sec 1.02 5.9±0.07ms ? ?/sec
physical_plan_tpch_q8 1.00 7.3±0.07ms ? ?/sec 1.02 7.4±0.12ms ? ?/sec
physical_plan_tpch_q9 1.00 5.4±0.07ms ? ?/sec 1.05 5.7±0.07ms ? ?/sec
physical_select_all_from_1000 1.00 61.2±0.25ms ? ?/sec 1.01 61.7±0.41ms ? ?/sec
physical_select_one_from_700 1.00 3.7±0.04ms ? ?/sec 1.02 3.7±0.05ms ? ?/sec
I reran benchmarks and this PR certatainly doesn't slow things down and maybe makes it marginally faster
++ critcmp main loop-expr-simplifier
group loop-expr-simplifier main
----- -------------------- ----
logical_aggregate_with_join 1.02 1222.1±13.04µs ? ?/sec 1.00 1203.3±14.34µs ? ?/sec
logical_plan_tpcds_all 1.00 158.8±1.61ms ? ?/sec 1.00 159.2±1.57ms ? ?/sec
logical_plan_tpch_all 1.00 17.0±0.17ms ? ?/sec 1.00 17.0±0.22ms ? ?/sec
logical_select_all_from_1000 1.00 18.8±0.15ms ? ?/sec 1.00 18.8±0.14ms ? ?/sec
logical_select_one_from_700 1.00 815.8±10.12µs ? ?/sec 1.00 818.8±8.69µs ? ?/sec
logical_trivial_join_high_numbered_columns 1.00 762.9±8.56µs ? ?/sec 1.00 759.7±26.05µs ? ?/sec
logical_trivial_join_low_numbered_columns 1.00 747.9±10.40µs ? ?/sec 1.00 748.6±22.28µs ? ?/sec
physical_plan_tpcds_all 1.00 1337.1±9.07ms ? ?/sec 1.02 1364.5±7.75ms ? ?/sec
physical_plan_tpch_all 1.00 91.0±1.13ms ? ?/sec 1.03 93.8±1.39ms ? ?/sec
physical_plan_tpch_q1 1.00 5.0±0.08ms ? ?/sec 1.03 5.1±0.08ms ? ?/sec
physical_plan_tpch_q10 1.00 4.4±0.06ms ? ?/sec 1.00 4.4±0.07ms ? ?/sec
physical_plan_tpch_q11 1.01 4.0±0.07ms ? ?/sec 1.00 3.9±0.06ms ? ?/sec
physical_plan_tpch_q12 1.00 3.1±0.05ms ? ?/sec 1.02 3.2±0.08ms ? ?/sec
physical_plan_tpch_q13 1.00 2.1±0.03ms ? ?/sec 1.00 2.1±0.03ms ? ?/sec
physical_plan_tpch_q14 1.00 2.7±0.06ms ? ?/sec 1.02 2.7±0.05ms ? ?/sec
physical_plan_tpch_q16 1.00 3.7±0.06ms ? ?/sec 1.04 3.8±0.06ms ? ?/sec
physical_plan_tpch_q17 1.00 3.6±0.07ms ? ?/sec 1.02 3.6±0.09ms ? ?/sec
physical_plan_tpch_q18 1.00 3.9±0.06ms ? ?/sec 1.02 4.0±0.06ms ? ?/sec
physical_plan_tpch_q19 1.00 6.1±0.06ms ? ?/sec 1.03 6.3±0.08ms ? ?/sec
physical_plan_tpch_q2 1.00 7.7±0.07ms ? ?/sec 1.02 7.9±0.07ms ? ?/sec
physical_plan_tpch_q20 1.00 4.5±0.08ms ? ?/sec 1.04 4.7±0.07ms ? ?/sec
physical_plan_tpch_q21 1.00 6.1±0.09ms ? ?/sec 1.02 6.2±0.09ms ? ?/sec
physical_plan_tpch_q22 1.00 3.4±0.06ms ? ?/sec 1.03 3.5±0.07ms ? ?/sec
physical_plan_tpch_q3 1.00 3.1±0.05ms ? ?/sec 1.02 3.2±0.06ms ? ?/sec
physical_plan_tpch_q4 1.00 2.3±0.04ms ? ?/sec 1.00 2.3±0.04ms ? ?/sec
physical_plan_tpch_q5 1.00 4.4±0.07ms ? ?/sec 1.02 4.5±0.06ms ? ?/sec
physical_plan_tpch_q6 1.00 1569.9±20.55µs ? ?/sec 1.03 1612.7±22.24µs ? ?/sec
physical_plan_tpch_q7 1.00 5.7±0.08ms ? ?/sec 1.02 5.8±0.06ms ? ?/sec
physical_plan_tpch_q8 1.00 7.4±0.17ms ? ?/sec 1.01 7.4±0.08ms ? ?/sec
physical_plan_tpch_q9 1.00 5.5±0.07ms ? ?/sec 1.01 5.6±0.06ms ? ?/sec
physical_select_all_from_1000 1.00 60.9±0.40ms ? ?/sec 1.01 61.3±0.40ms ? ?/sec
physical_select_one_from_700 1.00 3.7±0.04ms ? ?/sec 1.00 3.7±0.04ms ? ?/sec
Marking as draft so it doesn't show up on the "ready to review list" as I think @erratic-pattern has some additional plans
I future proofed the naming a bit by renaming "iterations" to "cycles", because I can improve the algorithm a bit further to short-circuit mid-cycle and so we might later want to capture the actual "iteration" or "step" count
I've made a new algorithm for this that should in theory reduce the amount of work needed to be done by short-circuiting earlier once there is a consecutive sequence of unchanged expression trees equal to the number of rewriters. However, it ended up being harder than I thought to get actual performance improvements when comparing with local benchmarks. I tried several different approaches, but most were actually slower than the code that I have here in this PR.
I did eventually come up with something that could possibly compete with the simpler algorithm in this PR. You can compare the branch here
My guess is that the additional overhead of checking each iteration is actual significant when we are only running 3 rewriters. I think we would see bigger improvements in cases where the number of optimization rules is larger. Many of the approaches I tried required dynamic dispatch on the TreeNodeRewriter
s which was surprisingly a larger cost that I expected. The approach I ended up with avoids the dynamic dispatch which seems to be the main reason it's faster.
I would be interested in scaling this up to run across all of the OptimizationRules
where we should see a bigger improvement. However, I don't think everything has fully migrated from try_optimize
so we would have to wait for that, I think. Once that happens, it should be possible to generalize the new branch code to work with both OptimizationRule
s as well as TreeNodeRewriter
s .
Benchmark results https://github.com/apache/datafusion/pull/10386#issuecomment-2095082079
when we are only running 3 rewriters.
After reading the code in gurantee_rewriter
, I think we can even move it into simplifier. Not sure why it is an independent rule 🤔
🚀
Which issue does this PR close?
Closes #1160.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
yes
Are there any user-facing changes?
no