Open erratic-pattern opened 1 week ago
Benchmark results between the two PRs and main
Here are my measurements on my gcp machine (it does seem to help by 1-2%)
``` ++ critcmp main loop-expr-simplifier-static-dispatch group loop-expr-simplifier-static-dispatch main ----- ------------------------------------ ---- logical_aggregate_with_join 1.00 1211.4±14.74µs ? ?/sec 1.00 1214.7±64.80µs ? ?/sec logical_plan_tpcds_all 1.00 156.4±1.31ms ? ?/sec 1.01 158.4±2.20ms ? ?/sec logical_plan_tpch_all 1.00 16.9±0.19ms ? ?/sec 1.00 16.9±0.17ms ? ?/sec logical_select_all_from_1000 1.00 18.6±0.17ms ? ?/sec 1.01 18.8±0.13ms ? ?/sec logical_select_one_from_700 1.00 814.0±10.94µs ? ?/sec 1.00 815.5±32.65µs ? ?/sec logical_trivial_join_high_numbered_columns 1.00 758.0±10.12µs ? ?/sec 1.00 759.1±19.82µs ? ?/sec logical_trivial_join_low_numbered_columns 1.00 743.0±8.02µs ? ?/sec 1.01 749.9±9.83µs ? ?/sec physical_plan_tpcds_all 1.00 1336.8±8.42ms ? ?/sec 1.01 1354.8±8.50ms ? ?/sec physical_plan_tpch_all 1.00 90.2±1.61ms ? ?/sec 1.03 93.1±1.37ms ? ?/sec physical_plan_tpch_q1 1.00 4.9±0.07ms ? ?/sec 1.06 5.1±0.09ms ? ?/sec physical_plan_tpch_q10 1.00 4.3±0.06ms ? ?/sec 1.02 4.4±0.08ms ? ?/sec physical_plan_tpch_q11 1.00 3.9±0.05ms ? ?/sec 1.02 4.0±0.06ms ? ?/sec physical_plan_tpch_q12 1.00 3.1±0.07ms ? ?/sec 1.00 3.1±0.08ms ? ?/sec physical_plan_tpch_q13 1.00 2.1±0.03ms ? ?/sec 1.04 2.2±0.06ms ? ?/sec physical_plan_tpch_q14 1.00 2.7±0.05ms ? ?/sec 1.06 2.8±0.05ms ? ?/sec physical_plan_tpch_q16 1.00 3.7±0.08ms ? ?/sec 1.04 3.8±0.09ms ? ?/sec physical_plan_tpch_q17 1.00 3.5±0.05ms ? ?/sec 1.02 3.6±0.06ms ? ?/sec physical_plan_tpch_q18 1.00 3.9±0.05ms ? ?/sec 1.00 3.9±0.07ms ? ?/sec physical_plan_tpch_q19 1.00 6.0±0.07ms ? ?/sec 1.03 6.2±0.08ms ? ?/sec physical_plan_tpch_q2 1.00 7.7±0.10ms ? ?/sec 1.02 7.9±0.06ms ? ?/sec physical_plan_tpch_q20 1.00 4.5±0.07ms ? ?/sec 1.04 4.7±0.09ms ? ?/sec physical_plan_tpch_q21 1.00 6.1±0.07ms ? ?/sec 1.04 6.3±0.06ms ? ?/sec physical_plan_tpch_q22 1.00 3.3±0.06ms ? ?/sec 1.04 3.4±0.07ms ? ?/sec physical_plan_tpch_q3 1.00 3.1±0.05ms ? ?/sec 1.01 3.1±0.06ms ? ?/sec physical_plan_tpch_q4 1.00 2.3±0.05ms ? ?/sec 1.02 2.3±0.05ms ? ?/sec physical_plan_tpch_q5 1.00 4.5±0.07ms ? ?/sec 1.02 4.5±0.07ms ? ?/sec physical_plan_tpch_q6 1.00 1538.7±18.47µs ? ?/sec 1.03 1586.4±21.54µs ? ?/sec physical_plan_tpch_q7 1.00 5.7±0.08ms ? ?/sec 1.01 5.7±0.13ms ? ?/sec physical_plan_tpch_q8 1.00 7.3±0.10ms ? ?/sec 1.01 7.4±0.08ms ? ?/sec physical_plan_tpch_q9 1.00 5.6±0.09ms ? ?/sec 1.00 5.6±0.08ms ? ?/sec physical_select_all_from_1000 1.00 60.6±0.25ms ? ?/sec 1.01 61.4±0.29ms ? ?/sec physical_select_one_from_700 1.00 3.6±0.05ms ? ?/sec 1.02 3.7±0.04ms ? ?/sec ```
My second run was consistent:
``` ++ critcmp main loop-expr-simplifier-static-dispatch group loop-expr-simplifier-static-dispatch main ----- ------------------------------------ ---- logical_aggregate_with_join 1.00 1213.6±9.49µs ? ?/sec 1.01 1220.2±37.22µs ? ?/sec logical_plan_tpcds_all 1.00 158.8±1.65ms ? ?/sec 1.01 160.0±1.55ms ? ?/sec logical_plan_tpch_all 1.00 16.9±0.21ms ? ?/sec 1.00 16.9±0.20ms ? ?/sec logical_select_all_from_1000 1.00 18.8±0.12ms ? ?/sec 1.00 18.7±0.10ms ? ?/sec logical_select_one_from_700 1.00 809.7±10.72µs ? ?/sec 1.00 812.4±11.32µs ? ?/sec logical_trivial_join_high_numbered_columns 1.00 756.6±7.03µs ? ?/sec 1.01 761.9±7.95µs ? ?/sec logical_trivial_join_low_numbered_columns 1.00 742.9±13.09µs ? ?/sec 1.01 748.3±14.13µs ? ?/sec physical_plan_tpcds_all 1.00 1337.9±9.37ms ? ?/sec 1.01 1346.9±8.49ms ? ?/sec physical_plan_tpch_all 1.00 92.2±1.47ms ? ?/sec 1.00 92.0±1.27ms ? ?/sec physical_plan_tpch_q1 1.00 5.0±0.09ms ? ?/sec 1.03 5.1±0.07ms ? ?/sec physical_plan_tpch_q10 1.00 4.4±0.11ms ? ?/sec 1.00 4.4±0.08ms ? ?/sec physical_plan_tpch_q11 1.00 3.9±0.06ms ? ?/sec 1.02 4.0±0.08ms ? ?/sec physical_plan_tpch_q12 1.00 3.1±0.06ms ? ?/sec 1.01 3.1±0.05ms ? ?/sec physical_plan_tpch_q13 1.00 2.1±0.04ms ? ?/sec 1.00 2.1±0.04ms ? ?/sec physical_plan_tpch_q14 1.00 2.7±0.05ms ? ?/sec 1.04 2.8±0.06ms ? ?/sec physical_plan_tpch_q16 1.00 3.7±0.08ms ? ?/sec 1.03 3.8±0.07ms ? ?/sec physical_plan_tpch_q17 1.01 3.6±0.07ms ? ?/sec 1.00 3.6±0.07ms ? ?/sec physical_plan_tpch_q18 1.00 4.0±0.07ms ? ?/sec 1.01 4.0±0.05ms ? ?/sec physical_plan_tpch_q19 1.00 6.2±0.07ms ? ?/sec 1.02 6.3±0.07ms ? ?/sec physical_plan_tpch_q2 1.00 7.8±0.09ms ? ?/sec 1.01 7.9±0.07ms ? ?/sec physical_plan_tpch_q20 1.00 4.6±0.08ms ? ?/sec 1.00 4.6±0.07ms ? ?/sec physical_plan_tpch_q21 1.00 6.2±0.08ms ? ?/sec 1.01 6.3±0.09ms ? ?/sec physical_plan_tpch_q22 1.00 3.4±0.08ms ? ?/sec 1.02 3.5±0.09ms ? ?/sec physical_plan_tpch_q3 1.00 3.1±0.07ms ? ?/sec 1.00 3.2±0.05ms ? ?/sec physical_plan_tpch_q4 1.00 2.3±0.06ms ? ?/sec 1.01 2.3±0.05ms ? ?/sec physical_plan_tpch_q5 1.00 4.4±0.06ms ? ?/sec 1.03 4.6±0.05ms ? ?/sec physical_plan_tpch_q6 1.00 1551.7±25.84µs ? ?/sec 1.03 1592.4±22.64µs ? ?/sec physical_plan_tpch_q7 1.00 5.7±0.06ms ? ?/sec 1.00 5.7±0.09ms ? ?/sec physical_plan_tpch_q8 1.00 7.4±0.08ms ? ?/sec 1.02 7.5±0.07ms ? ?/sec physical_plan_tpch_q9 1.00 5.7±0.09ms ? ?/sec 1.00 5.7±0.08ms ? ?/sec physical_select_all_from_1000 1.00 61.2±0.81ms ? ?/sec 1.00 61.3±0.33ms ? ?/sec physical_select_one_from_700 1.01 3.7±0.05ms ? ?/sec 1.00 3.6±0.04ms ? ?/sec ```
Other PR has been merged in: https://github.com/apache/datafusion/pull/10358
I think we can merge / rebase this PR now and mark it ready for review
I will take another look at this and see if I can clean it up a bit more.
Alright I think this is in a good state now. I added a dynamic dispatch API which I think could be useful if we ever add TreeNodeRewriter
impl for OptimizerRule
, then we could run the whole optimizer through this cycle logic.
I made some stuff public for doctests, but if we don't want to do that I can just mark the doctest as ignored
I removed the dynamic dispatch API for now because I think there is a better way to write it, but I can't really do that until we get to a point where we have a TreeNodeRewriter
impl for Arc<dyn OptimizerRule>
I think the logic is correct, although it takes me sometime to understand the difference between the first pass and other pass, but I did not have a better design about this.
I think the logic is correct, although it takes me sometime to understand the difference between the first pass and other pass, but I did not have a better design about this.
We need to figure out how many rewrites are in a cycle. Since there is no Vec
or other data structure, we cannot use a length, so we manually count and then record the cycle length at the end of the first pass.
Note that the dynamic dispatch API that I deleted from this PR does not have this problem, since we can simply set cycle_length
to vec.len()
.
You can make this code to not have a special first pass if you wanted to make it simpler to understand. If you check cycle_length.is_none()
in record_cycle_length
before setting the cycle_length
, you could call it at the end of each cycle and then you can just run everything in a single try_fold
loop. However, I wanted to avoid unneccessary conditional checks, so I run the first cycle, record the cycle length, the continue with the remaining cycles.
This makes the code and the API kind of complicated for something that should be relatively simple. But I found dynamic dispatch and additional conditional checks to have a significant performance regressions when we're operating over a small number of rewriters (only 3 rewriters here), so I wanted to make an API useable with static dispatch.
A dynamic dispatch API is more appropriate for the top-level optimizer because:
Which issue does this PR close?
Closes https://github.com/apache/datafusion/issues/1160.
Rationale for this change
This is a follow up to https://github.com/apache/datafusion/pull/10358 with a new approach that should short-circuit earlier. See previous discussion there.
What changes are included in this PR?
Are these changes tested?
yes
Are there any user-facing changes?
no