feat: RewriteCycle API for short-circuiting optimizer loops

erratic-pattern commented 1 week ago

Which issue does this PR close?

Closes https://github.com/apache/datafusion/issues/1160.

Rationale for this change

This is a follow up to https://github.com/apache/datafusion/pull/10358 with a new approach that should short-circuit earlier. See previous discussion there.

What changes are included in this PR?

Are these changes tested?

yes

Are there any user-facing changes?

no

erratic-pattern commented 1 week ago

Benchmark results between the two PRs and main

alamb commented 1 week ago

Here are my measurements on my gcp machine (it does seem to help by 1-2%)

Details

``` ++ critcmp main group ----- logical_aggregate_with_join logical_plan_tpcds_all logical_plan_tpch_all logical_select_all_from_1000 logical_select_one_from_700 logical_trivial_join_h logical_trivial_join_l physical_plan_tpcds_all physical_plan_tpch_all physical_plan_tpch_q1 physical_plan_tpch_q10 physical_plan_tpch_q11 physical_plan_tpch_q12 physical_plan_tpch_q13 physical_plan_tpch_q14 physical_plan_tpch_q16 physical_plan_tpch_q17 physical_plan_tpch_q18 physical_plan_tpch_q19 physical_plan_tpch_q2 physical_plan_tpch_q20 physical_plan_tpch_q21 physical_plan_tpch_q22 physical_plan_tpch_q3 physical_plan_tpch_q4 physical_plan_tpch_q5 physical_plan_tpch_q6 physical_plan_tpch_q7 physical_plan_tpch_q8 physical_plan_tpch_q9 physical_select_all_from_1000 physical_select_one_from_700 ``` loop-expr-simplifier-static-dispatch loop-expr-simplifier-static-dispatch main ------------------------------------ ---- 1.00 1211.4±14.74µs ? ?/sec 1.00 1214.7±64.80µs ? ?/sec 1.00 156.4±1.31ms ? ?/sec 1.01 158.4±2.20ms ? ?/sec 1.00 16.9±0.19ms ? ?/sec 1.00 16.9±0.17ms ? ?/sec 1.00 18.6±0.17ms ? ?/sec 1.01 18.8±0.13ms ? ?/sec 1.00 814.0±10.94µs ? ?/sec 1.00 815.5±32.65µs ? ?/sec igh_numbered_columns 1.00 758.0±10.12µs ? ?/sec 1.00 759.1±19.82µs ? ?/sec ow_numbered_columns 1.00 743.0±8.02µs ? ?/sec 1.01 749.9±9.83µs ? ?/sec 1.00 1336.8±8.42ms ? ?/sec 1.01 1354.8±8.50ms ? ?/sec 1.00 90.2±1.61ms ? ?/sec 1.03 93.1±1.37ms ? ?/sec 1.00 4.9±0.07ms ? ?/sec 1.06 5.1±0.09ms ? ?/sec 1.00 4.3±0.06ms ? ?/sec 1.02 4.4±0.08ms ? ?/sec 1.00 3.9±0.05ms ? ?/sec 1.02 4.0±0.06ms ? ?/sec 1.00 3.1±0.07ms ? ?/sec 1.00 3.1±0.08ms ? ?/sec 1.00 2.1±0.03ms ? ?/sec 1.04 2.2±0.06ms ? ?/sec 1.00 2.7±0.05ms ? ?/sec 1.06 2.8±0.05ms ? ?/sec 1.00 3.7±0.08ms ? ?/sec 1.04 3.8±0.09ms ? ?/sec 1.00 3.5±0.05ms ? ?/sec 1.02 3.6±0.06ms ? ?/sec 1.00 3.9±0.05ms ? ?/sec 1.00 3.9±0.07ms ? ?/sec 1.00 6.0±0.07ms ? ?/sec 1.03 6.2±0.08ms ? ?/sec 1.00 7.7±0.10ms ? ?/sec 1.02 7.9±0.06ms ? ?/sec 1.00 4.5±0.07ms ? ?/sec 1.04 4.7±0.09ms ? ?/sec 1.00 6.1±0.07ms ? ?/sec 1.04 6.3±0.06ms ? ?/sec 1.00 3.3±0.06ms ? ?/sec 1.04 3.4±0.07ms ? ?/sec 1.00 3.1±0.05ms ? ?/sec 1.01 3.1±0.06ms ? ?/sec 1.00 2.3±0.05ms ? ?/sec 1.02 2.3±0.05ms ? ?/sec 1.00 4.5±0.07ms ? ?/sec 1.02 4.5±0.07ms ? ?/sec 1.00 1538.7±18.47µs ? ?/sec 1.03 1586.4±21.54µs ? ?/sec 1.00 5.7±0.08ms ? ?/sec 1.01 5.7±0.13ms ? ?/sec 1.00 7.3±0.10ms ? ?/sec 1.01 7.4±0.08ms ? ?/sec 1.00 5.6±0.09ms ? ?/sec 1.00 5.6±0.08ms ? ?/sec 1.00 60.6±0.25ms ? ?/sec 1.01 61.4±0.29ms ? ?/sec 1.00 3.6±0.05ms ? ?/sec 1.02 3.7±0.04ms ? ?/sec

alamb commented 1 week ago

My second run was consistent:

Details

``` ++ critcmp main loop-expr-simplifier-static-dispatch group loop-expr-simplifier-static-dispatch main ----- ------------------------------------ ---- logical_aggregate_with_join 1.00 1213.6±9.49µs ? ?/sec 1.01 1220.2±37.22µs ? ?/sec logical_plan_tpcds_all 1.00 158.8±1.65ms ? ?/sec 1.01 160.0±1.55ms ? ?/sec logical_plan_tpch_all 1.00 16.9±0.21ms ? ?/sec 1.00 16.9±0.20ms ? ?/sec logical_select_all_from_1000 1.00 18.8±0.12ms ? ?/sec 1.00 18.7±0.10ms ? ?/sec logical_select_one_from_700 1.00 809.7±10.72µs ? ?/sec 1.00 812.4±11.32µs ? ?/sec logical_trivial_join_high_numbered_columns 1.00 756.6±7.03µs ? ?/sec 1.01 761.9±7.95µs ? ?/sec logical_trivial_join_low_numbered_columns 1.00 742.9±13.09µs ? ?/sec 1.01 748.3±14.13µs ? ?/sec physical_plan_tpcds_all 1.00 1337.9±9.37ms ? ?/sec 1.01 1346.9±8.49ms ? ?/sec physical_plan_tpch_all 1.00 92.2±1.47ms ? ?/sec 1.00 92.0±1.27ms ? ?/sec physical_plan_tpch_q1 1.00 5.0±0.09ms ? ?/sec 1.03 5.1±0.07ms ? ?/sec physical_plan_tpch_q10 1.00 4.4±0.11ms ? ?/sec 1.00 4.4±0.08ms ? ?/sec physical_plan_tpch_q11 1.00 3.9±0.06ms ? ?/sec 1.02 4.0±0.08ms ? ?/sec physical_plan_tpch_q12 1.00 3.1±0.06ms ? ?/sec 1.01 3.1±0.05ms ? ?/sec physical_plan_tpch_q13 1.00 2.1±0.04ms ? ?/sec 1.00 2.1±0.04ms ? ?/sec physical_plan_tpch_q14 1.00 2.7±0.05ms ? ?/sec 1.04 2.8±0.06ms ? ?/sec physical_plan_tpch_q16 1.00 3.7±0.08ms ? ?/sec 1.03 3.8±0.07ms ? ?/sec physical_plan_tpch_q17 1.01 3.6±0.07ms ? ?/sec 1.00 3.6±0.07ms ? ?/sec physical_plan_tpch_q18 1.00 4.0±0.07ms ? ?/sec 1.01 4.0±0.05ms ? ?/sec physical_plan_tpch_q19 1.00 6.2±0.07ms ? ?/sec 1.02 6.3±0.07ms ? ?/sec physical_plan_tpch_q2 1.00 7.8±0.09ms ? ?/sec 1.01 7.9±0.07ms ? ?/sec physical_plan_tpch_q20 1.00 4.6±0.08ms ? ?/sec 1.00 4.6±0.07ms ? ?/sec physical_plan_tpch_q21 1.00 6.2±0.08ms ? ?/sec 1.01 6.3±0.09ms ? ?/sec physical_plan_tpch_q22 1.00 3.4±0.08ms ? ?/sec 1.02 3.5±0.09ms ? ?/sec physical_plan_tpch_q3 1.00 3.1±0.07ms ? ?/sec 1.00 3.2±0.05ms ? ?/sec physical_plan_tpch_q4 1.00 2.3±0.06ms ? ?/sec 1.01 2.3±0.05ms ? ?/sec physical_plan_tpch_q5 1.00 4.4±0.06ms ? ?/sec 1.03 4.6±0.05ms ? ?/sec physical_plan_tpch_q6 1.00 1551.7±25.84µs ? ?/sec 1.03 1592.4±22.64µs ? ?/sec physical_plan_tpch_q7 1.00 5.7±0.06ms ? ?/sec 1.00 5.7±0.09ms ? ?/sec physical_plan_tpch_q8 1.00 7.4±0.08ms ? ?/sec 1.02 7.5±0.07ms ? ?/sec physical_plan_tpch_q9 1.00 5.7±0.09ms ? ?/sec 1.00 5.7±0.08ms ? ?/sec physical_select_all_from_1000 1.00 61.2±0.81ms ? ?/sec 1.00 61.3±0.33ms ? ?/sec physical_select_one_from_700 1.01 3.7±0.05ms ? ?/sec 1.00 3.6±0.04ms ? ?/sec ```

alamb commented 1 week ago

Other PR has been merged in: https://github.com/apache/datafusion/pull/10358

I think we can merge / rebase this PR now and mark it ready for review

erratic-pattern commented 5 days ago

I will take another look at this and see if I can clean it up a bit more.

erratic-pattern commented 4 days ago

Alright I think this is in a good state now. I added a dynamic dispatch API which I think could be useful if we ever add TreeNodeRewriter impl for OptimizerRule, then we could run the whole optimizer through this cycle logic.

I made some stuff public for doctests, but if we don't want to do that I can just mark the doctest as ignored

erratic-pattern commented 4 days ago

I removed the dynamic dispatch API for now because I think there is a better way to write it, but I can't really do that until we get to a point where we have a TreeNodeRewriter impl for Arc<dyn OptimizerRule>

jayzhan211 commented 4 days ago

I think the logic is correct, although it takes me sometime to understand the difference between the first pass and other pass, but I did not have a better design about this.

erratic-pattern commented 3 days ago

I think the logic is correct, although it takes me sometime to understand the difference between the first pass and other pass, but I did not have a better design about this.

We need to figure out how many rewrites are in a cycle. Since there is no Vec or other data structure, we cannot use a length, so we manually count and then record the cycle length at the end of the first pass.

Note that the dynamic dispatch API that I deleted from this PR does not have this problem, since we can simply set cycle_length to vec.len().

You can make this code to not have a special first pass if you wanted to make it simpler to understand. If you check cycle_length.is_none() in record_cycle_length before setting the cycle_length, you could call it at the end of each cycle and then you can just run everything in a single try_fold loop. However, I wanted to avoid unneccessary conditional checks, so I run the first cycle, record the cycle length, the continue with the remaining cycles.

This makes the code and the API kind of complicated for something that should be relatively simple. But I found dynamic dispatch and additional conditional checks to have a significant performance regressions when we're operating over a small number of rewriters (only 3 rewriters here), so I wanted to make an API useable with static dispatch.

A dynamic dispatch API is more appropriate for the top-level optimizer because:

it's already using dynamic dispatch
the overhead of dynamic dispatch should be less significant of a % of execution time when we are operating over a larger number of rewriters

apache / datafusion