[PERF] Improve hash table probe side decisions for Swordfish

desmondcheongzx commented 3 days ago

This PR lifts statistics into optimized logical plans so that they're available for local execution plans. It then uses these newly available statistics to make better decisions on whether to build the probe table of a hash join on the left or right side.

Benchmark results

For TPC-H, this gives us some notable speedups with Q5, Q8, and Q19.

Q5: ~2.2x (in memory), ~1.5 (parquet)
Q8: ~5.9x (in memory), ~2.2x (parquet)
Q19: ~7x (in memory), ~4.5x (parquet)

Crucially, with this change, our native runner is now faster (or within some small deviation) than our previous python runner for all 22 TPC-H queries.

For more detailed results, we have:

Q5

Before

--------------------------------------------------------------------------------- benchmark 'q5-parts-1': 4 tests ---------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-python-5]     28.7213 (1.0)      30.3664 (1.0)      29.4645 (1.0)      0.3849 (1.0)      29.4894 (1.0)      0.5529 (1.0)          10;0  33.9391 (1.0)          33           1
test_tpch[1-in-memory-native-5]     30.9980 (1.08)     34.1489 (1.12)     32.2151 (1.09)     0.7150 (1.86)     32.2903 (1.09)     1.1586 (2.10)         10;0  31.0413 (0.91)         31           1
test_tpch[1-parquet-python-5]       48.8010 (1.70)     51.8535 (1.71)     50.0985 (1.70)     0.8342 (2.17)     50.0400 (1.70)     1.4193 (2.57)          9;0  19.9607 (0.59)         20           1
test_tpch[1-parquet-native-5]       51.1122 (1.78)     54.0755 (1.78)     52.3799 (1.78)     0.8317 (2.16)     52.4268 (1.78)     1.2526 (2.27)          8;0  19.0913 (0.56)         20           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After

--------------------------------------------------------------------------------- benchmark 'q5-parts-1': 4 tests ---------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-native-5]     13.8393 (1.0)      16.2712 (1.0)      14.4352 (1.0)      0.5322 (2.13)     14.3212 (1.0)      0.5084 (1.71)          9;5  69.2750 (1.0)          62           1
test_tpch[1-in-memory-python-5]     28.3915 (2.05)     29.5304 (1.81)     28.8477 (2.00)     0.2501 (1.0)      28.8299 (2.01)     0.2971 (1.0)          10;1  34.6649 (0.50)         35           1
test_tpch[1-parquet-native-5]       34.3952 (2.49)     36.4230 (2.24)     35.4314 (2.45)     0.4543 (1.82)     35.3678 (2.47)     0.5042 (1.70)          8;1  28.2235 (0.41)         28           1
test_tpch[1-parquet-python-5]       55.7339 (4.03)     57.3051 (3.52)     56.4564 (3.91)     0.5101 (2.04)     56.2507 (3.93)     0.7570 (2.55)          4;0  17.7128 (0.26)         18           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q8

Before

--------------------------------------------------------------------------------- benchmark 'q8-parts-1': 4 tests ---------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-python-8]     14.9532 (1.0)      17.2992 (1.0)      16.1906 (1.0)      0.5962 (1.22)     16.1370 (1.0)      0.8323 (1.13)         23;0  61.7642 (1.0)          60           1
test_tpch[1-parquet-python-8]       34.7310 (2.32)     52.5183 (3.04)     37.1660 (2.30)     3.2354 (6.61)     36.5454 (2.26)     2.2469 (3.04)          1;1  26.9063 (0.44)         28           1
test_tpch[1-in-memory-native-8]     44.0259 (2.94)     46.0576 (2.66)     45.0905 (2.78)     0.4898 (1.0)      45.0528 (2.79)     0.7380 (1.0)           5;0  22.1776 (0.36)         22           1
test_tpch[1-parquet-native-8]       69.8245 (4.67)     73.1332 (4.23)     71.0333 (4.39)     0.8421 (1.72)     70.8515 (4.39)     0.9827 (1.33)          3;1  14.0779 (0.23)         14           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After

--------------------------------------------------------------------------------- benchmark 'q8-parts-1': 4 tests ----------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-native-8]      7.2145 (1.0)       8.5953 (1.0)       7.7980 (1.0)      0.3170 (1.0)       7.8074 (1.0)      0.4959 (1.13)         44;0  128.2373 (1.0)         118           1
test_tpch[1-in-memory-python-8]     15.3458 (2.13)     17.6777 (2.06)     16.3485 (2.10)     0.5217 (1.65)     16.5234 (2.12)     0.7077 (1.62)         17;0   61.1678 (0.48)         57           1
test_tpch[1-parquet-native-8]       31.8017 (4.41)     33.5568 (3.90)     32.4955 (4.17)     0.4402 (1.39)     32.4145 (4.15)     0.4376 (1.0)           9;2   30.7735 (0.24)         30           1
test_tpch[1-parquet-python-8]       46.2703 (6.41)     48.3345 (5.62)     47.6753 (6.11)     0.5142 (1.62)     47.7092 (6.11)     0.4509 (1.03)          6;2   20.9752 (0.16)         21           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q19

Before

------------------------------------------------------------------------------------ benchmark 'q19-parts-1': 4 tests -----------------------------------------------------------------------------------
Name (time in ms)                         Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-python-19]     277.9533 (1.0)      281.5647 (1.0)      279.8556 (1.0)       1.4607 (1.0)      279.4186 (1.0)       2.2466 (1.0)           2;0  3.5733 (1.0)           5           1
test_tpch[1-parquet-python-19]       311.1196 (1.12)     317.8069 (1.13)     315.6200 (1.13)      2.6611 (1.82)     316.7545 (1.13)      2.8849 (1.28)          1;0  3.1684 (0.89)          5           1
test_tpch[1-in-memory-native-19]     431.2738 (1.55)     464.2194 (1.65)     442.1488 (1.58)     13.3136 (9.11)     436.8320 (1.56)     16.3197 (7.26)          1;0  2.2617 (0.63)          5           1
test_tpch[1-parquet-native-19]       455.3492 (1.64)     460.8460 (1.64)     458.0333 (1.64)      2.1169 (1.45)     457.4410 (1.64)      3.0005 (1.34)          2;0  2.1832 (0.61)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After

------------------------------------------------------------------------------------ benchmark 'q19-parts-1': 4 tests -----------------------------------------------------------------------------------
Name (time in ms)                         Min                 Max                Mean             StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-native-19]      62.6100 (1.0)       71.2192 (1.0)       66.0757 (1.0)       2.5634 (1.0)       65.6719 (1.0)      1.5398 (1.0)           6;4  15.1342 (1.0)          15           1
test_tpch[1-parquet-native-19]        94.8984 (1.52)     134.7584 (1.89)     103.1099 (1.56)     12.5712 (4.90)      97.7583 (1.49)     7.3370 (4.76)          1;1   9.6984 (0.64)          9           1
test_tpch[1-in-memory-python-19]     284.6653 (4.55)     295.5558 (4.15)     289.7268 (4.38)      3.9986 (1.56)     288.6982 (4.40)     4.6399 (3.01)          2;0   3.4515 (0.23)          5           1
test_tpch[1-parquet-python-19]       308.9599 (4.93)     319.0440 (4.48)     314.7801 (4.76)      4.2088 (1.64)     315.3274 (4.80)     7.0198 (4.56)          2;0   3.1768 (0.21)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

codspeed-hq[bot] commented 1 day ago

CodSpeed Performance Report

Merging #3327 will degrade performances by 40.27%

_{Comparing desmondcheongzx:lift-stats-to-logical-plan (849083a) with main (3394a66)}

Summary

⚡ 3 improvements
❌ 3 regressions
✅ 11 untouched benchmarks

:warning: Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

	Benchmark	`main`	`desmondcheongzx:lift-stats-to-logical-plan`	Change
⚡	`test_count[1 Small File]`	3.8 ms	3.4 ms	+13.14%
❌	`test_explain[100 Small Files]`	7.2 ms	10 ms	-28.2%
❌	`test_show[100 Small Files]`	15.9 ms	26.6 ms	-40.27%
⚡	`test_tpch[1-in-memory-native-5]`	376.3 ms	312 ms	+20.6%
⚡	`test_tpch[1-in-memory-native-8]`	339.2 ms	156.9 ms	×2.2
❌	`test_tpch[1-in-memory-native-9]`	362.9 ms	451.6 ms	-19.64%

kevinzwang commented 1 day ago

@desmondcheongzx is this PR ready for review? it looks like a test is currently failing. Also is there any specific part you'd like me (and others) to take a look at?

desmondcheongzx commented 1 day ago

@kevinzwang yeah it's ready now 😅 Apparently there was one last issue hiding with the interaction between materializing stats and generator scans.

I think this is a good breakdown for parts to look at:

@kevinzwang Grabbing scan stats in src/daft-logical-plan/src/ops/source.rs
@colin-ho Translation of stats from logical -> local plan, and use of stats in the local pipeline
@samster25 would love to sanity check overall approach, in particular with the EnrichWithStats optimizer rule to materialize stats

Eventual-Inc / Daft