apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
826 stars 164 forks source link

perf: Add experimental feature to replace SortMergeJoin with ShuffledHashJoin #1007

Closed andygrove closed 1 month ago

andygrove commented 1 month ago

Which issue does this PR close?

Closes https://github.com/apache/datafusion-comet/issues/1006

Rationale for this change

Improved performance

What changes are included in this PR?

How are these changes tested?

I manually ran TPC-H and saw improved performance. I will post benchmarks once I have run more tests.

andygrove commented 1 month ago

Here is a teaser for the performance improvement. This is for TPC-H q11 (SF=100) with broadcast joins disabled (I am looking into a regression with those). I ran the query 5 times each with rule enabled vs disabled.

Rule Off

79.87537693977356,
77.76734256744385,
75.35734295845032,
75.44863200187683,
72.88174152374268

Rule On

39.33945274353027,
36.159271240234375,
35.83299708366394,
35.638232707977295,
35.67777371406555
parthchandra commented 1 month ago

There is a small danger in enabling this without having a good estimate of the size of the build side. ShuffleHashJoin has limits on how much data it can process efficiently. If the build side hash table has no spilling then a large enough build side will cause OOMs and if there is spilling, then SMJ can frequently lead to better performance. We might even see this when we scale the benchmark from SF1 to say SF10. Is there a way for us to get cardinality and row size for the build side somehow? Still worth adding this option though.

parthchandra commented 1 month ago

if there is spilling, then SMJ can frequently lead to better performance I have seen this happen with Spark with some TPC-DS queries at SF10.

andygrove commented 1 month ago

Current benchmarks:

tpch_allqueries

Speedup of using HashJoin instead of SortMergeJoin:

tpch_queries_speedup

codecov-commenter commented 1 month ago

Codecov Report

Attention: Patch coverage is 19.44444% with 29 lines in your changes missing coverage. Please review.

Project coverage is 34.27%. Comparing base (e3ac6cf) to head (1073517). Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...ain/scala/org/apache/comet/rules/RewriteJoin.scala 0.00% 26 Missing :warning:
...org/apache/comet/CometSparkSessionExtensions.scala 40.00% 2 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1007 +/- ## ============================================ - Coverage 34.41% 34.27% -0.14% + Complexity 886 881 -5 ============================================ Files 112 113 +1 Lines 43479 43514 +35 Branches 9656 9663 +7 ============================================ - Hits 14962 14914 -48 - Misses 25442 25510 +68 - Partials 3075 3090 +15 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

andygrove commented 1 month ago

I will add documentation to this PR today, explaining pros/cons of this feature in our tuning guide.

andygrove commented 1 month ago

@viirya @parthchandra This is now ready for review. The new option is disabled by default and I added a section to the tuning guide explaining why users may want to enable this new option.

andygrove commented 1 month ago

I have run into a deadlock when running TPC-DS benchmarks with this feature, so I am moving to draft while I investigate. It is possibly related to the memory pool issues that we are also working on in other PRs.

andygrove commented 1 month ago

After upmerging, I no longer see the deadlock, but instead get an error if I have insufficient memory allocated, which is an improvement.

org.apache.comet.CometNativeException (External error: Internal error: 

Partition is still not able to allocate enough memory for the array builders after spilling..

However, when I increase memory, I see queries fail due to https://github.com/apache/datafusion-comet/issues/1019.

andygrove commented 1 month ago

I have now marked the feature as experimental and explained in the tuning guide that there is no spill to disk so this could result in OOM.

andygrove commented 1 month ago

Fresh benchmarks after upmerging.

tpch_allqueries

andygrove commented 1 month ago

TPC-DS excluding q97 (OOM with ShuffledHashJoin).

tpcds_allqueries