apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
823 stars 163 forks source link

[Research] Use custom cost model when deciding between SMJ and SHJ #1011

Open andygrove opened 1 month ago

andygrove commented 1 month ago

What is the problem the feature request solves?

Spark often prefers SortMergeJoin (SMJ) to ShuffledHashJoin (SHJ) because it is more stable (less likely to OOM) and has good performance.

However, Comet (and other vectorized Spark accelerators) tend to have better performance with SHJ. In https://github.com/apache/datafusion-comet/pull/1007 we add a configuration option that lets the user choose between SMJ and SHJ at the time Comet translates Spark's plan, but it would be better if we could automatically enable SHJ.

Spark AQE already re-optimizes each query stage leveraging basic statistics (row count / data size) from completed child query stages and will choose between SMJ and SHJ based on Spark's cost model. Spark supports custom cost models, so we should explore providing a Comet-specific cost model that can make a better choice between SMJ and SHJ.

Describe the potential solution

No response

Additional context

No response