What is the problem the feature request solves?
Spark often prefers SortMergeJoin (SMJ) to ShuffledHashJoin (SHJ) because SMJ is more stable (less likely to run out of memory) and still performs well.
However, Comet and other vectorized Spark accelerators tend to perform better with SHJ. In https://github.com/apache/datafusion-comet/pull/1007 we add a configuration option that lets the user choose between SMJ and SHJ at the time Comet translates Spark's plan, but it would be better if Comet could enable SHJ automatically.
Spark AQE already re-optimizes each query stage using basic statistics (row count and data size) from completed child query stages, and chooses between SMJ and SHJ based on Spark's cost model. Spark supports custom cost models, so we should explore providing a Comet-specific cost model that makes a better choice between SMJ and SHJ.
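To make the idea concrete, here is a rough, self-contained sketch of the kind of decision such a cost model might make from the row count and data size of each join side. It is plain Python, not Spark or Comet code: `SideStats`, `smj_cost`, `shj_cost`, `choose_join`, the `build_weight` factor, and the per-partition memory limit are all hypothetical names and values chosen for illustration, and real weights would have to come from benchmarking.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SideStats:
    """Statistics for one join input (hypothetical stand-in for AQE stage stats)."""
    rows: int
    size_bytes: int


def smj_cost(left: SideStats, right: SideStats) -> float:
    """Toy SMJ cost: sort both sides (n log n), then a linear merge pass."""
    def sort_cost(n: int) -> float:
        return n * math.log2(n) if n > 1 else float(n)
    return sort_cost(left.rows) + sort_cost(right.rows) + left.rows + right.rows


def shj_cost(left: SideStats, right: SideStats, build_weight: float = 2.0) -> float:
    """Toy SHJ cost: build a hash table on the smaller side, probe with the other.

    build_weight is an assumed factor making a build row costlier than a probe row.
    """
    build, probe = (left, right) if left.size_bytes <= right.size_bytes else (right, left)
    return build_weight * build.rows + probe.rows


def choose_join(left: SideStats, right: SideStats,
                partitions: int, max_build_bytes_per_partition: int) -> str:
    """Pick a join strategy; fall back to SMJ when the build side may not fit in memory."""
    build_bytes = min(left.size_bytes, right.size_bytes)
    if build_bytes / partitions > max_build_bytes_per_partition:
        return "SortMergeJoin"  # stability guard: avoid a likely OOM
    if shj_cost(left, right) < smj_cost(left, right):
        return "ShuffledHashJoin"
    return "SortMergeJoin"


if __name__ == "__main__":
    mib = 1024 * 1024
    small = SideStats(rows=1_000_000, size_bytes=64 * mib)
    large = SideStats(rows=50_000_000, size_bytes=4096 * mib)
    print(choose_join(small, large, partitions=200,
                      max_build_bytes_per_partition=64 * mib))
```

The memory guard runs before the cost comparison so that the stability advantage of SMJ is preserved: SHJ is only considered when the estimated build side per partition stays under the assumed limit.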
Describe the potential solution
No response
Additional context
No response