NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Enable AQE `autoBroadcastJoinThreshold` configuration recommendation in Auto-tuner #719

Open cindyyuanjiang opened 10 months ago

cindyyuanjiang commented 10 months ago

Is your feature request related to a problem? Please describe.

This is a follow-up issue for PR #688.

We need to investigate spark.sql.adaptive.autoBroadcastJoinThreshold further before the Auto-tuner can make an accurate recommendation for it.

cindyyuanjiang commented 10 months ago

Thanks @viadea for the comment: "There are some hard rules we can consider for deciding when to use BHJ (BroadcastHashJoin) vs SHJ (ShuffledHashJoin) in GPU mode. Those hard rules can be implemented in our Profiling tool as well. For example:

  1. BHJ does not support full outer join. See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
  2. The maximum broadcast table should be smaller than 8G. See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala.

As a result, we can check the join type and data size on the smaller side to determine if we should promote the BHJ or not by setting a larger spark.sql.adaptive.autoBroadcastJoinThreshold."
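
To make the two hard rules above concrete, here is a minimal sketch of the check in Python. It is only an illustration, not the Auto-tuner's actual logic: the function name `is_bhj_candidate`, its parameters, and the idea of passing the join type as a string are all assumptions made for this example. The 8 GiB constant mirrors the limit in `BroadcastExchangeExec.scala` linked above; the join-type aliases follow the names Spark accepts for a full outer join.

```python
# Hypothetical sketch of the two hard rules; names are illustrative only.

# Upper bound on a broadcast table, interpreting "8G" as 8 GiB (assumption).
MAX_BROADCAST_TABLE_BYTES = 8 << 30

# Spellings treated as a full outer join after normalization (assumption:
# lowercase, with underscores and spaces removed).
_FULL_OUTER_ALIASES = {"outer", "full", "fullouter"}


def is_bhj_candidate(join_type: str, smaller_side_bytes: int) -> bool:
    """Return True if a join passes both hard rules for BroadcastHashJoin."""
    normalized = join_type.lower().replace("_", "").replace(" ", "")
    # Rule 1: BHJ does not support full outer join.
    if normalized in _FULL_OUTER_ALIASES:
        return False
    # Rule 2: the smaller side must be below the broadcast size limit.
    return 0 <= smaller_side_bytes < MAX_BROADCAST_TABLE_BYTES
```

A join that passes both rules is only a candidate: whether raising spark.sql.adaptive.autoBroadcastJoinThreshold actually helps still depends on the investigation this issue asks for.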