NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Some queries fail when cost-based optimizations are enabled #1899

Closed · andygrove closed 3 years ago

andygrove commented 3 years ago

Describe the bug
With the experimental cost-based optimizer enabled, 23 of the NDS queries fail because the optimizer produces inconsistent plans around joins, with an incompatible mix of CPU and GPU operators (a configuration sketch follows the query list below).

The queries that fail are: q7, q9, q26, q27, q28, q30, q32, q36, q44, q59, q81, q92, q1, q6, q10, q54, q85, q94, q11, q13, q16, q23a, q35.
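
For reference, a minimal Scala sketch of the configuration under which these failures show up. The plugin class and the AQE flag are standard Spark/RAPIDS settings; the CBO flag name refers to the plugin's experimental optimizer and may differ between releases, so treat it as an assumption:

    import org.apache.spark.sql.SparkSession

    // Sketch only: the setup described above (RAPIDS plugin + AQE + experimental CBO).
    // The optimizer flag name is assumed and may vary by plugin release.
    val spark = SparkSession.builder()
      .appName("nds-cbo-repro")
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  // RAPIDS Accelerator
      .config("spark.sql.adaptive.enabled", "true")           // AQE
      .config("spark.rapids.sql.optimizer.enabled", "true")   // experimental CBO (assumed flag)
      .getOrCreate()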

andygrove commented 3 years ago

q6 fails with the following error when running against Spark 3.1.1, but works with Spark 3.0.2 (with AQE and RAPIDS CBO enabled in both cases):

java.util.NoSuchElementException: key not found: numPartitions
        at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:101)
        at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:99)
        at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.sendDriverMetrics(CustomShuffleReaderExec.scala:122)
        at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.shuffleRDD$lzycompute(CustomShuffleReaderExec.scala:182)
        at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.shuffleRDD(CustomShuffleReaderExec.scala:181)
        at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.doExecuteColumnar(CustomShuffleReaderExec.scala:196)
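
The exception itself is just Scala's generic failure for a lookup in an empty Map: the canonicalized copy of the node carries no metrics, so the numPartitions lookup in sendDriverMetrics has nothing to find. A two-line illustration in plain Scala (not plugin code):

    // An empty map reproduces the exact exception text seen above.
    val metrics = Map.empty[String, Long]
    metrics("numPartitions")  // java.util.NoSuchElementException: key not found: numPartitions
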
andygrove commented 3 years ago

The q6 error above was misleading. There is a regression in Spark 3.1.1 with error handling related to executing on a canonicalized plan. I filed https://issues.apache.org/jira/browse/SPARK-34682.
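
For context, a canonicalized plan is a normalized copy used only for plan comparison and is never supposed to be executed; Spark is expected to fail fast with a clear error if that happens, and the regression is that CustomShuffleReaderExec touched its empty metrics map before that check could fire. A spark-shell sketch of where canonicalized plans come from, for illustration only:

    // `canonicalized` is the normalized form Spark uses for sameResult() comparisons.
    val executed = spark.sql("SELECT 1").queryExecution.executedPlan
    val canon = executed.canonicalized
    // canon.execute() // not a supported operation; the error raised here is what regressed in 3.1.1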

andygrove commented 3 years ago

Most of these failures come down to a single issue: CBO sometimes forces a GPU CustomShuffleReaderExec back onto the CPU, making it incompatible with the GPU shuffle that has already happened.
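
To make the failure mode concrete, here is a toy Scala model of the invariant involved; the class names are invented for illustration and are not the plugin's operators. Once a shuffle exchange runs on the GPU its output is columnar batches, so a shuffle reader moved back to the CPU cannot consume it:

    sealed trait Side
    case object CpuSide extends Side
    case object GpuSide extends Side

    // Toy plan nodes, not the plugin's real classes.
    case class ShuffleExchange(side: Side)
    case class ShuffleReader(side: Side, exchange: ShuffleExchange)

    // Invariant: the reader must run on the same side as the exchange that
    // produced its input, because the data format (rows vs. columnar batches)
    // is fixed once the shuffle has executed.
    def isConsistent(reader: ShuffleReader): Boolean =
      reader.side == reader.exchange.side

    // The broken plans looked like this: a GPU shuffle feeding a CPU reader.
    assert(!isConsistent(ShuffleReader(CpuSide, ShuffleExchange(GpuSide))))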

sameerz commented 3 years ago

@andygrove is this resolved with #1910?

andygrove commented 3 years ago

> @andygrove is this resolved with #1910?

@sameerz No, but it is resolved by https://github.com/NVIDIA/spark-rapids/pull/1954