andygrove closed this issue 3 years ago
q6 fails with the following exception when running against Spark 3.1.1 but works with Spark 3.0.2 (with AQE and the RAPIDS CBO enabled in both cases):
java.util.NoSuchElementException: key not found: numPartitions
at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:101)
at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:99)
at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.sendDriverMetrics(CustomShuffleReaderExec.scala:122)
at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.shuffleRDD$lzycompute(CustomShuffleReaderExec.scala:182)
at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.shuffleRDD(CustomShuffleReaderExec.scala:181)
at org.apache.spark.sql.execution.adaptive.CustomShuffleReaderExec.doExecuteColumnar(CustomShuffleReaderExec.scala:196)
The q6 error above was misleading. There is a regression in Spark 3.1.1 with error handling related to executing on a canonicalized plan. I filed https://issues.apache.org/jira/browse/SPARK-34682.
Most of these failures are due to a single issue: the CBO sometimes forces a GPU CustomShuffleReaderExec back onto the CPU, making it incompatible with the GPU shuffle that has already happened.
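To make the mismatch concrete, here is a toy model in plain Python, not the plugin's actual code. `PlanNode`, `on_gpu`, and `find_mismatched_readers` are hypothetical names invented for this sketch, and the plan shape is simplified (in a real AQE plan the reader sits on top of a query stage rather than directly on the exchange). It only illustrates the invariant being violated: a shuffle reader has to run on the same side (CPU or GPU) as the shuffle it reads.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PlanNode:
    """Hypothetical, simplified stand-in for a physical plan operator."""
    name: str
    on_gpu: bool
    children: List["PlanNode"] = field(default_factory=list)


def find_mismatched_readers(node: PlanNode) -> List[Tuple[str, str]]:
    """Walk the plan and collect (reader, shuffle) pairs whose placement differs."""
    mismatches = []
    if node.name == "CustomShuffleReader":
        for child in node.children:
            if child.name == "ShuffleExchange" and child.on_gpu != node.on_gpu:
                mismatches.append((node.name, child.name))
    # Recurse so that readers deeper in the plan are checked as well.
    for child in node.children:
        mismatches.extend(find_mismatched_readers(child))
    return mismatches


# The shuffle has already run on the GPU ...
scan = PlanNode("Scan", on_gpu=True)
shuffle = PlanNode("ShuffleExchange", on_gpu=True, children=[scan])

# ... but the reader on top of it was moved back to the CPU (the failing case).
bad_reader = PlanNode("CustomShuffleReader", on_gpu=False, children=[shuffle])

# Consistent placement for comparison.
good_reader = PlanNode("CustomShuffleReader", on_gpu=True, children=[shuffle])

print(find_mismatched_readers(bad_reader))
print(find_mismatched_readers(good_reader))
```

In this sketch the first plan reports one mismatched pair and the second reports none. A check along these lines would have to run after the optimizer has made its placement decisions, since the shuffle's placement is already fixed by then.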
@andygrove is this resolved with #1910?
@sameerz No, but it is resolved by https://github.com/NVIDIA/spark-rapids/pull/1954
Describe the bug
With the experimental cost-based optimizer enabled, 23 of the NDS queries fail due to inconsistent joins (incompatible mix of CPU/GPU operators).
The queries that fail are:
q7, q9, q26, q27, q28, q30, q32, q36, q44, q59, q81, q92, q1, q6, q10, q54, q85, q94, q11, q13, q16, q23a, q35