NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[BUG] Shuffle partition lookup in AutoTuner throws exception for Databricks event logs with auto-optimized shuffle #1404

Open parthosa opened 3 weeks ago

Describe the bug

When recommending shuffle partitions, AutoTuner reads the existing value of spark.sql.shuffle.partitions and converts it to an integer. For Databricks event logs, however, this value may be the string auto rather than a number. In that case the conversion throws a NumberFormatException.
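The failure can be reproduced outside the tool with a minimal sketch. The object name, the property map, and the helper below are hypothetical stand-ins and are not taken from AutoTuner.scala; the sketch only assumes the value is read as a string and converted with toInt, as the stack trace suggests.

```scala
import scala.util.Try

object ShufflePartitionsRepro {
  // Hypothetical stand-in for the Spark properties read from a Databricks event log.
  val sparkProps: Map[String, String] =
    Map("spark.sql.shuffle.partitions" -> "auto")

  // Assumed failing pattern: read the string value and call toInt on it directly.
  // 200 is Spark's documented default for spark.sql.shuffle.partitions.
  def readPartitionsUnsafe(props: Map[String, String]): Int =
    props.getOrElse("spark.sql.shuffle.partitions", "200").toInt

  def main(args: Array[String]): Unit = {
    // Wrapping in Try captures the NumberFormatException instead of crashing.
    val result = Try(readPartitionsUnsafe(sparkProps))
    println(s"conversion failed: ${result.isFailure}")
  }
}
```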

Detailed Output

    | java.lang.NumberFormatException: For input string: "auto"
    |   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_422]
    |   at java.lang.Integer.parseInt(Integer.java:580) ~[?:1.8.0_422]
    |   at java.lang.Integer.parseInt(Integer.java:615) ~[?:1.8.0_422]
    |   at scala.collection.immutable.StringLike.toInt(StringLike.scala:310) ~[scala-library-2.12.18.jar:?]
    |   at scala.collection.immutable.StringLike.toInt$(StringLike.scala:310) ~[scala-library-2.12.18.jar:?]
    |   at scala.collection.immutable.StringOps.toInt(StringOps.scala:33) ~[scala-library-2.12.18.jar:?]
    |   at com.nvidia.spark.rapids.tool.profiling.AutoTuner.recommendShufflePartitions(AutoTuner.scala:1008) ~[rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at com.nvidia.spark.rapids.tool.profiling.AutoTuner.calculateJobLevelRecommendations(AutoTuner.scala:723) ~[rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at com.nvidia.spark.rapids.tool.profiling.AutoTuner.getRecommendedProperties(AutoTuner.scala:1163) ~[rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at com.nvidia.spark.rapids.tool.tuning.QualificationAutoTuner.runAutoTuner(QualificationAutoTuner.scala:70) ~[rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at com.nvidia.spark.rapids.tool.tuning.TunerContext$$anonfun$tuneApplication$1.$anonfun$applyOrElse$1(TunerContext.scala:60) ~[rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at scala.util.Try$.apply(Try.scala:213) ~[scala-library-2.12.18.jar:?]
    |   at com.nvidia.spark.rapids.tool.tuning.TunerContext$$anonfun$tuneApplication$1.applyOrElse(TunerContext.scala:60) ~[rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at com.nvidia.spark.rapids.tool.tuning.TunerContext$$anonfun$tuneApplication$1.applyOrElse(TunerContext.scala:57) ~[rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:228) ~[scala-library-2.12.18.jar:?]
    |   at scala.PartialFunction$Lifted.apply(PartialFunction.scala:224) ~[scala-library-2.12.18.jar:?]
    |   at scala.Option.collect(Option.scala:432) ~[scala-library-2.12.18.jar:?]
    |   at com.nvidia.spark.rapids.tool.tuning.TunerContext.tuneApplication(TunerContext.scala:57) ~[rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at com.nvidia.spark.rapids.tool.qualification.Qualification.$anonfun$qualifyApp$6(Qualification.scala:184) ~[rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.18.jar:?]
    |   at com.nvidia.spark.rapids.tool.qualification.Qualification.$anonfun$qualifyApp$5(Qualification.scala:179) ~[rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [scala-library-2.12.18.jar:?]
    |   at com.nvidia.spark.rapids.tool.qualification.AppSubscriber$.withSafeValidAttempt(AppSubscriber.scala:57) [rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at com.nvidia.spark.rapids.tool.qualification.Qualification.com$nvidia$spark$rapids$tool$qualification$Qualification$$qualifyApp(Qualification.scala:178) [rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at com.nvidia.spark.rapids.tool.qualification.Qualification$QualifyThread.run(Qualification.scala:50) [rapids-4-spark-tools_2.12-24.08.3-SNAPSHOT.jar:?]
    |   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_422]
    |   at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_422]
    |   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_422]
    |   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_422]
    |   at java.lang.Thread.run(Thread.java:750) [?:1.8.0_422]

Expected Behavior

AutoTuner should not throw an exception when spark.sql.shuffle.partitions is set to auto.
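One possible way to get this behavior is to parse the value defensively. The sketch below is illustrative only and is not the project's fix: the object, constant, and function names are hypothetical, and falling back to Spark's default of 200 for non-numeric values is an assumption about what a reasonable fallback might be. It uses Try rather than toIntOption because the trace shows Scala 2.12.

```scala
import scala.util.Try

object ShufflePartitionsParsing {
  val ShufflePartitionsKey = "spark.sql.shuffle.partitions"

  // Assumed fallback: Spark's documented default for this property.
  val DefaultShufflePartitions = 200

  // Returns the configured partition count when it is numeric; otherwise
  // (missing key, "auto", or any other non-numeric string) returns the default.
  def parseShufflePartitions(props: Map[String, String]): Int =
    props.get(ShufflePartitionsKey)
      .flatMap(v => Try(v.trim.toInt).toOption)
      .getOrElse(DefaultShufflePartitions)

  def main(args: Array[String]): Unit = {
    println(parseShufflePartitions(Map(ShufflePartitionsKey -> "auto")))
    println(parseShufflePartitions(Map(ShufflePartitionsKey -> "400")))
  }
}
```

Whether the tool should fall back to a default, skip the recommendation, or report auto as-is is a design decision for the maintainers; the sketch only shows that the conversion itself can be made non-throwing.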

Additional Context

Auto Optimized Shuffle: https://docs.databricks.com/en/optimizations/aqe.html#enable-auto-optimized-shuffle