NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Rework RapidsShuffleManager initialization for Apache Spark 4.0.0 #11107

Open gerashegalov opened 5 months ago

gerashegalov commented 5 months ago

With apache/spark#43627 we eliminated the need to add the plugin jar via spark.executor.extraClassPath and paved the way for a simplified Boolean switch, useRSM=true/false. Now would be a good time to do this work. At a minimum, we need to fix the NullPointerException that results from the change in initialization order.

Steps/Code to reproduce bug

Start a local-cluster with RSM

JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64 \
  ~/dist/spark-4.0.0-preview1-bin-hadoop3/bin/spark-shell \
  --jars scala2.13/dist/target/rapids-4-spark_2.13-24.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.memory.gpu.allocSize=1536m \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark400.RapidsShuffleManager \
  --master local-cluster[2,2,1024]

Note: the command above does not set --conf spark.executor.extraClassPath=$PWD/scala2.13/dist/target/rapids-4-spark_2.13-24.08.0-SNAPSHOT-cuda11.jar

Run

scala> spark.range(100000).repartition(2).summary().collect()

Check the executor log

{
  "ts": "2024-06-28T21:39:27.924Z",
  "level": "ERROR",
  "msg": "Exception in the executor plugin, shutting down!",
  "exception": {
    "class": "java.lang.NullPointerException",
    "msg": "Cannot invoke \"Object.getClass()\" because \"shuffleManager\" is null",
    "stacktrace": [
      {
        "class": "org.apache.spark.sql.rapids.GpuShuffleEnv$",
        "method": "initShuffleManager",
        "file": "GpuShuffleEnv.scala",
        "line": 112
      },
      {
        "class": "com.nvidia.spark.rapids.RapidsExecutorPlugin",
        "method": "init",
        "file": "Plugin.scala",
        "line": 551
      },
      {
        "class": "org.apache.spark.internal.plugin.ExecutorPluginContainer",
        "method": "$anonfun$executorPlugins$1",
        "file": "PluginContainer.scala",
        "line": 125
      },
...
    ]
  },
  "logger": "RapidsExecutorPlugin"
}

Additional context

[SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order https://github.com/razajafri/spark-rapids/pull/3

abellina commented 5 months ago

Thanks for filing this. I do not know why we got an NPE here. I didn't get one when I tested the Apache issue, so now I am worried that there is a bug somewhere.

gerashegalov commented 5 months ago

Our plugin init code currently assumes that the lazy shuffle manager instance SparkEnv.get.shuffleManager has already been created and set, so that it can run some validation and initialization against it. Now that the order of SM instantiation and plugin initialization is reversed in 4.0.0, we need to do the validation steps in the ExecutorDriver init without assuming an instance exists, and set a flag that forces eager initialization at SM instantiation time. I think we can write this code without shimming, or with shimming in the worst case.
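
To make the idea concrete, here is a minimal, Spark-free sketch of the ordering described above. It is only an illustration under stated assumptions: InitOrderSketch, SketchShuffleManager, pluginInit and eagerInitRequested are hypothetical names, not spark-rapids or Spark APIs. The only values taken from this issue are the spark.shuffle.manager key and the class name from the repro command. The plugin-side check looks at configuration only, and the stand-in shuffle manager reads the flag when it is constructed later.

```scala
// Hypothetical stand-ins; none of these names exist in spark-rapids or Spark.
object InitOrderSketch {
  // Shuffle manager class name taken from the repro command in this issue.
  val ExpectedRsmClass = "com.nvidia.spark.rapids.spark400.RapidsShuffleManager"

  // Set during plugin init, read when the shuffle manager is constructed.
  @volatile private var eagerInitRequested = false

  // Stand-in for a shuffle manager that is instantiated after plugin init.
  final class SketchShuffleManager {
    // Captures the flag at construction time instead of waiting for the plugin.
    val initializedEagerly: Boolean = eagerInitRequested
  }

  // Plugin-side validation: inspects the configuration only and never
  // dereferences a shuffle manager instance, so a missing instance cannot NPE.
  def pluginInit(conf: Map[String, String]): Boolean = {
    val rsmConfigured = conf.get("spark.shuffle.manager").contains(ExpectedRsmClass)
    if (rsmConfigured) eagerInitRequested = true
    rsmConfigured
  }

  def main(args: Array[String]): Unit = {
    val conf = Map("spark.shuffle.manager" -> ExpectedRsmClass)
    val enabled = pluginInit(conf)       // runs first, as in 4.0.0
    val sm = new SketchShuffleManager    // instantiated afterwards
    println(s"rsmConfigured=$enabled eager=${sm.initializedEagerly}")
  }
}
```

Validating against the configured class name rather than a live instance is what removes the dependency on the startup order. Whether the real fix can read everything it needs from configuration alone is an open question here.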