Open gerashegalov opened 5 months ago
Thanks for filing this. I do not know why we got an NPE here. I didn't get one when I tested the Apache issue, so I am now worried that there's a bug somewhere.
Our plugin init code currently assumes that the lazy shuffle manager instance SparkEnv.get.shuffleManager has already been created and set, and uses it to execute some validation and initialization. Now that the order of SM instantiation and Plugin initialization is reversed in 4.0.0, we need to do the validation steps without assuming an instance in the ExecutorDriver init, and set some flag to force eager initialization at SM instantiation time. I think we can write this code without shimming, but worst case with shimming.
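To make the proposed ordering fix concrete, here is a minimal toy model of it. It is written in Python for brevity and is not the plugin's actual Scala code. Every name in it (ToyEnv, ToyShuffleManager, PluginState, plugin_init, the "shuffle.manager" key) is hypothetical, and the only assumption about Spark is the one stated above: in 4.0.0 the plugin is initialized before the shuffle manager instance exists.

```python
# Toy model of the startup-order problem. All names are hypothetical and
# only illustrate the idea; this is not Spark or spark-rapids code.

class PluginState:
    # Flag set by plugin init when the shuffle manager does not exist yet.
    force_eager_init = False


class ToyShuffleManager:
    def __init__(self):
        self.initialized = False
        # New order: the plugin ran first and asked for eager initialization.
        if PluginState.force_eager_init:
            self.initialize()

    def initialize(self):
        self.initialized = True


class ToyEnv:
    """Stands in for the environment that owns the shuffle manager."""

    def __init__(self):
        self.shuffle_manager = None  # not created until asked

    def initialize_shuffle_manager(self):
        self.shuffle_manager = ToyShuffleManager()


def plugin_init(conf, env):
    # Validate from configuration only, without touching the instance.
    if conf.get("shuffle.manager") != "ToyShuffleManager":
        raise ValueError("unexpected shuffle manager configured")
    if env.shuffle_manager is not None:
        # Old order: the instance already exists, initialize it directly.
        env.shuffle_manager.initialize()
    else:
        # New order: defer, and let the constructor initialize eagerly.
        PluginState.force_eager_init = True
```

With this shape, plugin_init never dereferences a missing instance, so the same code path works whether the shuffle manager is created before or after the plugin.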
With apache/spark#43627 we eliminated the need to add the plugin jar via spark.executor.extraClassPath and paved the way to the simplified Boolean switch useRSM=true/false. Now would be a good time to do this work. At the minimum we need to fix the NullPointerException issue resulting from the initialization order change.
Steps/Code to reproduce bug
Start a local-cluster with RSM
Note:
--conf spark.executor.extraClassPath=$PWD/scala2.13/dist/target/rapids-4-spark_2.13-24.08.0-SNAPSHOT-cuda11.jar
Run
Check the executor log
Additional context
[SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order https://github.com/razajafri/spark-rapids/pull/3