dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

XGBoost4j-spark CrossValidation train FAILED on multi-GPU environment: Multiple processes running on same CUDA device is not supported! #10200

Open NvTimLiu opened 4 months ago

NvTimLiu commented 4 months ago

Running the latest 2.1.0-SNAPSHOT XGBoost4j-spark CrossValidation training together with the Spark plugin rapids-4-spark 24.06.0-SNAPSHOT on a multi-GPU environment,

the training FAILED with: Multiple processes running on same CUDA device is not supported!

ENV:

Detailed logs attached:

driver.log

executor.log

/workspace/src/collective/nccl_device_communicator.cu:45: Check failed: n_uniques == world_size_ (2 vs. 4) : Multiple processes within communication group running on same CUDA device is not supported. 63ecb99c4c7bb16d1cf28df4f220cdc9

Stack trace:
(0) /raid/tmp/libxgboost4j563501428309033116.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f54c71103ae]
(1) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::collective::NcclDeviceCommunicator::NcclDeviceCommunicator(int, bool, xgboost::StringView)+0x7ba) [0x7f54c78eafea]
(2) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::collective::Communicator::GetDevice(int)+0xf1) [0x7f54c78e58d1]
(3) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::common::SketchContainer::AllReduce(xgboost::Context const*, bool)+0x3cb) [0x7f54c79530cb]
(4) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::common::SketchContainer::MakeCuts(xgboost::Context const*, xgboost::common::HistogramCuts*, bool)+0xc1) [0x7f54c7953c01]
(5) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::data::IterativeDMatrix::InitFromCUDA(xgboost::Context const*, xgboost::BatchParam const&, void*, float, std::shared_ptr<xgboost::DMatrix>)+0x1d9b) [0x7f54c79eb64b]
(6) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::data::IterativeDMatrix::IterativeDMatrix(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int)+0x584) [0x7f54c7501164]
(7) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::DMatrix* xgboost::DMatrix::Create<void*, void*, void (void*), int (void*)>(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int)+0x77) [0x7f54c74ab7a7]
(8) /raid/tmp/libxgboost4j563501428309033116.so(XGQuantileDMatrixCreateFromCallback+0x1c8) [0x7f54c7142e28]

        at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
        at ml.dmlc.xgboost4j.java.QuantileDMatrix.<init>(QuantileDMatrix.java:26)
        at ml.dmlc.xgboost4j.scala.QuantileDMatrix.<init>(QuantileDMatrix.scala:36)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.buildDMatrix(GpuPreXGBoost.scala:552)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.$anonfun$buildWatches$1(GpuPreXGBoost.scala:507)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuUtils$.time(GpuUtils.scala:140)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.buildWatches(GpuPreXGBoost.scala:507)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.$anonfun$buildRDDWatches$4(GpuPreXGBoost.scala:484)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildWatchesAndCheck(XGBoost.scala:409)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:440)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:540)
        at scala.Option.map(Option.scala:230)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:539)
        at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
        at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
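The failed check (`n_uniques == world_size_ (2 vs. 4)`) suggests 4 NCCL ranks were started but they landed on only 2 distinct CUDA devices, i.e. two XGBoost workers were scheduled onto the same GPU. One common cause is Spark over-subscribing GPUs to tasks. A minimal configuration sketch, assuming a cluster where each executor should own exactly one GPU (the specific amounts below are illustrative assumptions, not taken from the attached logs):

```shell
# Hypothetical spark-submit fragment: pin one GPU per executor and require
# a full GPU per task, so no two barrier tasks share a CUDA device.
# Adjust amounts to the actual node topology.
spark-submit \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.rapids.sql.enabled=true \
  ...
```

With this layout, the XGBoost `num_workers` parameter should not exceed the number of distinct GPUs available across the executors; otherwise the barrier stage launches more ranks than devices and the `NcclDeviceCommunicator` uniqueness check fails as above.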
NvTimLiu commented 4 months ago

@wbo4958 @trivialfis