dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.13k stars 8.7k forks

Hang inside NCCL. #10312

Open trivialfis opened 4 months ago

trivialfis commented 4 months ago

https://buildkite.com/xgboost/xgboost-ci-multi-gpu/builds/5027#018f9e76-44ae-4979-bd6d-c9aa5e0a617d

We ran into something like this before we moved to process-based multi-GPU training. The issue appears to be back now that we have thread-based tests again.

trivialfis commented 4 months ago

CUDA is not very friendly to valgrind, so we need a different method to debug this.

trivialfis commented 3 weeks ago

This is not exactly an XGBoost issue, but we still need a workaround.