dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Conflict with a CUDA-11 PyTorch installation #10729

Open bryant1410 opened 1 month ago

bryant1410 commented 1 month ago

The XGBoost Python package depends on nvidia-nccl-cu12, which targets CUDA 12. I have a PyTorch 2.4.0 installation built for CUDA 11.8, and when I use distributed mode, PyTorch picks up the NCCL library installed by XGBoost, which causes problems in my environment.

My workaround is to install the CPU-only version of XGBoost, but I still want to use XGBoost with CUDA support. It'd be nice if I could use it with nvidia-nccl-cu11 instead. I'm not sure what the solution could be (maybe optional dependency groups for XGBoost, such as a cu11 one; or a different package). Note this could become a problem again when CUDA 13 comes out.

trivialfis commented 1 month ago

Thank you for raising an issue. I don't know what the right solution is either. As a workaround, one can either use conda/mamba, or install xgboost without its dependencies (pip install --no-deps) and then pick up those dependencies manually.
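A rough sketch of the second workaround (the exact dependency list and the cu11 NCCL package are assumptions here, so adjust to your environment):

```sh
# Install the XGBoost wheel without letting pip pull in nvidia-nccl-cu12.
pip install --no-deps xgboost

# Reinstall the remaining runtime dependencies by hand, choosing the NCCL
# build that matches the rest of the environment (CUDA 11.8 in this case).
pip install numpy scipy nvidia-nccl-cu11
```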

The binary wheel itself can work with cu11; the question is how to specify that in the pyproject.toml.
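One idea (purely a sketch, not something the XGBoost package ships today) would be optional dependency groups, so the NCCL build becomes an explicit choice:

```toml
# Hypothetical extras in xgboost's pyproject.toml; neither a "cu11" nor a
# "cu12" extra exists today, this only illustrates the idea.
[project.optional-dependencies]
cu11 = ["nvidia-nccl-cu11"]
cu12 = ["nvidia-nccl-cu12"]
```

Users would then opt in explicitly, e.g. pip install "xgboost[cu12]", instead of always getting nvidia-nccl-cu12.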

trivialfis commented 1 month ago

May I ask what specific problem you ran into?

bryant1410 commented 1 month ago

After initializing PyTorch distributed mode, calling broadcast_object_list ran into a weird NCCL error:

  File "file.py", line 2, in function
    dist.broadcast_object_list(objects, src=src)
  File "env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2901, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2205, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.22.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'

It got fixed after removing the nvidia-nccl-cu12 package that the xgboost installation had pulled in.
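For anyone hitting the same thing, that removal amounts to uninstalling the CUDA 12 NCCL wheel from the environment:

```sh
# Remove the CUDA 12 NCCL wheel pulled in by xgboost, so PyTorch stops
# loading it in place of the CUDA 11.8 build it was installed against.
pip uninstall -y nvidia-nccl-cu12
```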