Open bryant1410 opened 1 month ago
Thank you for raising an issue. I don't know what the right solution is either. As a workaround, one can either use conda/mamba, or install xgboost without its dependencies (`pip install --no-deps`), then pick up those dependencies later.
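A sketch of that workaround (the follow-up package list here is illustrative, not xgboost's exact dependency set; pin versions as needed):

```shell
# Install the xgboost wheel alone, skipping its declared dependencies
pip install --no-deps xgboost
# Then install the dependencies yourself, choosing the CUDA 11 NCCL wheel
# instead of the cu12 one (package names as published on PyPI)
pip install numpy scipy nvidia-nccl-cu11
```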
The binary wheel itself can use cu11; it's just a question of how to specify that in the `pyproject.toml`.
May I ask what specific problem you ran into?
After doing a PyTorch distributed init, when calling `broadcast_object_list`, I ran into a weird NCCL error:
```
  File "file.py", line 2, in function
    dist.broadcast_object_list(objects, src=src)
  File "env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2901, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2205, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.22.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'CUDA driver version is insufficient for CUDA runtime version'
```
It got fixed after removing the `nvidia-nccl-cu12` package that the xgboost installation had pulled in.
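For anyone debugging the same mismatch, here is a small sketch that lists which NCCL wheel variants are present in the environment (a hypothetical helper, not part of xgboost or torch; the wheel names are the ones NVIDIA publishes on PyPI):

```python
# Sketch: report which nvidia-nccl-* wheel variants are installed, to spot
# a cu12 NCCL sitting next to a CUDA 11 PyTorch build.
from importlib.metadata import PackageNotFoundError, version


def installed_nccl_variants():
    """Return a mapping of installed NCCL wheel names to their versions."""
    found = {}
    for name in ("nvidia-nccl-cu11", "nvidia-nccl-cu12"):
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            pass  # that variant is not installed
    return found


if __name__ == "__main__":
    print(installed_nccl_variants())
```

If this reports `nvidia-nccl-cu12` in an environment with a CUDA 11 PyTorch build, `pip uninstall nvidia-nccl-cu12` is the same fix described above.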
XGBoost for Python depends on `nvidia-nccl-cu12`, which is for CUDA 12. I have a PyTorch 2.4.0 installation for CUDA 11.8, but when I use distributed mode, PyTorch picks up the NCCL installed by XGBoost, which causes problems in my environment.

My workaround is to install the CPU-only version of XGBoost. However, I still want to use XGBoost with CUDA support. It'd be nice if I could use it with `nvidia-nccl-cu11` instead. I'm not sure what the solution could be (maybe optional groups of dependencies for XGBoost, such as a cu11 one; or a different package). Note this could become a problem again when CUDA 13 comes out.
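The optional-dependency idea could look roughly like this in xgboost's `pyproject.toml` (a sketch only; the extras names and whether the wheel can be split this way are exactly the open questions in this issue):

```toml
[project.optional-dependencies]
# Hypothetical extras so users can pick the NCCL build matching their driver
cu11 = ["nvidia-nccl-cu11"]
cu12 = ["nvidia-nccl-cu12"]
```

Users would then install, e.g., `pip install xgboost[cu11]`.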