NCCL graph and topology incompatible with A100

r-b-g-b commented 4 months ago

I'm using the ubuntu-hpc 2204 x64 Gen 2 image on a Standard NC24ads A100 v4 VM.

I train a vLLM model that uses NCCL and observe the following error:

Error

::16674:16674 [0] NCCL INFO Bootstrap : Using eth0:10.1.0.4<0>
::16674:16674 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
::16674:16674 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
::16674:16674 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.6+cuda11.8
::16674:17023 [0] NCCL INFO NET/IB : Using [0]mlx5_an0:1/RoCE [RO]; OOB eth0:10.1.0.4<0>
::16674:17023 [0] NCCL INFO Using network IB
::16674:17023 [0] NCCL INFO comm 0x5640f24752d0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 100000 commId 0x2064c00fc2f91516 - Init START
::16674:17023 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /opt/microsoft/ncv4/topo.xml
::16674:17023 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
::16674:17023 [0] NCCL INFO NCCL_GRAPH_FILE set by environment to /opt/microsoft/ncv4/graph.xml
::16674:17023 [0] graph/search.cc:703 NCCL WARN XML Import Channel : dev 1 not found.
::16674:17023 [0] NCCL INFO graph/search.cc:733 -> 2
::16674:17023 [0] NCCL INFO graph/search.cc:740 -> 2
::16674:17023 [0] NCCL INFO graph/search.cc:840 -> 2
::16674:17023 [0] NCCL INFO init.cc:880 -> 2
::16674:17023 [0] NCCL INFO init.cc:1358 -> 2
::16674:17023 [0] NCCL INFO group.cc:65 -> 2 [Async thread]
::16674:16674 [0] NCCL INFO group.cc:406 -> 2
::16674:16674 [0] NCCL INFO group.cc:96 -> 2
Traceback (most recent call last):
...
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 356, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 111, in __init__
    self._init_workers()
  File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 151, in _init_workers
    self._run_workers("init_model")
  File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/worker/worker.py", line 84, in init_model
    init_distributed_environment(self.parallel_config, self.rank,
  File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/vllm/worker/worker.py", line 253, in init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
  File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/azureuser/miniconda3/envs/snomedct/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1702400366987/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
XML Import Channel : dev 1 not found.
::16674:16674 [0] NCCL INFO comm 0x5640f24752d0 rank 0 nranks 1 cudaDev 0 busId 100000 - Abort COMPLETE

This is a single-GPU machine, but /opt/microsoft/ncv4/graph.xml and topo.xml reference 4 GPUs. If I edit them to reference a single GPU, everything works.
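For anyone wanting to confirm the mismatch before editing the files, here is a minimal sketch that lists the GPU indices a graph file references but the VM does not have. It assumes the graph XML lists GPUs as `<gpu dev="N"/>` elements; the sample fragment and the helper names `gpu_devs_in_graph` and `missing_devs` are hypothetical, not the real contents of /opt/microsoft/ncv4/graph.xml or anything shipped in the image.

```python
import xml.etree.ElementTree as ET


def gpu_devs_in_graph(graph_xml):
    """Collect the `dev` index of every <gpu> element in a graph XML string."""
    root = ET.fromstring(graph_xml)
    return {int(node.attrib["dev"]) for node in root.iter("gpu") if "dev" in node.attrib}


def missing_devs(graph_xml, visible_gpus):
    """Return the GPU indices the graph references but the VM does not expose."""
    return sorted(d for d in gpu_devs_in_graph(graph_xml) if d >= visible_gpus)


# Hypothetical 4-GPU fragment for illustration only (assumed layout).
SAMPLE_GRAPH = """
<graphs version="1">
  <graph id="0" pattern="4" nchannels="1">
    <channel>
      <gpu dev="0"/>
      <gpu dev="1"/>
      <gpu dev="2"/>
      <gpu dev="3"/>
    </channel>
  </graph>
</graphs>
"""

if __name__ == "__main__":
    # With one visible GPU, every index above 0 is unresolvable.
    print(missing_devs(SAMPLE_GRAPH, visible_gpus=1))
```

To check the real file, read /opt/microsoft/ncv4/graph.xml into a string and pass the GPU count of the VM (1 on the VM above). A non-empty result would be consistent with the `dev 1 not found` warning in the log.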

graph.xml

```xml ```

topo.xml

```xml ```

LiquidPT commented 4 months ago

Looking into it.

yosoyjay commented 2 months ago

I ran into the same issue on Standard_NC48ads_A100_v4 with these images, and the same fix worked.

jithinjosepkl commented 1 month ago

Yes, the topo/graph files are not needed for the smaller NCv4 VM sizes. The next VM image release will no longer load them automatically (via /etc/nccl.conf) for these VM sizes. Until then, please delete the references to the topo/graph files and it should work.
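A rough sketch of that workaround, assuming /etc/nccl.conf holds plain `KEY=VALUE` lines and that the references are the `NCCL_TOPO_FILE` and `NCCL_GRAPH_FILE` entries seen in the log above. The helper name `strip_topo_graph` and the sample contents (including the `NCCL_DEBUG` line) are hypothetical; check the actual file on your image before changing it.

```python
# Settings to drop; both names appear in the NCCL log in this issue.
DROP_KEYS = ("NCCL_TOPO_FILE", "NCCL_GRAPH_FILE")


def strip_topo_graph(conf_text):
    """Return conf_text without the lines that set the topo/graph files."""
    kept = []
    for line in conf_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in DROP_KEYS:
            continue  # skip the reference to the 4-GPU topo/graph file
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Hypothetical file contents for illustration only.
SAMPLE_CONF = (
    "NCCL_TOPO_FILE=/opt/microsoft/ncv4/topo.xml\n"
    "NCCL_GRAPH_FILE=/opt/microsoft/ncv4/graph.xml\n"
    "NCCL_DEBUG=INFO\n"
)

if __name__ == "__main__":
    print(strip_topo_graph(SAMPLE_CONF), end="")
```

Applying this to the real file means reading /etc/nccl.conf, writing the result back with root privileges, and keeping a backup copy first. Editing the file by hand is equivalent.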

Azure / azhpc-images

NCCL graph and topology incompatible with A100 #327