NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

The multi-gpu tests always hang and NCCL cannot find CUDA #115

Open · SusuXu opened this issue 2 years ago

SusuXu commented 2 years ago

This is the error message we get when trying to run the test: we see the CUDA warning and NET messages below, and the test eventually hangs. It runs fine with a single GPU but fails when using two GPUs. We have set CUDA_HOME to point to /usr/local/cuda, but NCCL keeps reporting that it cannot find CUDA.

mpirun -np 2 -H localhost:2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib --allow-run-as-root ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1

nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

 Using devices
   Rank  0 Pid  28329 on   SAIS1020 device  0 [0x01] NVIDIA RTX A6000
   Rank  1 Pid  28330 on   SAIS1020 device  1 [0x41] NVIDIA RTX A6000
SAIS1020:28329:28329 [0] NCCL INFO Bootstrap : Using enp2s0f0:hided<0>
SAIS1020:28329:28329 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
SAIS1020:28329:28329 [0] misc/cudawrap.cc:90 NCCL WARN Failed to find CUDA library in /usr/local/cuda-11.7 (NCCL_CUDA_PATH=/usr/local/cuda-11.7)
NCCL version 2.14.3+cuda11.7
SAIS1020:28329:28341 [0] NCCL INFO NET/IB : No device found.
SAIS1020:28329:28341 [0] NCCL INFO NET/Socket : Using [0]enp2s0f0:hided<0> [1]vethfe5f4ae:fe80::d850:75ff:fe81:4b74%vethfe5f4ae<0> [2]vethfe50d79:fe80::f0a0:6eff:fe19:5fce%vethfe50d79<0> [3]veth90e6240:fe80::68ae:39ff:fe86:382a%veth90e6240<0>
SAIS1020:28329:28341 [0] NCCL INFO Using network Socket

SAIS1020:28330:28330 [1] misc/cudawrap.cc:90 NCCL WARN Failed to find CUDA library in /usr/local/cuda-11.7 (NCCL_CUDA_PATH=/usr/local/cuda-11.7)
SAIS1020:28330:28330 [1] NCCL INFO Bootstrap : Using enp2s0f0:(hided)
SAIS1020:28330:28330 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
SAIS1020:28330:28343 [1] NCCL INFO NET/IB : No device found.
SAIS1020:28330:28343 [1] NCCL INFO NET/Socket : Using [0]enp2s0f0:(hided) [1]vethfe5f4ae:fe80::d850:75ff:fe81:4b74%vethfe5f4ae<0> [2]vethfe50d79:fe80::f0a0:6eff:fe19:5fce%vethfe50d79<0> [3]veth90e6240:fe80::68ae:39ff:fe86:382a%veth90e6240<0>
SAIS1020:28330:28343 [1] NCCL INFO Using network Socket
SAIS1020:28329:28341 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ffffffff
SAIS1020:28330:28343 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff
SAIS1020:28329:28341 [0] NCCL INFO Channel 00/04 :    0   1
SAIS1020:28329:28341 [0] NCCL INFO Channel 01/04 :    0   1
SAIS1020:28329:28341 [0] NCCL INFO Channel 02/04 :    0   1
SAIS1020:28329:28341 [0] NCCL INFO Channel 03/04 :    0   1
SAIS1020:28330:28343 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
SAIS1020:28329:28341 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
SAIS1020:28330:28343 [1] NCCL INFO Channel 00/0 : 1[41000] -> 0[1000] via P2P/IPC
SAIS1020:28329:28341 [0] NCCL INFO Channel 00/0 : 0[1000] -> 1[41000] via P2P/IPC
SAIS1020:28330:28343 [1] NCCL INFO Channel 01/0 : 1[41000] -> 0[1000] via P2P/IPC
SAIS1020:28329:28341 [0] NCCL INFO Channel 01/0 : 0[1000] -> 1[41000] via P2P/IPC
SAIS1020:28330:28343 [1] NCCL INFO Channel 02/0 : 1[41000] -> 0[1000] via P2P/IPC
SAIS1020:28330:28343 [1] NCCL INFO Channel 03/0 : 1[41000] -> 0[1000] via P2P/IPC
SAIS1020:28329:28341 [0] NCCL INFO Channel 02/0 : 0[1000] -> 1[41000] via P2P/IPC
SAIS1020:28329:28341 [0] NCCL INFO Channel 03/0 : 0[1000] -> 1[41000] via P2P/IPC
SAIS1020:28330:28343 [1] NCCL INFO Connected all rings
SAIS1020:28330:28343 [1] NCCL INFO Connected all trees
SAIS1020:28330:28343 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
SAIS1020:28330:28343 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
SAIS1020:28329:28341 [0] NCCL INFO Connected all rings
SAIS1020:28329:28341 [0] NCCL INFO Connected all trees
SAIS1020:28329:28341 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
SAIS1020:28329:28341 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
SAIS1020:28330:28343 [1] NCCL INFO comm 0x55b2191391f0 rank 1 nranks 2 cudaDev 1 busId 41000 - Init COMPLETE
SAIS1020:28329:28341 [0] NCCL INFO comm 0x55e7f89886a0 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE
                                                              out-of-place                       in-place          
       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)    

@sjeaugey
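
A note on the "Failed to find CUDA library" warning above: NCCL loads the CUDA driver library (libcuda.so) at runtime, and the warning shows it was told to look under /usr/local/cuda-11.7 via NCCL_CUDA_PATH. On a standard Ubuntu driver install the real libcuda.so is installed with the NVIDIA driver, typically in /usr/lib/x86_64-linux-gnu, not under the toolkit directory. A quick check, as a sketch assuming that layout:

# Locate the driver library that NCCL needs to load.
ldconfig -p | grep libcuda

# If NCCL_CUDA_PATH (or CUDA_HOME) points at a directory that does not contain
# libcuda.so, one option is to unset it so NCCL falls back to the normal
# dynamic-loader search path, then rerun the original test:
unset NCCL_CUDA_PATH
mpirun -np 2 -H localhost:2 -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib --allow-run-as-root \
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1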

SusuXu commented 2 years ago

Here is some environment info. System: Ubuntu 18.04, CUDA: 11.7, NCCL: 2.14.3, GPUs: 4x NVIDIA RTX A6000.

nvidia-smi topo -m
    GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X  SYS SYS SYS 0-47        N/A
GPU1    SYS  X  SYS SYS 0-47        N/A
GPU2    SYS SYS  X  SYS 0-47        N/A
GPU3    SYS SYS SYS  X  0-47        N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

SusuXu commented 2 years ago

It turns out to be a bandwidth issue. After setting NCCL_P2P_DISABLE=1, the test runs through, but with very low bandwidth. We are using an AMD EPYC 7402P CPU; could that be an issue?
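
For reference, a sketch of how the P2P-disabled run can be launched, assuming Open MPI's -x VAR=value form to forward the variable to both ranks (otherwise the same flags as the original command):

mpirun -np 2 -H localhost:2 -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_P2P_DISABLE=1 -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib --allow-run-as-root \
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1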

sjeaugey commented 2 years ago

From what you describe, it would seem that GPU Direct is not functional through the CPU. You may want to disable the IOMMU by adding "iommu=pt" to the kernel command-line arguments.
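
A sketch of one way to add that option, assuming a stock Ubuntu/GRUB setup (edit /etc/default/grub by hand if you prefer; other bootloaders differ):

# Append iommu=pt to the default kernel command line.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 iommu=pt"/' /etc/default/grub
sudo update-grub
sudo reboot

# After the reboot, confirm the option is active:
cat /proc/cmdline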

Without P2P, all traffic goes through CPU memory. On AMD CPUs, the performance of GPU accesses to CPU memory is heavily impacted by the NPS (NUMA nodes per socket) setting in the BIOS. I'd advise trying NPS=4 and seeing whether performance improves.
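
NPS itself is a BIOS/firmware setting, so there is nothing to change from the OS side, but after switching it one can verify that the extra NUMA nodes are visible to Linux (a quick check with standard tools; numactl may need to be installed, and with NPS=4 a single-socket EPYC should expose 4 NUMA nodes):

lscpu | grep -i "numa node"
numactl --hardware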

SusuXu commented 2 years ago

Thanks! Setting amd_iommu=off addresses the communication problem, though the average bandwidth is still low (~6GB/s). Setting NPS to 4 only increases the bandwidth by about 1GB/s.
Thanks for your help!

sjeaugey commented 2 years ago

"the average bandwidth is still low (~6GB/s)."

Do you mean the "Average BW" printed at the end? That number has no meaning when you run with -b 8 -e 1G -f 2, since it averages the bandwidth over all sizes, and bandwidth is not meaningful for very small operations, which are purely latency-bound.

What is the BusBW of large operations, i.e. what is the value to which the BusBW is converging as sizes increase?
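
To read that value off directly, one can restrict the sweep to large sizes only, reusing the same launch line but with a larger minimum size (a sketch; -b/-e/-f have the same meaning as in the original command, and the tests accept size suffixes such as M and G):

mpirun -np 2 -H localhost:2 -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib --allow-run-as-root \
    ./build/all_reduce_perf -b 128M -e 1G -f 2 -g 1

The busbw reported for the largest sizes is the number to compare against the expected PCIe/host-memory bandwidth.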