NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL panic with nccl-test with 2 GPUs inside kubevirt VM #1373

Open · winsopc opened 4 months ago

winsopc commented 4 months ago

I have a setup with 2 nodes, each running a kubevirt VM with 2 GPUs:

```
mpirun --allow-run-as-root --show-progress -H 10.194.9.3,10.194.10.5 -map-by node -np 2 -x PATH -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=WARN -x NCCL_TOPO_DUMP_FILE=system.txt -x NCCL_SOCKET_IFNAME=net2 -x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_DEBUG_FILE=/root/nccl.log /root/nccl-tests/build/all_reduce_perf -b 8M -e 16M -f 2 -g 2
Warning: Permanently added '10.194.10.5' (ED25519) to the list of known hosts.
Authorized uses only. All activity may be monitored and reported.
App launch reported: 2 (out of 2) daemons - 1 (out of 2) procs
```

Test output: https://gist.githubusercontent.com/winsopc/d72d093998c36c7e4b5f26e70bf5156b/raw/e479c7afe25aca25fe566c81bf912fce37512ce3/gistfile1.txt

winsopc commented 4 months ago

When I run the same test with 1 GPU per KVM VM, it works, e.g.:

```
#  Rank  0 Group  0 Pid  21033 on node-001-a device  0 [0x07] NVIDIA GeForce RTX 4080
#  Rank  1 Group  0 Pid  18934 on node-001-b device  0 [0x07] NVIDIA GeForce RTX 4080
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     8388608       2097152     float     sum      -1   5533.5    1.52    1.52      0   3442.7    2.44    2.44      0
    16777216       4194304     float     sum      -1   6679.9    2.51    2.51      0   6660.2    2.52    2.52      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.24581
```
winsopc commented 4 months ago

Not sure if it's related to GPU RDMA support inside the KVM guest? On the bare-metal node there are 2 NUMA nodes, with each GPU on its own NUMA node.
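One thing that may be worth checking inside each VM is whether a GPUDirect RDMA kernel module is actually loaded; without it, the NIC cannot DMA directly to GPU memory. This is a hypothetical diagnostic sketch, not from the thread (the module is `nvidia_peermem` with recent NVIDIA drivers, or the older out-of-tree `nv_peer_mem` DKMS package):

```shell
# Check for a GPUDirect RDMA kernel module inside the VM.
# nvidia_peermem ships with recent NVIDIA drivers; nv_peer_mem is the
# legacy out-of-tree DKMS module. If neither is loaded, GPU<->NIC RDMA
# will not work even though plain RDMA between hosts does.
lsmod | grep -E 'nvidia_peermem|nv_peer_mem' \
  || echo "no GPUDirect RDMA module loaded"
```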

AddyLaddy commented 4 months ago

I think we'd need to see the output with `NCCL_DEBUG=INFO` to be able to help.

winsopc commented 4 months ago
```
mpirun --allow-run-as-root --show-progress -H 10.194.9.3,10.194.10.5 -map-by node -np 2 -x PATH -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=INFO -x NCCL_TOPO_DUMP_FILE=system.txt -x NCCL_SOCKET_IFNAME=net2 -x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_DEBUG_FILE=/root/nccl.log /root/nccl-tests/build/all_reduce_perf -b 8M -e 16M -f 2 -g 2
Warning: Permanently added '10.194.10.5' (ED25519) to the list of known hosts.
Authorized uses only. All activity may be monitored and reported.
App launch reported: 2 (out of 2) daemons - 1 (out of 2) procs
# nThread 1 nGpus 2 minBytes 8388608 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  21356 on node-001-a device  0 [0x07] NVIDIA GeForce RTX 4080
#  Rank  1 Group  0 Pid  21356 on node-001-a device  1 [0x08] NVIDIA GeForce RTX 4080
#  Rank  2 Group  0 Pid  19447 on node-001-b device  0 [0x07] NVIDIA GeForce RTX 4080
#  Rank  3 Group  0 Pid  19447 on node-001-b device  1 [0x08] NVIDIA GeForce RTX 4080
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
[node-001-b:19447] *** Process received signal ***
[node-001-b:19447] Signal: Segmentation fault (11)
[node-001-b:19447] Signal code: Address not mapped (1)
[node-001-b:19447] Failing at address: (nil)
[node-001-b:19447] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f41acf2b520]
[node-001-b:19447] [ 1] /root/nccl/build/lib/libnccl.so.2(+0x50dc2)[0x7f41ad475dc2]
[node-001-b:19447] [ 2] /root/nccl/build/lib/libnccl.so.2(+0x51527)[0x7f41ad476527]
[node-001-b:19447] [ 3] /root/nccl/build/lib/libnccl.so.2(+0x52703)[0x7f41ad477703]
[node-001-b:19447] [ 4] /root/nccl/build/lib/libnccl.so.2(ncclGroupEnd+0x6e)[0x7f41ad477eee]
[node-001-b:19447] [ 5] /root/nccl-tests/build/all_reduce_perf(+0x8e96)[0x55cee915ee96]
[node-001-b:19447] [ 6] /root/nccl-tests/build/all_reduce_perf(+0xd5d3)[0x55cee91635d3]
[node-001-b:19447] [ 7] /root/nccl-tests/build/all_reduce_perf(+0x6ca4)[0x55cee915cca4]
[node-001-b:19447] [ 8] /root/nccl-tests/build/all_reduce_perf(+0x7460)[0x55cee915d460]
[node-001-b:19447] [ 9] /root/nccl-tests/build/all_reduce_perf(+0xb4c6)[0x55cee91614c6]
[node-001-b:19447] [10] /root/nccl-tests/build/all_reduce_perf(+0x41ab)[0x55cee915a1ab]
[node-001-b:19447] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f41acf12d90]
[node-001-b:19447] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f41acf12e40]
[node-001-b:19447] [13] /root/nccl-tests/build/all_reduce_perf(+0x6a35)[0x55cee915ca35]
[node-001-b:19447] *** End of error message ***
[node-001-a:21356] *** Process received signal ***
[node-001-a:21356] Signal: Segmentation fault (11)
[node-001-a:21356] Signal code: Address not mapped (1)
[node-001-a:21356] Failing at address: (nil)
[node-001-a:21356] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7feae8c62520]
[node-001-a:21356] [ 1] /root/nccl/build/lib/libnccl.so.2(+0x50dc2)[0x7feae91acdc2]
[node-001-a:21356] [ 2] /root/nccl/build/lib/libnccl.so.2(+0x51527)[0x7feae91ad527]
[node-001-a:21356] [ 3] /root/nccl/build/lib/libnccl.so.2(+0x52703)[0x7feae91ae703]
[node-001-a:21356] [ 4] /root/nccl/build/lib/libnccl.so.2(ncclGroupEnd+0x6e)[0x7feae91aeeee]
[node-001-a:21356] [ 5] /root/nccl-tests/build/all_reduce_perf(+0x8e96)[0x5598db51ce96]
[node-001-a:21356] [ 6] /root/nccl-tests/build/all_reduce_perf(+0xd5d3)[0x5598db5215d3]
[node-001-a:21356] [ 7] /root/nccl-tests/build/all_reduce_perf(+0x6ca4)[0x5598db51aca4]
[node-001-a:21356] [ 8] /root/nccl-tests/build/all_reduce_perf(+0x7460)[0x5598db51b460]
[node-001-a:21356] [ 9] /root/nccl-tests/build/all_reduce_perf(+0xb4c6)[0x5598db51f4c6]
[node-001-a:21356] [10] /root/nccl-tests/build/all_reduce_perf(+0x41ab)[0x5598db5181ab]
[node-001-a:21356] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7feae8c49d90]
[node-001-a:21356] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7feae8c49e40]
[node-001-a:21356] [13] /root/nccl-tests/build/all_reduce_perf(+0x6a35)[0x5598db51aa35]
[node-001-a:21356] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node node-001-a exited on signal 11 (Segmentation fault).
```
kiskra-nvidia commented 4 months ago

We need to see the content of the `/root/nccl.log` files (where all the debug info went). Probably also the `system.txt` topo file. The output of `nvidia-smi topo -m` might also come in handy. Thanks!

winsopc commented 4 months ago

```
cat nccl.log
node-001-b:19447:19447 [0] NCCL INFO cudaDriverVersion 12050
node-001-b:19447:19447 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to net2
node-001-b:19447:19447 [0] NCCL INFO Bootstrap : Using net2:10.194.10.5<0>
node-001-b:19447:19447 [0] NCCL INFO NCCL version 2.22.3+cuda12.5
node-001-b:19447:19454 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
node-001-b:19447:19454 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to net2
node-001-b:19447:19454 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [RO]; OOB net2:10.194.10.5<0>
node-001-b:19447:19454 [1] NCCL INFO Using network IB
node-001-b:19447:19453 [0] NCCL INFO Using network IB
node-001-b:19447:19454 [1] NCCL INFO ncclCommInitRank comm 0x55ceecc88fb0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 8000 commId 0xe12be0f58c2ac916 - Init START
node-001-b:19447:19453 [0] NCCL INFO ncclCommInitRank comm 0x55ceecc53070 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId 7000 commId 0xe12be0f58c2ac916 - Init START
node-001-b:19447:19454 [1] NCCL INFO comm 0x55ceecc88fb0 rank 3 nRanks 4 nNodes 2 localRanks 2 localRank 1 MNNVL 0
node-001-b:19447:19454 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
node-001-b:19447:19454 [1] NCCL INFO P2P Chunksize set to 131072
node-001-b:19447:19453 [0] NCCL INFO comm 0x55ceecc53070 rank 2 nRanks 4 nNodes 2 localRanks 2 localRank 0 MNNVL 0
node-001-b:19447:19453 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
node-001-b:19447:19453 [0] NCCL INFO P2P Chunksize set to 131072
node-001-b:19447:19453 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
node-001-b:19447:19453 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
node-001-b:19447:19454 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
node-001-b:19447:19454 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
node-001-b:19447:19454 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
node-001-b:19447:19454 [1] NCCL INFO ncclCommInitRank comm 0x55ceecc88fb0 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 8000 commId 0xe12be0f58c2ac916 - Init COMPLETE
node-001-b:19447:19454 [1] NCCL INFO Init timings: rank 3 nranks 4 total 0.46 (kernels 0.35, bootstrap 0.08, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.00, rest 0.00)
node-001-b:19447:19453 [0] NCCL INFO ncclCommInitRank comm 0x55ceecc53070 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId 7000 commId 0xe12be0f58c2ac916 - Init COMPLETE
node-001-b:19447:19453 [0] NCCL INFO Init timings: rank 2 nranks 4 total 0.46 (kernels 0.37, bootstrap 0.06, allgathers 0.01, topo 0.02, graphs 0.00, connections 0.00, rest 0.00)
node-001-b:19447:19462 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/IB/0
node-001-b:19447:19462 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/IB/0
node-001-b:19447:19462 [0] NCCL INFO Channel 00/0 : 2[0] -> 3[1] via P2P/direct pointer
node-001-b:19447:19462 [0] NCCL INFO Channel 01/0 : 2[0] -> 3[1] via P2P/direct pointer
node-001-b:19447:19461 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/IB/1
node-001-b:19447:19461 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/IB/1
node-001-b:19447:19457 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.

node-001-b:19447:19458 [0] misc/ibvwrap.cc:206 NCCL WARN Call to ibv_modify_qp failed with error Connection timed out errno 110
node-001-b:19447:19458 [0] NCCL INFO transport/net_ib.cc:1011 -> 2
node-001-b:19447:19458 [0] NCCL INFO transport/net_ib.cc:1354 -> 2
node-001-b:19447:19458 [0] NCCL INFO transport/net.cc:850 -> 2
node-001-b:19447:19462 [0] NCCL INFO transport/net.cc:414 -> 2
node-001-b:19447:19462 [0] NCCL INFO transport.cc:184 -> 2
node-001-b:19447:19462 [0] NCCL INFO transport/generic.cc:11 -> 2
node-001-b:19447:19462 [0] NCCL INFO group.cc:143 -> 2
node-001-b:19447:19462 [0] NCCL INFO group.cc:70 -> 2 [Async thread]
```
winsopc commented 4 months ago
Output of `nvidia-smi topo -m`:

```
        GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-15    0               N/A
GPU1    PHB      X      PHB     PHB     0-15    0               N/A
NIC0    PHB     PHB      X      PHB
NIC1    PHB     PHB     PHB      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
```
winsopc commented 4 months ago
```
ibstat
CA 'mlx5_0'
        CA type: MT4126
        Number of ports: 1
        Firmware version: 22.35.3502
        Hardware version: 0
        Node GUID: 0x0a580afffec10803
        System image GUID: 0xa088c203009342f4
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 25
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x08580afffec10803
                Link layer: Ethernet
CA 'mlx5_1'
        CA type: MT4126
        Number of ports: 1
        Firmware version: 22.35.3502
        Hardware version: 0
        Node GUID: 0x0a580afffec20903
        System image GUID: 0xa088c203009342f4
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 25
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x08580afffec20903
                Link layer: Ethernet
```
winsopc commented 4 months ago

From the messages, the panic happens when a GPU starts RDMA traffic to the other node, whose physical PCIe slot is on another PCIe switch. Here are the OFED packages I installed inside the Ubuntu 22.04 kubevirt VM:

```
iser-dkms/5.8-4.1.5.0,now 5.8-OFED.5.8.4.1.4.1 all [installed,automatic]
isert-dkms/5.8-4.1.5.0,now 5.8-OFED.5.8.4.1.4.1 all [installed,automatic]
mlnx-ofed-kernel-dkms/5.8-4.1.5.0,now 5.8-OFED.5.8.4.1.5.1 all [installed]
mlnx-ofed-kernel-utils/5.8-4.1.5.0,now 5.8-OFED.5.8.4.1.5.1 amd64 [installed]
ofed-scripts/5.8-4.1.5.0,now 5.8-OFED.5.8.4.1.5 amd64 [installed,automatic]
srp-dkms/5.8-4.1.5.0,now 5.8-OFED.5.8.4.1.4.1 all [installed,automatic]
root@nd-zrsjc6g-br2-001-a:~# uname -r
5.15.0-116-generic
root@nd-zrsjc6g-br2-001-a:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
```
AddyLaddy commented 4 months ago

Looks like connectivity issues between the nodes via the RoCE network. Can you run `ib_write_bw` (perftests) between each node and over each NIC to confirm they can communicate via RDMA? Also, is `NCCL_IB_GID_INDEX=3` correct on all nodes? NCCL now has automatic GID detection, so maybe that env var could be dropped.
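For reference, a minimal way to exercise each NIC with perftest might look like the sketch below (the IPs, device names, and GID index 3 are taken from this thread; adjust for your setup):

```shell
# On one VM (here 10.194.9.3), start a server per NIC:
ib_write_bw -d mlx5_0 -x 3 --report_gbits

# On the other VM (10.194.10.5), connect to it over the same NIC:
ib_write_bw -d mlx5_0 -x 3 --report_gbits 10.194.9.3

# Repeat both sides with -d mlx5_1 to cover the second NIC.

# To confirm which GID index corresponds to the RoCEv2 address on each
# port, the MLNX_OFED show_gids script lists DEV/PORT/INDEX/GID/IP/VER:
show_gids
```

If `ib_write_bw` fails or hangs between the nodes, the `ibv_modify_qp ... Connection timed out` warning in the NCCL log is a plain RDMA connectivity problem rather than an NCCL bug.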