NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Test failure caused by ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error Connection timed out. #227

Closed (thsmfe001 closed this issue 1 month ago)

thsmfe001 commented 1 month ago

Two containers on the same server tested fine, but two containers on different servers failed. A Docker-based test also worked across different servers; after switching to K8s, the issue below started happening. Logs from both cases follow.

  1. Two containers on the same server

mpirun -np 2 -x UCX_TLS=tcp -allow-run-as-root -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=1 -x NCCL_SOCKET_IFNAME-=net1 -host 10.10.10.16,10.10.11.3,10.10.20.2,10.10.21.2 /workspace/software/nccl-tests-master/build/reduce_perf -b 8 -e 128M -f 1 -g 1

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank 0 Group 0 Pid 3543 on ai-master device 0 [0xca] NVIDIA L40
#  Rank 1 Group 0 Pid 5725 on ai-worker01 device 0 [0xca] NVIDIA L40

ai-master:3543:3543 [0] NCCL INFO Bootstrap : Using eth0:172.16.121.81<0>
ai-master:3543:3543 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
ai-worker01:5725:5725 [0] NCCL INFO cudaDriverVersion 12040
ai-worker01:5725:5725 [0] NCCL INFO Bootstrap : Using eth0:172.16.121.86<0>
ai-master:3543:3556 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ai-master:3543:3556 [0] NCCL INFO P2P plugin IBext_v8
ai-master:3543:3556 [0] NCCL INFO NET/IB : Using [0]irdma0:1/RoCE [1]irdma3:1/RoCE [RO]; OOB eth0:172.16.121.81<0>
ai-master:3543:3556 [0] NCCL INFO Using non-device net plugin version 0
ai-master:3543:3556 [0] NCCL INFO Using network IBext_v8
ai-worker01:5725:5736 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ai-worker01:5725:5736 [0] NCCL INFO P2P plugin IBext_v8
ai-worker01:5725:5736 [0] NCCL INFO NET/IB : Using [0]irdma0:1/RoCE [1]irdma3:1/RoCE [RO]; OOB eth0:172.16.121.86<0>
ai-worker01:5725:5736 [0] NCCL INFO Using non-device net plugin version 0
ai-worker01:5725:5736 [0] NCCL INFO Using network IBext_v8
ai-master:3543:3556 [0] NCCL INFO ncclCommInitRank comm 0x5619bbb9a000 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId ca000 commId 0xfedd3c4fe2ca42eb - Init START
ai-worker01:5725:5736 [0] NCCL INFO ncclCommInitRank comm 0x5641dd192b70 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId ca000 commId 0xfedd3c4fe2ca42eb - Init START
ai-worker01:5725:5736 [0] NCCL INFO Setting affinity for GPU 0 to aaaa,aaaaaaaa,aaaaaaaa,aaaaaaaa
ai-master:3543:3556 [0] NCCL INFO Setting affinity for GPU 0 to aaaa,aaaaaaaa,aaaaaaaa,aaaaaaaa
ai-master:3543:3556 [0] NCCL INFO comm 0x5619bbb9a000 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
ai-master:3543:3556 [0] NCCL INFO Channel 00/02 : 0 1
ai-master:3543:3556 [0] NCCL INFO Channel 01/02 : 0 1
ai-master:3543:3556 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
ai-master:3543:3556 [0] NCCL INFO P2P Chunksize set to 131072
ai-worker01:5725:5736 [0] NCCL INFO comm 0x5641dd192b70 rank 1 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
ai-worker01:5725:5736 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
ai-worker01:5725:5736 [0] NCCL INFO P2P Chunksize set to 131072
ai-worker01:5725:5736 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IBext_v8/1
ai-master:3543:3556 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/1
ai-worker01:5725:5736 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IBext_v8/1
ai-master:3543:3556 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/1
ai-worker01:5725:5736 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/IBext_v8/1
ai-worker01:5725:5736 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IBext_v8/1
ai-master:3543:3556 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/1
ai-master:3543:3556 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/1
ai-master:3543:3556 [0] NCCL INFO Connected all rings
ai-master:3543:3556 [0] NCCL INFO Connected all trees
ai-worker01:5725:5736 [0] NCCL INFO Connected all rings
ai-worker01:5725:5736 [0] NCCL INFO Connected all trees
ai-worker01:5725:5736 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ai-worker01:5725:5736 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ai-master:3543:3556 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ai-master:3543:3556 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ai-master:3543:3556 [0] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
ai-worker01:5725:5736 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
ai-worker01:5725:5736 [0] NCCL INFO ncclCommInitRank comm 0x5641dd192b70 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId ca000 commId 0xfedd3c4fe2ca42eb - Init COMPLETE
ai-master:3543:3556 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
ai-master:3543:3556 [0] NCCL INFO ncclCommInitRank comm 0x5619bbb9a000 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId ca000 commId 0xfedd3c4fe2ca42eb - Init COMPLETE

  2. Two containers on different servers

mpirun -np 2 -x UCX_TLS=tcp -allow-run-as-root -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=1 -x NCCL_SOCKET_IFNAME-=net1 -host 10.10.11.3,10.10.21.2 /workspace/software/nccl-tests-master/build/reduce_perf -b 8 -e 128M -f 1 -g 1

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank 0 Group 0 Pid 5771 on ai-worker01 device 0 [0xca] NVIDIA L40
#  Rank 1 Group 0 Pid 4940 on ai-worker03 device 0 [0x61] NVIDIA L40

ai-worker01:5771:5771 [0] NCCL INFO Bootstrap : Using eth0:172.16.121.86<0>
ai-worker01:5771:5771 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
ai-worker03:4940:4940 [0] NCCL INFO cudaDriverVersion 12040
ai-worker03:4940:4940 [0] NCCL INFO Bootstrap : Using eth0:172.16.207.69<0>
ai-worker01:5771:5783 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ai-worker01:5771:5783 [0] NCCL INFO P2P plugin IBext_v8
ai-worker01:5771:5783 [0] NCCL INFO NET/IB : Using [0]irdma0:1/RoCE [1]irdma3:1/RoCE [RO]; OOB eth0:172.16.121.86<0>
ai-worker01:5771:5783 [0] NCCL INFO Using non-device net plugin version 0
ai-worker01:5771:5783 [0] NCCL INFO Using network IBext_v8
ai-worker03:4940:4951 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ai-worker03:4940:4951 [0] NCCL INFO P2P plugin IBext_v8
ai-worker03:4940:4951 [0] NCCL INFO NET/IB : Using [0]irdma0:1/RoCE [1]irdma3:1/RoCE [RO]; OOB eth0:172.16.207.69<0>
ai-worker03:4940:4951 [0] NCCL INFO Using non-device net plugin version 0
ai-worker03:4940:4951 [0] NCCL INFO Using network IBext_v8
ai-worker03:4940:4951 [0] NCCL INFO ncclCommInitRank comm 0x55d9607ee240 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 61000 commId 0x1903cb3f44feeaef - Init START
ai-worker01:5771:5783 [0] NCCL INFO ncclCommInitRank comm 0x55f1b4257010 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId ca000 commId 0x1903cb3f44feeaef - Init START
ai-worker01:5771:5783 [0] NCCL INFO Setting affinity for GPU 0 to aaaa,aaaaaaaa,aaaaaaaa,aaaaaaaa
ai-worker03:4940:4951 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555,55555555
ai-worker01:5771:5783 [0] NCCL INFO comm 0x55f1b4257010 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
ai-worker01:5771:5783 [0] NCCL INFO Channel 00/02 : 0 1
ai-worker01:5771:5783 [0] NCCL INFO Channel 01/02 : 0 1
ai-worker01:5771:5783 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
ai-worker01:5771:5783 [0] NCCL INFO P2P Chunksize set to 131072
ai-worker03:4940:4951 [0] NCCL INFO comm 0x55d9607ee240 rank 1 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
ai-worker03:4940:4951 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
ai-worker03:4940:4951 [0] NCCL INFO P2P Chunksize set to 131072
ai-worker01:5771:5783 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/1
ai-worker01:5771:5783 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/1
ai-worker01:5771:5783 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/1
ai-worker01:5771:5783 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/1
ai-worker03:4940:4951 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IBext_v8/0
ai-worker03:4940:4951 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IBext_v8/0
ai-worker03:4940:4951 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/IBext_v8/0
ai-worker03:4940:4951 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IBext_v8/0

ai-worker03:4940:4954 [0] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error Connection timed out
ai-worker03:4940:4954 [0] NCCL INFO transport/net.cc:837 -> 2
ai-worker03:4940:4951 [0] NCCL INFO transport/net.cc:405 -> 2
ai-worker03:4940:4951 [0] NCCL INFO transport.cc:183 -> 2
ai-worker03:4940:4951 [0] NCCL INFO init.cc:1263 -> 2
ai-worker03:4940:4951 [0] NCCL INFO init.cc:1548 -> 2
ai-worker03:4940:4951 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
ai-worker03:4940:4940 [0] NCCL INFO group.cc:418 -> 2
ai-worker03:4940:4940 [0] NCCL INFO group.cc:95 -> 2
ai-worker03: Test NCCL failure common.cu:961 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
.. ai-worker03 pid 4940: Test failure common.cu:844
ai-worker03:4940:4954 [0] NCCL INFO proxy.cc:1436 -> 3

ai-worker03:4940:4954 [0] proxy.cc:1580 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3

ai-worker03:4940:4954 [1767982624] include/alloc.h:39 NCCL WARN Cuda failure 'driver shutting down'
ai-worker03:4940:4954 [2020569680] NCCL INFO transport/net.cc:1000 -> 1
ai-worker03:4940:4954 [541868867] NCCL INFO proxy.cc:988 -> 1
ai-worker03:4940:4954 [541868867] NCCL INFO proxy.cc:1000 -> 1

ai-worker01:5771:5786 [0] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error Connection timed out
ai-worker01:5771:5786 [0] NCCL INFO transport/net.cc:837 -> 2
ai-worker01:5771:5783 [0] NCCL INFO transport/net.cc:405 -> 2
ai-worker01:5771:5783 [0] NCCL INFO transport.cc:183 -> 2
ai-worker01:5771:5783 [0] NCCL INFO init.cc:1263 -> 2
ai-worker01:5771:5783 [0] NCCL INFO init.cc:1548 -> 2
ai-worker01:5771:5783 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
ai-worker01:5771:5771 [0] NCCL INFO group.cc:418 -> 2
ai-worker01:5771:5771 [0] NCCL INFO group.cc:95 -> 2
ai-worker01: Test NCCL failure common.cu:961 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
.. ai-worker01 pid 5771: Test failure common.cu:844

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[39707,1],1] Exit code: 3

kiskra-nvidia commented 1 month ago

Please see https://github.com/NVIDIA/nccl/issues/676#issuecomment-1106236615.

Basically, you are using an external network plugin (/opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so), which is controlled by its own variable, NCCL_IBEXT_DISABLE, that you need to set. You'll probably need to keep NCCL_IB_DISABLE as well, though, to disable both the external plugin and the internal IB support in NCCL.
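For illustration, here is a sketch of how the invocation from the report might look with both variables applied. This is an assumption about the intended setup (the host list and the interface name net1 are copied from the original command, whose NCCL_SOCKET_IFNAME setting also carries a stray - before the =):

```shell
# Hypothetical corrected invocation, based on the failing command from the report.
# NCCL_IB_DISABLE=1 disables NCCL's internal IB/RoCE transport;
# NCCL_IBEXT_DISABLE=1 disables the external HPC-X IB plugin as well,
# forcing NCCL to fall back to the TCP socket transport on net1.
mpirun -np 2 -allow-run-as-root \
    -x UCX_TLS=tcp \
    -x NCCL_DEBUG=INFO \
    -x NCCL_IB_DISABLE=1 \
    -x NCCL_IBEXT_DISABLE=1 \
    -x NCCL_SOCKET_IFNAME=net1 \
    -host 10.10.11.3,10.10.21.2 \
    /workspace/software/nccl-tests-master/build/reduce_perf -b 8 -e 128M -f 1 -g 1
```

With both transports disabled, the NCCL_DEBUG=INFO output should show the socket transport being selected instead of NET/IBext_v8.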

Also, I notice a typo in your command-line invocations: NCCL_SOCKET_IFNAME- has an extra - at the end.

thsmfe001 commented 1 month ago

Thanks to your solution I solved the problem. Thank you so much.