Closed thsmfe001 closed 1 month ago
Please see https://github.com/NVIDIA/nccl/issues/676#issuecomment-1106236615.
Basically, you are using an external network plugin (/opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so), which is controlled by its own variable, NCCL_IBEXT_DISABLE, that you need to use. You'll probably need to keep NCCL_IB_DISABLE as well, though, to disable both the external plugin and the internal IB support in NCCL.
Also, I notice a typo in your command line invocations: NCCL_SOCKET_IFNAME- has an extra - at the end.
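As a rough sketch of the two suggestions above, the reported two-host invocation could be rebuilt with both disable variables set and the trailing - removed from NCCL_SOCKET_IFNAME. The value 1 for NCCL_IBEXT_DISABLE is an assumption that mirrors NCCL_IB_DISABLE=1, and is_valid_name is a hypothetical helper; the hosts, interface name, and binary path are copied from the reported command. The script only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Sketch only: prints a corrected mpirun command instead of running it.
# Assumption: NCCL_IBEXT_DISABLE=1 mirrors the NCCL_IB_DISABLE=1 convention.

# Return 0 if $1 is a valid environment variable name, 1 otherwise.
# This catches typos such as NCCL_SOCKET_IFNAME- (trailing '-').
is_valid_name() {
  case "$1" in
    ''|[0-9]*|*[!A-Za-z0-9_]*) return 1 ;;
    *) return 0 ;;
  esac
}

CMD="mpirun -np 2 -x UCX_TLS=tcp -allow-run-as-root"
for kv in NCCL_DEBUG=INFO NCCL_IB_DISABLE=1 NCCL_IBEXT_DISABLE=1 NCCL_SOCKET_IFNAME=net1; do
  name="${kv%%=*}"
  if ! is_valid_name "$name"; then
    echo "invalid variable name: $name" >&2
    exit 1
  fi
  CMD="$CMD -x $kv"
done
CMD="$CMD -host 10.10.11.3,10.10.21.2"
CMD="$CMD /workspace/software/nccl-tests-master/build/reduce_perf -b 8 -e 128M -f 1 -g 1"

# Print the assembled command for review.
echo "$CMD"
```

Validating the variable names before assembling the command is a design choice for this sketch, not something mpirun does itself.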
Thanks to your solution, I solved the problem. Thank you so much.
The test passes between two containers on the same server but fails between two containers on different servers. The Docker-based test also passed across different servers; the issue below appeared only after switching to K8S. Below are the logs from the two cases.
mpirun -np 2 -x UCX_TLS=tcp -allow-run-as-root -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=1 -x NCCL_SOCKET_IFNAME-=net1 -host 10.10.10.16,10.10.11.3,10.10.20.2,10.10.21.2 /workspace/software/nccl-tests-master/build/reduce_perf -b 8 -e 128M -f 1 -g 1
nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
Using devices
Rank 0 Group 0 Pid 3543 on ai-master device 0 [0xca] NVIDIA L40
Rank 1 Group 0 Pid 5725 on ai-worker01 device 0 [0xca] NVIDIA L40
ai-master:3543:3543 [0] NCCL INFO Bootstrap : Using eth0:172.16.121.81<0>
ai-master:3543:3543 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
ai-worker01:5725:5725 [0] NCCL INFO cudaDriverVersion 12040
ai-worker01:5725:5725 [0] NCCL INFO Bootstrap : Using eth0:172.16.121.86<0>
ai-master:3543:3556 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ai-master:3543:3556 [0] NCCL INFO P2P plugin IBext_v8
ai-master:3543:3556 [0] NCCL INFO NET/IB : Using [0]irdma0:1/RoCE [1]irdma3:1/RoCE [RO]; OOB eth0:172.16.121.81<0>
ai-master:3543:3556 [0] NCCL INFO Using non-device net plugin version 0
ai-master:3543:3556 [0] NCCL INFO Using network IBext_v8
ai-worker01:5725:5736 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ai-worker01:5725:5736 [0] NCCL INFO P2P plugin IBext_v8
ai-worker01:5725:5736 [0] NCCL INFO NET/IB : Using [0]irdma0:1/RoCE [1]irdma3:1/RoCE [RO]; OOB eth0:172.16.121.86<0>
ai-worker01:5725:5736 [0] NCCL INFO Using non-device net plugin version 0
ai-worker01:5725:5736 [0] NCCL INFO Using network IBext_v8
ai-master:3543:3556 [0] NCCL INFO ncclCommInitRank comm 0x5619bbb9a000 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId ca000 commId 0xfedd3c4fe2ca42eb - Init START
ai-worker01:5725:5736 [0] NCCL INFO ncclCommInitRank comm 0x5641dd192b70 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId ca000 commId 0xfedd3c4fe2ca42eb - Init START
ai-worker01:5725:5736 [0] NCCL INFO Setting affinity for GPU 0 to aaaa,aaaaaaaa,aaaaaaaa,aaaaaaaa
ai-master:3543:3556 [0] NCCL INFO Setting affinity for GPU 0 to aaaa,aaaaaaaa,aaaaaaaa,aaaaaaaa
ai-master:3543:3556 [0] NCCL INFO comm 0x5619bbb9a000 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
ai-master:3543:3556 [0] NCCL INFO Channel 00/02 : 0 1
ai-master:3543:3556 [0] NCCL INFO Channel 01/02 : 0 1
ai-master:3543:3556 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
ai-master:3543:3556 [0] NCCL INFO P2P Chunksize set to 131072
ai-worker01:5725:5736 [0] NCCL INFO comm 0x5641dd192b70 rank 1 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
ai-worker01:5725:5736 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
ai-worker01:5725:5736 [0] NCCL INFO P2P Chunksize set to 131072
ai-worker01:5725:5736 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IBext_v8/1
ai-master:3543:3556 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/1
ai-worker01:5725:5736 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IBext_v8/1
ai-master:3543:3556 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/1
ai-worker01:5725:5736 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/IBext_v8/1
ai-worker01:5725:5736 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IBext_v8/1
ai-master:3543:3556 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/1
ai-master:3543:3556 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/1
ai-master:3543:3556 [0] NCCL INFO Connected all rings
ai-master:3543:3556 [0] NCCL INFO Connected all trees
ai-worker01:5725:5736 [0] NCCL INFO Connected all rings
ai-worker01:5725:5736 [0] NCCL INFO Connected all trees
ai-worker01:5725:5736 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ai-worker01:5725:5736 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ai-master:3543:3556 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ai-master:3543:3556 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ai-master:3543:3556 [0] NCCL INFO NCCL_WORK_FIFO_DEPTH set by environment to 4194304.
ai-worker01:5725:5736 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
ai-worker01:5725:5736 [0] NCCL INFO ncclCommInitRank comm 0x5641dd192b70 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId ca000 commId 0xfedd3c4fe2ca42eb - Init COMPLETE
ai-master:3543:3556 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2, using internal tuner instead.
ai-master:3543:3556 [0] NCCL INFO ncclCommInitRank comm 0x5619bbb9a000 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId ca000 commId 0xfedd3c4fe2ca42eb - Init COMPLETE
mpirun -np 2 -x UCX_TLS=tcp -allow-run-as-root -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=1 -x NCCL_SOCKET_IFNAME-=net1 -host 10.10.11.3,10.10.21.2 /workspace/software/nccl-tests-master/build/reduce_perf -b 8 -e 128M -f 1 -g 1
nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
Using devices
Rank 0 Group 0 Pid 5771 on ai-worker01 device 0 [0xca] NVIDIA L40
Rank 1 Group 0 Pid 4940 on ai-worker03 device 0 [0x61] NVIDIA L40
ai-worker01:5771:5771 [0] NCCL INFO Bootstrap : Using eth0:172.16.121.86<0>
ai-worker01:5771:5771 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
ai-worker03:4940:4940 [0] NCCL INFO cudaDriverVersion 12040
ai-worker03:4940:4940 [0] NCCL INFO Bootstrap : Using eth0:172.16.207.69<0>
ai-worker01:5771:5783 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ai-worker01:5771:5783 [0] NCCL INFO P2P plugin IBext_v8
ai-worker01:5771:5783 [0] NCCL INFO NET/IB : Using [0]irdma0:1/RoCE [1]irdma3:1/RoCE [RO]; OOB eth0:172.16.121.86<0>
ai-worker01:5771:5783 [0] NCCL INFO Using non-device net plugin version 0
ai-worker01:5771:5783 [0] NCCL INFO Using network IBext_v8
ai-worker03:4940:4951 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ai-worker03:4940:4951 [0] NCCL INFO P2P plugin IBext_v8
ai-worker03:4940:4951 [0] NCCL INFO NET/IB : Using [0]irdma0:1/RoCE [1]irdma3:1/RoCE [RO]; OOB eth0:172.16.207.69<0>
ai-worker03:4940:4951 [0] NCCL INFO Using non-device net plugin version 0
ai-worker03:4940:4951 [0] NCCL INFO Using network IBext_v8
ai-worker03:4940:4951 [0] NCCL INFO ncclCommInitRank comm 0x55d9607ee240 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 61000 commId 0x1903cb3f44feeaef - Init START
ai-worker01:5771:5783 [0] NCCL INFO ncclCommInitRank comm 0x55f1b4257010 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId ca000 commId 0x1903cb3f44feeaef - Init START
ai-worker01:5771:5783 [0] NCCL INFO Setting affinity for GPU 0 to aaaa,aaaaaaaa,aaaaaaaa,aaaaaaaa
ai-worker03:4940:4951 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555,55555555
ai-worker01:5771:5783 [0] NCCL INFO comm 0x55f1b4257010 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
ai-worker01:5771:5783 [0] NCCL INFO Channel 00/02 : 0 1
ai-worker01:5771:5783 [0] NCCL INFO Channel 01/02 : 0 1
ai-worker01:5771:5783 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
ai-worker01:5771:5783 [0] NCCL INFO P2P Chunksize set to 131072
ai-worker03:4940:4951 [0] NCCL INFO comm 0x55d9607ee240 rank 1 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
ai-worker03:4940:4951 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
ai-worker03:4940:4951 [0] NCCL INFO P2P Chunksize set to 131072
ai-worker01:5771:5783 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/1
ai-worker01:5771:5783 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IBext_v8/1
ai-worker01:5771:5783 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/1
ai-worker01:5771:5783 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IBext_v8/1
ai-worker03:4940:4951 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IBext_v8/0
ai-worker03:4940:4951 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IBext_v8/0
ai-worker03:4940:4951 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/IBext_v8/0
ai-worker03:4940:4951 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IBext_v8/0
ai-worker03:4940:4954 [0] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error Connection timed out
ai-worker03:4940:4954 [0] NCCL INFO transport/net.cc:837 -> 2
ai-worker03:4940:4951 [0] NCCL INFO transport/net.cc:405 -> 2
ai-worker03:4940:4951 [0] NCCL INFO transport.cc:183 -> 2
ai-worker03:4940:4951 [0] NCCL INFO init.cc:1263 -> 2
ai-worker03:4940:4951 [0] NCCL INFO init.cc:1548 -> 2
ai-worker03:4940:4951 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
ai-worker03:4940:4940 [0] NCCL INFO group.cc:418 -> 2
ai-worker03:4940:4940 [0] NCCL INFO group.cc:95 -> 2
ai-worker03: Test NCCL failure common.cu:961 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
.. ai-worker03 pid 4940: Test failure common.cu:844
ai-worker03:4940:4954 [0] NCCL INFO proxy.cc:1436 -> 3
ai-worker03:4940:4954 [0] proxy.cc:1580 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3
ai-worker03:4940:4954 [1767982624] include/alloc.h:39 NCCL WARN Cuda failure 'driver shutting down'
ai-worker03:4940:4954 [2020569680] NCCL INFO transport/net.cc:1000 -> 1
ai-worker03:4940:4954 [541868867] NCCL INFO proxy.cc:988 -> 1
ai-worker03:4940:4954 [541868867] NCCL INFO proxy.cc:1000 -> 1
ai-worker01:5771:5786 [0] ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error Connection timed out
ai-worker01:5771:5786 [0] NCCL INFO transport/net.cc:837 -> 2
ai-worker01:5771:5783 [0] NCCL INFO transport/net.cc:405 -> 2
ai-worker01:5771:5783 [0] NCCL INFO transport.cc:183 -> 2
ai-worker01:5771:5783 [0] NCCL INFO init.cc:1263 -> 2
ai-worker01:5771:5783 [0] NCCL INFO init.cc:1548 -> 2
ai-worker01:5771:5783 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
ai-worker01:5771:5771 [0] NCCL INFO group.cc:418 -> 2
ai-worker01:5771:5771 [0] NCCL INFO group.cc:95 -> 2
ai-worker01: Test NCCL failure common.cu:961 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
.. ai-worker01 pid 5771: Test failure common.cu:844
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[39707,1],1] Exit code: 3
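Both logs report "Using network IBext_v8" even though NCCL_IB_DISABLE=1 was passed, which is consistent with the external plugin still being active. A minimal sketch for checking which network NCCL selected after a rerun, assuming the mpirun output was saved to a file (the file name nccl.log and the helper nccl_networks are hypothetical):

```shell
#!/bin/sh
# Print the distinct networks NCCL reports selecting in a saved log.
# Relies on the "NCCL INFO Using network <name>" entries seen in the logs;
# grep -o is used so that several entries on one physical line are all found.
nccl_networks() {
  grep -o 'NCCL INFO Using network [^ ]*' "$1" | awk '{print $NF}' | sort -u
}

# Hypothetical usage: mpirun ... 2>&1 | tee nccl.log
if [ -f nccl.log ]; then
  nccl_networks nccl.log
fi
```

If the output still lists IBext_v8, the plugin was probably not disabled; a socket-only run would be expected to report a different network name.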