Open lifengli137 opened 2 years ago
The output of NCCL_DEBUG=info CUDA_VISIBLE_DEVICES=2,3 ./all_reduce_perf -b 8 -e 16M -f 2 -g 2 while it hangs is as follows:
# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 225 on nccl-tests device 0 [0xc1] NVIDIA RTX A6000
# Rank 1 Pid 225 on nccl-tests device 1 [0xc2] NVIDIA RTX A6000
nccl-tests:225:225 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.2<0>
nccl-tests:225:225 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:225:225 [0] NCCL INFO NET/IB : No device found.
nccl-tests:225:225 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.2<0>
nccl-tests:225:225 [0] NCCL INFO Using network Socket
NCCL version 2.8.2+cuda11.1
nccl-tests:225:240 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
nccl-tests:225:239 [0] NCCL INFO Channel 00/02 : 0 1
nccl-tests:225:239 [0] NCCL INFO Channel 01/02 : 0 1
nccl-tests:225:240 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:225:239 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
nccl-tests:225:239 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:225:240 [1] NCCL INFO Channel 00 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:225:239 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:225:240 [1] NCCL INFO Channel 01 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:225:239 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:225:240 [1] NCCL INFO Connected all rings
nccl-tests:225:240 [1] NCCL INFO Connected all trees
nccl-tests:225:239 [0] NCCL INFO Connected all rings
nccl-tests:225:240 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
nccl-tests:225:239 [0] NCCL INFO Connected all trees
nccl-tests:225:240 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
nccl-tests:225:239 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
nccl-tests:225:239 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
nccl-tests:225:239 [0] NCCL INFO comm 0x7fe5cabec530 rank 0 nranks 2 cudaDev 0 busId c1000 - Init COMPLETE
nccl-tests:225:240 [1] NCCL INFO comm 0x7fe5c0000dc0 rank 1 nranks 2 cudaDev 1 busId c2000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
nccl-tests:225:225 [0] NCCL INFO Launch mode Group/CGMD
I can't think of any reason for that to hang. You're running quite an old version of NCCL though (2.8.2); have you tried the latest version, 2.11.4? I'd also check whether the hang is related to the message size, by using -b and -e to test individual message sizes such as 8, 4K, 64K, 128K, 1M, etc. You could also see whether it's related to a particular protocol by setting NCCL_PROTO=[LL|LL128|SIMPLE] in the environment.
Also, you may want to check whether NCCL_P2P_DISABLE=1 solves the issue. If so, please verify that ACS is disabled.
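The message sizes and protocols suggested above can be swept with a small script. This is only a sketch: the BIN path and the 60-second timeout are assumptions rather than values from the thread, and the function prints the commands (a dry run) instead of executing them. The protocol names follow the spelling in the NCCL documentation.

```shell
# Print one all_reduce_perf command per protocol and message size (dry run).
# BIN and the 60-second timeout are illustrative assumptions.
BIN="${BIN:-./all_reduce_perf}"

print_sweep() {
  for proto in LL LL128 Simple; do
    for size in 8 4K 64K 128K 1M; do
      # Setting -b and -e to the same value tests a single message size.
      echo "NCCL_PROTO=$proto CUDA_VISIBLE_DEVICES=2,3 timeout 60 $BIN -b $size -e $size -g 2"
    done
  done
}

# To run the sweep instead of printing it:
#   print_sweep | while read -r cmd; do
#     echo "== $cmd"
#     sh -c "$cmd" || echo "failed or timed out"
#   done
```

Printing first makes it easy to review the exact commands before running them on the affected GPUs. The timeout keeps a hanging case from blocking the remaining cases.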
Actually, I'm realizing you have NVLink, and the NVLinks are not between 0-1 and 2-3 but between 0-3 and 1-2.
So the 2-3 connection goes through PHB. You should definitely try the latest NCCL (2.8.2 isn't the last version of the 2.8 series, so it has known bugs). If that doesn't fix it, try NCCL_P2P_LEVEL=PXB and see if it works better. If it does, that means P2P is broken through the CPU, which could be due to a variety of factors, including VT-d.
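When comparing runs with different NCCL_P2P_LEVEL or NCCL_P2P_DISABLE settings, it can help to confirm which transport NCCL actually selected. A minimal sketch, assuming the NCCL_DEBUG=INFO output has been saved or piped in, counts the "via ..." suffixes of the channel lines. The function name is hypothetical.

```shell
# Summarize the transport reported on each "Channel ... via ..." line.
# Reads NCCL_DEBUG=INFO output on stdin and prints "<count> via <transport>".
summarize_transports() {
  grep -o 'via .*$' | sort | uniq -c | sed 's/^ *//'
}

# Usage (the log file name is a placeholder):
#   summarize_transports < nccl_debug.log
```

In the logs in this thread, the hanging runs report "via P2P/direct pointer" and the working runs report "via direct shared memory".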
Thank you @AddyLaddy @sjeaugey, I ran the following tests:
# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 26327 on nccl-tests device 0 [0xc1] NVIDIA RTX A6000
# Rank 1 Pid 26327 on nccl-tests device 1 [0xc2] NVIDIA RTX A6000
nccl-tests:26327:26327 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.2<0>
nccl-tests:26327:26327 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:26327:26327 [0] NCCL INFO NET/IB : No device found.
nccl-tests:26327:26327 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.2<0>
nccl-tests:26327:26327 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.1
nccl-tests:26327:26342 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
nccl-tests:26327:26341 [0] NCCL INFO Channel 00/04 : 0 1
nccl-tests:26327:26341 [0] NCCL INFO Channel 01/04 : 0 1
nccl-tests:26327:26341 [0] NCCL INFO Channel 02/04 : 0 1
nccl-tests:26327:26341 [0] NCCL INFO Channel 03/04 : 0 1
nccl-tests:26327:26342 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:26327:26341 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
nccl-tests:26327:26341 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:26327:26341 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:26327:26342 [1] NCCL INFO Channel 00 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:26327:26341 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:26327:26342 [1] NCCL INFO Channel 01 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:26327:26341 [0] NCCL INFO Channel 02 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:26327:26342 [1] NCCL INFO Channel 02 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:26327:26341 [0] NCCL INFO Channel 03 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:26327:26342 [1] NCCL INFO Channel 03 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:26327:26342 [1] NCCL INFO Connected all rings
nccl-tests:26327:26341 [0] NCCL INFO Connected all rings
nccl-tests:26327:26342 [1] NCCL INFO Connected all trees
nccl-tests:26327:26341 [0] NCCL INFO Connected all trees
nccl-tests:26327:26342 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
nccl-tests:26327:26342 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
nccl-tests:26327:26341 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
nccl-tests:26327:26341 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
nccl-tests:26327:26342 [1] NCCL INFO comm 0x7fa644000f60 rank 1 nranks 2 cudaDev 1 busId c2000 - Init COMPLETE
nccl-tests:26327:26341 [0] NCCL INFO comm 0x7fa650000f60 rank 0 nranks 2 cudaDev 0 busId c1000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
nccl-tests:26327:26327 [0] NCCL INFO Launch mode Parallel
# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 245 on nccl-tests device 0 [0xc1] NVIDIA RTX A6000
# Rank 1 Pid 245 on nccl-tests device 1 [0xc2] NVIDIA RTX A6000
nccl-tests:245:245 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.2<0>
nccl-tests:245:245 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:245:245 [0] NCCL INFO NET/IB : No device found.
nccl-tests:245:245 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.2<0>
nccl-tests:245:245 [0] NCCL INFO Using network Socket
NCCL version 2.8.2+cuda11.1
nccl-tests:245:259 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
nccl-tests:245:260 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
nccl-tests:245:259 [0] NCCL INFO Channel 00/02 : 0 1
nccl-tests:245:260 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:245:259 [0] NCCL INFO Channel 01/02 : 0 1
nccl-tests:245:259 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
nccl-tests:245:259 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:245:259 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c2000] via direct shared memory
nccl-tests:245:259 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c2000] via direct shared memory
nccl-tests:245:260 [1] NCCL INFO Channel 00 : 1[c2000] -> 0[c1000] via direct shared memory
nccl-tests:245:260 [1] NCCL INFO Channel 01 : 1[c2000] -> 0[c1000] via direct shared memory
nccl-tests:245:260 [1] NCCL INFO Connected all rings
nccl-tests:245:260 [1] NCCL INFO Connected all trees
nccl-tests:245:260 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
nccl-tests:245:259 [0] NCCL INFO Connected all rings
nccl-tests:245:260 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
nccl-tests:245:259 [0] NCCL INFO Connected all trees
nccl-tests:245:259 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
nccl-tests:245:259 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
nccl-tests:245:259 [0] NCCL INFO comm 0x7f2e62bec5d0 rank 0 nranks 2 cudaDev 0 busId c1000 - Init COMPLETE
nccl-tests:245:260 [1] NCCL INFO comm 0x7f2e54000dc0 rank 1 nranks 2 cudaDev 1 busId c2000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
nccl-tests:245:245 [0] NCCL INFO Launch mode Group/CGMD
8 2 float sum 8.17 0.00 0.00 0e+00 8.19 0.00 0.00 0e+00
16 4 float sum 8.22 0.00 0.00 0e+00 8.27 0.00 0.00 0e+00
32 8 float sum 8.17 0.00 0.00 0e+00 8.16 0.00 0.00 0e+00
64 16 float sum 8.67 0.01 0.01 0e+00 8.21 0.01 0.01 0e+00
128 32 float sum 8.15 0.02 0.02 0e+00 8.31 0.02 0.02 0e+00
256 64 float sum 8.20 0.03 0.03 0e+00 8.14 0.03 0.03 0e+00
512 128 float sum 8.14 0.06 0.06 0e+00 8.17 0.06 0.06 0e+00
1024 256 float sum 8.25 0.12 0.12 0e+00 8.21 0.12 0.12 0e+00
2048 512 float sum 8.35 0.25 0.25 0e+00 8.18 0.25 0.25 0e+00
4096 1024 float sum 8.71 0.47 0.47 0e+00 8.72 0.47 0.47 0e+00
8192 2048 float sum 10.33 0.79 0.79 0e+00 10.26 0.80 0.80 0e+00
16384 4096 float sum 14.42 1.14 1.14 0e+00 14.38 1.14 1.14 0e+00
32768 8192 float sum 20.12 1.63 1.63 0e+00 20.00 1.64 1.64 0e+00
65536 16384 float sum 31.52 2.08 2.08 0e+00 31.91 2.05 2.05 0e+00
131072 32768 float sum 46.36 2.83 2.83 0e+00 46.33 2.83 2.83 0e+00
262144 65536 float sum 73.91 3.55 3.55 0e+00 75.32 3.48 3.48 0e+00
524288 131072 float sum 143.7 3.65 3.65 0e+00 140.2 3.74 3.74 0e+00
1048576 262144 float sum 275.4 3.81 3.81 0e+00 280.4 3.74 3.74 0e+00
2097152 524288 float sum 549.0 3.82 3.82 0e+00 552.0 3.80 3.80 0e+00
4194304 1048576 float sum 1082.6 3.87 3.87 0e+00 1064.3 3.94 3.94 0e+00
8388608 2097152 float sum 2119.3 3.96 3.96 0e+00 2119.0 3.96 3.96 0e+00
16777216 4194304 float sum 4513.5 3.72 3.72 0e+00 4489.3 3.74 3.74 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 1.62784
#
# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 26383 on nccl-tests device 0 [0xc1] NVIDIA RTX A6000
# Rank 1 Pid 26383 on nccl-tests device 1 [0xc2] NVIDIA RTX A6000
nccl-tests:26383:26383 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.2<0>
nccl-tests:26383:26383 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:26383:26383 [0] NCCL INFO NET/IB : No device found.
nccl-tests:26383:26383 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.2<0>
nccl-tests:26383:26383 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.1
nccl-tests:26383:26397 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
nccl-tests:26383:26398 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
nccl-tests:26383:26397 [0] NCCL INFO Channel 00/04 : 0 1
nccl-tests:26383:26397 [0] NCCL INFO Channel 01/04 : 0 1
nccl-tests:26383:26398 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:26383:26397 [0] NCCL INFO Channel 02/04 : 0 1
nccl-tests:26383:26397 [0] NCCL INFO Channel 03/04 : 0 1
nccl-tests:26383:26397 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
nccl-tests:26383:26397 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:26383:26397 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
nccl-tests:26383:26397 [0] NCCL INFO include/shm.h:41 -> 2
nccl-tests:26383:26398 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
nccl-tests:26383:26398 [1] NCCL INFO include/shm.h:41 -> 2
nccl-tests:26383:26397 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-20db55a2180732ae-0-1-0 (size 9637888)
nccl-tests:26383:26398 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-20db55a2180732ae-0-0-1 (size 9637888)
nccl-tests:26383:26397 [0] NCCL INFO transport/shm.cc:100 -> 2
nccl-tests:26383:26398 [1] NCCL INFO transport/shm.cc:100 -> 2
nccl-tests:26383:26397 [0] NCCL INFO transport.cc:34 -> 2
nccl-tests:26383:26398 [1] NCCL INFO transport.cc:34 -> 2
nccl-tests:26383:26397 [0] NCCL INFO transport.cc:87 -> 2
nccl-tests:26383:26398 [1] NCCL INFO transport.cc:87 -> 2
nccl-tests:26383:26397 [0] NCCL INFO init.cc:804 -> 2
nccl-tests:26383:26398 [1] NCCL INFO init.cc:804 -> 2
nccl-tests:26383:26397 [0] NCCL INFO init.cc:941 -> 2
nccl-tests:26383:26398 [1] NCCL INFO init.cc:941 -> 2
nccl-tests:26383:26397 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
nccl-tests:26383:26398 [1] NCCL INFO group.cc:72 -> 2 [Async thread]
nccl-tests:26383:26383 [1] NCCL INFO init.cc:1010 -> 2
nccl-tests: Test NCCL failure common.cu:1098 'unhandled system error'
.. nccl-tests pid 26383: Test failure common.cu:1005
Even with the BIOS settings mentioned above, many entries still show ACSCtl: SrcValid+ in the output of sudo lspci -vvv | grep ACSCtl, which I checked by following this link.
Since the machine uses an AMD PCIe switch, the workaround for Broadcom PLX mentioned in the link is not applicable.
$ sudo lspci -vvv | grep ACSCtl
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
I will forward this open issue to the hardware team.
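A long ACSCtl listing like the one above is easier to compare between runs when reduced to two counts. The following is a minimal sketch, with a hypothetical function name, that reads the lspci text on stdin and counts how many ACSCtl lines have Source Validation on or off.

```shell
# Count ACSCtl lines with SrcValid+ and SrcValid-, reading lspci -vvv text on stdin.
count_srcvalid() {
  awk '/ACSCtl:/ {
         if ($0 ~ /SrcValid\+/) on++
         else if ($0 ~ /SrcValid-/) off++
       }
       END { printf "SrcValid+ %d SrcValid- %d\n", on, off }'
}

# Usage:
#   sudo lspci -vvv | count_srcvalid
```

A result with a SrcValid+ count of zero would indicate that no device in the listing still has that ACS bit set.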
It seems shared memory creation failed on 2.11.4 because /dev/shm is full:
Call to posix_fallocate failed : No space left on device
Anyway, I'll assume shared memory works, but P2P does not work between GPUs 2 and 3. Did you try disabling the IOMMU in the BIOS, or passing iommu=pt on the Linux kernel command line?
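Whether iommu=pt is already in effect can be checked from the running kernel's command line. This is a sketch with a hypothetical helper name; it only inspects a string, and how the option gets added to the boot configuration depends on the distribution.

```shell
# Succeed if the given kernel command line contains iommu=pt as a whole word.
has_iommu_pt() {
  case " $1 " in
    *" iommu=pt "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   if has_iommu_pt "$(cat /proc/cmdline)"; then
#     echo "iommu=pt is set"
#   else
#     echo "iommu=pt is not set"
#   fi
```

Matching on surrounding spaces avoids false positives from other options such as amd_iommu=on.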
You can disable ACS on all devices that support it by running a script such as this one:
Thank you very much @sjeaugey! I will get back to this shared memory issue after the P2P issue is resolved.
Thank you @AddyLaddy!
The following is the output of the script you posted:
sudo bash ./acs.sh
0000:00:07.1 (ecap 000d @2a0) @2a6 0000
0000:00:08.1 (ecap 000d @2a0) @2a6 0000
0000:03:00.0 (ecap 000d @2a0) @2a6 0000
0000:03:00.2 (ecap 000d @2a0) @2a6 0000
0000:04:00.0 (ecap 000d @2a0) @2a6 0000
0000:04:00.2 (ecap 000d @2a0) @2a6 0000
0000:04:00.3 (ecap 000d @2a0) @2a6 0000
0000:40:07.1 (ecap 000d @2a0) @2a6 0000
0000:40:08.1 (ecap 000d @2a0) @2a6 0000
0000:40:08.2 (ecap 000d @2a0) @2a6 0000
0000:40:08.3 (ecap 000d @2a0) @2a6 0000
0000:47:00.0 (ecap 000d @2a0) @2a6 0000
0000:47:00.2 (ecap 000d @2a0) @2a6 0000
0000:48:00.0 (ecap 000d @2a0) @2a6 0000
0000:48:00.1 (ecap 000d @2a0) @2a6 0000
0000:48:00.2 (ecap 000d @2a0) @2a6 0000
0000:48:00.3 (ecap 000d @2a0) @2a6 0000
0000:49:00.0 (ecap 000d @2a0) @2a6 0000
0000:4a:00.0 (ecap 000d @2a0) @2a6 0000
0000:80:07.1 (ecap 000d @2a0) @2a6 0000
0000:80:08.1 (ecap 000d @2a0) @2a6 0000
0000:80:08.2 (ecap 000d @2a0) @2a6 0000
0000:80:08.3 (ecap 000d @2a0) @2a6 0000
0000:83:00.0 (ecap 000d @2a0) @2a6 0000
0000:83:00.2 (ecap 000d @2a0) @2a6 0000
0000:84:00.0 (ecap 000d @2a0) @2a6 0000
0000:84:00.2 (ecap 000d @2a0) @2a6 0000
0000:85:00.0 (ecap 000d @2a0) @2a6 0000
0000:86:00.0 (ecap 000d @2a0) @2a6 0000
0000:c0:07.1 (ecap 000d @2a0) @2a6 0000
0000:c0:08.1 (ecap 000d @2a0) @2a6 0000
0000:c3:00.0 (ecap 000d @2a0) @2a6 0000
0000:c3:00.2 (ecap 000d @2a0) @2a6 0000
0000:c4:00.0 (ecap 000d @2a0) @2a6 0000
0000:c4:00.2 (ecap 000d @2a0) @2a6 0000
After running it, all the SrcValid+ entries appear to have become SrcValid-. The output of sudo lspci -vvv | grep ACSCtl is as follows:
sudo lspci -vvv | grep ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
However, the NCCL test still hangs.
Hi @sjeaugey, we disabled the IOMMU in the BIOS. The error Call to posix_fallocate failed : No space left on device also persisted:
$ NCCL_P2P_DISABLE=1 LD_LIBRARY_PATH=/opt/nccl/build/lib/ CUDA_VISIBLE_DEVICES=2,3 NCCL_DEBUG=INFO /opt/nccl-tests/build/all_reduce_perf -b 8 -e 16M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 286 on nccl-tests device 0 [0xc1] NVIDIA RTX A6000
# Rank 1 Pid 286 on nccl-tests device 1 [0xc2] NVIDIA RTX A6000
nccl-tests:286:286 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.2<0>
nccl-tests:286:286 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:286:286 [0] NCCL INFO NET/IB : No device found.
nccl-tests:286:286 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.2<0>
nccl-tests:286:286 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.1
nccl-tests:286:301 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
nccl-tests:286:300 [0] NCCL INFO Channel 00/04 : 0 1
nccl-tests:286:301 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
nccl-tests:286:300 [0] NCCL INFO Channel 01/04 : 0 1
nccl-tests:286:300 [0] NCCL INFO Channel 02/04 : 0 1
nccl-tests:286:300 [0] NCCL INFO Channel 03/04 : 0 1
nccl-tests:286:300 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
nccl-tests:286:300 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:286:301 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:286:300 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
nccl-tests:286:300 [0] NCCL INFO include/shm.h:41 -> 2
nccl-tests:286:300 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-e3c757a8fd737043-3-1-0 (size 9637888)
nccl-tests:286:300 [0] NCCL INFO transport/shm.cc:100 -> 2
nccl-tests:286:300 [0] NCCL INFO transport.cc:34 -> 2
nccl-tests:286:300 [0] NCCL INFO transport.cc:87 -> 2
nccl-tests:286:300 [0] NCCL INFO init.cc:804 -> 2
nccl-tests:286:300 [0] NCCL INFO init.cc:941 -> 2
nccl-tests:286:300 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
nccl-tests:286:301 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
nccl-tests:286:301 [1] NCCL INFO include/shm.h:41 -> 2
nccl-tests:286:301 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-e3c757a8fd737043-3-0-1 (size 9637888)
nccl-tests:286:301 [1] NCCL INFO transport/shm.cc:100 -> 2
nccl-tests:286:301 [1] NCCL INFO transport.cc:34 -> 2
nccl-tests:286:301 [1] NCCL INFO transport.cc:87 -> 2
nccl-tests:286:301 [1] NCCL INFO init.cc:804 -> 2
nccl-tests:286:301 [1] NCCL INFO init.cc:941 -> 2
nccl-tests:286:301 [1] NCCL INFO group.cc:72 -> 2 [Async thread]
nccl-tests:286:286 [1] NCCL INFO init.cc:1010 -> 2
nccl-tests: Test NCCL failure common.cu:1017 'unhandled system error'
.. nccl-tests pid 286: Test failure common.cu:925
Turning the IOMMU off is meant to make P2P work again, so you should try removing NCCL_P2P_DISABLE=1.
To solve the shared memory creation issue, you need to make sure /dev/shm has enough space.
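A quick way to see whether /dev/shm is large enough is to compare its free space with the segment size NCCL reported. The sketch below uses the 9637888-byte segment size from the log above. The segment count of 8 is a hypothetical placeholder, since the thread does not say how many segments NCCL creates. The function names are also illustrative, and shm_avail_bytes assumes GNU df.

```shell
# Succeed if $1 free bytes can hold $3 segments of $2 bytes each.
shm_has_room() {
  [ "$1" -ge $(( $2 * $3 )) ]
}

# Free bytes on /dev/shm (assumes GNU df, which supports --output and -B1).
shm_avail_bytes() {
  df --output=avail -B1 /dev/shm | tail -n 1 | tr -d ' '
}

# Usage (8 segments is an assumed count, not a value from NCCL):
#   shm_has_room "$(shm_avail_bytes)" 9637888 8 || echo "/dev/shm is too small"
```

Keeping the comparison separate from the df call makes the check easy to run against numbers copied from another machine.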
Thank you @sjeaugey. After increasing the size of /dev/shm, the NCCL test with NCCL_P2P_DISABLE=1 worked, as shown below:
$ df -h
Filesystem Size Used Avail Use% Mounted on
overlay 1.8T 96G 1.6T 6% /
tmpfs 64M 0 64M 0% /dev
tmpfs 498G 0 498G 0% /sys/fs/cgroup
shm 1.0G 0 1.0G 0% /dev/shm
/dev/nvme0n1p2 1.8T 96G 1.6T 6% /etc/hosts
tmpfs 498G 12K 498G 1% /proc/driver/nvidia
tmpfs 100G 32M 100G 1% /run/nvidia-persistenced/socket
udev 498G 0 498G 0% /dev/nvidia2
$ LD_LIBRARY_PATH=/opt/nccl/build/lib/ CUDA_VISIBLE_DEVICES=2,3 NCCL_P2P_DISABLE=1 NCCL_DEBUG=INFO /opt/nccl-tests/build/all_reduce_perf -b 8 -e 16M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 233 on nccl-tests device 0 [0xc1] NVIDIA RTX A6000
# Rank 1 Pid 233 on nccl-tests device 1 [0xc2] NVIDIA RTX A6000
nccl-tests:233:233 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.15<0>
nccl-tests:233:233 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:233:233 [0] NCCL INFO NET/IB : No device found.
nccl-tests:233:233 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.15<0>
nccl-tests:233:233 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.1
nccl-tests:233:247 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
nccl-tests:233:248 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
nccl-tests:233:247 [0] NCCL INFO Channel 00/04 : 0 1
nccl-tests:233:247 [0] NCCL INFO Channel 01/04 : 0 1
nccl-tests:233:248 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:233:247 [0] NCCL INFO Channel 02/04 : 0 1
nccl-tests:233:247 [0] NCCL INFO Channel 03/04 : 0 1
nccl-tests:233:247 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
nccl-tests:233:247 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:233:248 [1] NCCL INFO Channel 00 : 1[c2000] -> 0[c1000] via direct shared memory
nccl-tests:233:248 [1] NCCL INFO Channel 01 : 1[c2000] -> 0[c1000] via direct shared memory
nccl-tests:233:247 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c2000] via direct shared memory
nccl-tests:233:248 [1] NCCL INFO Channel 02 : 1[c2000] -> 0[c1000] via direct shared memory
nccl-tests:233:247 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c2000] via direct shared memory
nccl-tests:233:248 [1] NCCL INFO Channel 03 : 1[c2000] -> 0[c1000] via direct shared memory
nccl-tests:233:247 [0] NCCL INFO Channel 02 : 0[c1000] -> 1[c2000] via direct shared memory
nccl-tests:233:247 [0] NCCL INFO Channel 03 : 0[c1000] -> 1[c2000] via direct shared memory
nccl-tests:233:247 [0] NCCL INFO Connected all rings
nccl-tests:233:247 [0] NCCL INFO Connected all trees
nccl-tests:233:248 [1] NCCL INFO Connected all rings
nccl-tests:233:247 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
nccl-tests:233:248 [1] NCCL INFO Connected all trees
nccl-tests:233:247 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
nccl-tests:233:248 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
nccl-tests:233:248 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
nccl-tests:233:247 [0] NCCL INFO comm 0x7fec50000f60 rank 0 nranks 2 cudaDev 0 busId c1000 - Init COMPLETE
nccl-tests:233:248 [1] NCCL INFO comm 0x7fec48000f60 rank 1 nranks 2 cudaDev 1 busId c2000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
nccl-tests:233:233 [0] NCCL INFO Launch mode Parallel
8 2 float sum 7.76 0.00 0.00 0e+00 7.79 0.00 0.00 0e+00
16 4 float sum 7.79 0.00 0.00 0e+00 7.75 0.00 0.00 0e+00
32 8 float sum 8.09 0.00 0.00 0e+00 7.87 0.00 0.00 0e+00
64 16 float sum 7.84 0.01 0.01 0e+00 8.26 0.01 0.01 0e+00
128 32 float sum 7.95 0.02 0.02 0e+00 7.91 0.02 0.02 0e+00
256 64 float sum 7.87 0.03 0.03 0e+00 7.85 0.03 0.03 0e+00
512 128 float sum 7.83 0.07 0.07 0e+00 7.81 0.07 0.07 0e+00
1024 256 float sum 8.19 0.13 0.13 0e+00 7.97 0.13 0.13 0e+00
2048 512 float sum 7.98 0.26 0.26 0e+00 7.97 0.26 0.26 0e+00
4096 1024 float sum 8.43 0.49 0.49 0e+00 8.49 0.48 0.48 0e+00
8192 2048 float sum 9.94 0.82 0.82 0e+00 10.11 0.81 0.81 0e+00
16384 4096 float sum 14.17 1.16 1.16 0e+00 13.96 1.17 1.17 0e+00
32768 8192 float sum 20.96 1.56 1.56 0e+00 21.16 1.55 1.55 0e+00
65536 16384 float sum 35.67 1.84 1.84 0e+00 36.83 1.78 1.78 0e+00
131072 32768 float sum 48.87 2.68 2.68 0e+00 48.07 2.73 2.73 0e+00
262144 65536 float sum 87.16 3.01 3.01 0e+00 87.70 2.99 2.99 0e+00
524288 131072 float sum 156.3 3.35 3.35 0e+00 154.1 3.40 3.40 0e+00
1048576 262144 float sum 334.1 3.14 3.14 0e+00 330.3 3.17 3.17 0e+00
2097152 524288 float sum 670.8 3.13 3.13 0e+00 670.2 3.13 3.13 0e+00
4194304 1048576 float sum 1361.9 3.08 3.08 0e+00 1360.8 3.08 3.08 0e+00
8388608 2097152 float sum 2763.9 3.04 3.04 0e+00 2774.1 3.02 3.02 0e+00
16777216 4194304 float sum 5801.7 2.89 2.89 0e+00 5808.1 2.89 2.89 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 1.39588
#
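For reading the result tables: algbw is the message size divided by the elapsed time, and for all-reduce the nccl-tests bus bandwidth is algbw multiplied by 2(n-1)/n, which equals 1 for two ranks. That is why the algbw and busbw columns are identical here. A small sketch (the function name is hypothetical) reproduces the arithmetic:

```shell
# Print algorithm and bus bandwidth in GB/s for an all-reduce.
# $1 = message size in bytes, $2 = time in microseconds, $3 = number of ranks.
allreduce_bw() {
  awk -v bytes="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = bytes / us / 1000          # bytes per microsecond is MB/s; divide by 1000 for GB/s
    busbw = algbw * 2 * (n - 1) / n    # all-reduce correction factor; equals 1 when n = 2
    printf "%.2f %.2f\n", algbw, busbw
  }'
}

# Usage:
#   allreduce_bw 16777216 5801.7 2
```

For the last row of the table above (16777216 bytes in 5801.7 us on 2 ranks) this prints 2.89 2.89, matching the reported values.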
On the other hand, it still did not work through P2P with both IOMMU and ACS disabled in the North Bridge settings. The BIOS configuration was shown in the previous comment.
$ LD_LIBRARY_PATH=/opt/nccl/build/lib/ CUDA_VISIBLE_DEVICES=2,3 NCCL_DEBUG=INFO /opt/nccl-tests/build/all_reduce_perf -b 8 -e 16M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 16777216 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 253 on nccl-tests device 0 [0xc1] NVIDIA RTX A6000
# Rank 1 Pid 253 on nccl-tests device 1 [0xc2] NVIDIA RTX A6000
nccl-tests:253:253 [0] NCCL INFO Bootstrap : Using eno1np0:172.31.45.15<0>
nccl-tests:253:253 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
nccl-tests:253:253 [0] NCCL INFO NET/IB : No device found.
nccl-tests:253:253 [0] NCCL INFO NET/Socket : Using [0]eno1np0:172.31.45.15<0>
nccl-tests:253:253 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.1
nccl-tests:253:268 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-
nccl-tests:253:267 [0] NCCL INFO Channel 00/04 : 0 1
nccl-tests:253:268 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:253:267 [0] NCCL INFO Channel 01/04 : 0 1
nccl-tests:253:267 [0] NCCL INFO Channel 02/04 : 0 1
nccl-tests:253:267 [0] NCCL INFO Channel 03/04 : 0 1
nccl-tests:253:267 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->
nccl-tests:253:267 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff,ffffffff
nccl-tests:253:268 [1] NCCL INFO Channel 00 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:253:268 [1] NCCL INFO Channel 01 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:253:268 [1] NCCL INFO Channel 02 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:253:267 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:253:268 [1] NCCL INFO Channel 03 : 1[c2000] -> 0[c1000] via P2P/direct pointer
nccl-tests:253:267 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:253:267 [0] NCCL INFO Channel 02 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:253:267 [0] NCCL INFO Channel 03 : 0[c1000] -> 1[c2000] via P2P/direct pointer
nccl-tests:253:268 [1] NCCL INFO Connected all rings
nccl-tests:253:267 [0] NCCL INFO Connected all rings
nccl-tests:253:268 [1] NCCL INFO Connected all trees
nccl-tests:253:267 [0] NCCL INFO Connected all trees
nccl-tests:253:268 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
nccl-tests:253:268 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
nccl-tests:253:267 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
nccl-tests:253:267 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
nccl-tests:253:267 [0] NCCL INFO comm 0x7f26b0000f60 rank 0 nranks 2 cudaDev 0 busId c1000 - Init COMPLETE
nccl-tests:253:268 [1] NCCL INFO comm 0x7f26a8000f60 rank 1 nranks 2 cudaDev 1 busId c2000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
nccl-tests:253:253 [0] NCCL INFO Launch mode Parallel
OK, thanks. Aside from adding iommu=pt to the Linux kernel command line, I'm out of ideas. Besides, the fact that only GPUs 2 and 3 have this issue suggests that the configuration might not be the problem. Reseating or swapping cards can sometimes pinpoint the problem to a particular GPU or PCI slot.
Hi NCCL team,
I downloaded the NCCL tests code from GitHub and ran it on a 4-GPU workstation.
The tests on pairs of GPUs (0-1, 0-2, 0-3, 1-2, 1-3) ran normally as expected, but the test always hangs between GPU 2 and GPU 3: CUDA_VISIBLE_DEVICES=2,3 ./all_reduce_perf -b 8 -e 16M -f 2 -g 2
The stack while hanging is as follows:
Outputs of nvidia-smi:
System topology:
Outputs of p2pBandwidthLatencyTest:
Is there a problem with the workstation's hardware, or is it a software error?
Thank you!
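The pairwise tests described above can be scripted so that every GPU pair is covered in one pass. This is a sketch only: gpu_pairs is a hypothetical helper, and the 60-second timeout in the usage comment is an assumption added so that a hanging pair does not block the rest.

```shell
# Print every unordered pair of GPU indices for N GPUs, one "i,j" per line.
gpu_pairs() {
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    j=$((i + 1))
    while [ "$j" -lt "$n" ]; do
      echo "$i,$j"
      j=$((j + 1))
    done
    i=$((i + 1))
  done
}

# Usage:
#   for pair in $(gpu_pairs 4); do
#     echo "== GPUs $pair"
#     CUDA_VISIBLE_DEVICES=$pair timeout 60 ./all_reduce_perf -b 8 -e 16M -f 2 -g 2 \
#       || echo "pair $pair failed or timed out"
#   done
```

On a 4-GPU machine this yields the six pairs 0,1 through 2,3, including the 2,3 pair that hangs in this report.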