Thanks for your report. Can we get the output with NCCL_DEBUG=INFO, please, and perhaps a gdb backtrace to help us narrow down the failure point?
Here's the output:
mjlbach@node05-ccncluster:~/nccl-tests$ NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
# nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 48937 on node05-ccncluster device 0 [0x1a] TITAN Xp
# Rank 1 Pid 48937 on node05-ccncluster device 1 [0x1b] TITAN Xp
# Rank 2 Pid 48937 on node05-ccncluster device 2 [0x1c] TITAN Xp
# Rank 3 Pid 48937 on node05-ccncluster device 3 [0x1d] TITAN Xp
# Rank 4 Pid 48937 on node05-ccncluster device 4 [0x1e] TITAN Xp
# Rank 5 Pid 48937 on node05-ccncluster device 5 [0x3d] TITAN Xp
# Rank 6 Pid 48937 on node05-ccncluster device 6 [0x3e] TITAN Xp
# Rank 7 Pid 48937 on node05-ccncluster device 7 [0x3f] TITAN Xp
# Rank 8 Pid 48937 on node05-ccncluster device 8 [0x40] TITAN Xp
# Rank 9 Pid 48937 on node05-ccncluster device 9 [0x41] TITAN Xp
node05-ccncluster:48937:48937 [0] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:48937:48937 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:48937:48937 [0] NCCL INFO NET/IB : No device found.
NCCL version 2.4.7+cuda10.0
node05-ccncluster:48937:48937 [9] NCCL INFO nranks 10
node05-ccncluster:48937:48937 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
node05-ccncluster:48937:48937 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
node05-ccncluster:48937:48937 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ff000fff
node05-ccncluster:48937:48937 [3] NCCL INFO Setting affinity for GPU 3 to 0f,ff000fff
node05-ccncluster:48937:48937 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ff000fff
node05-ccncluster:48937:48937 [5] NCCL INFO Setting affinity for GPU 5 to 0f,ff000fff
node05-ccncluster:48937:48937 [6] NCCL INFO Setting affinity for GPU 6 to 0f,ff000fff
node05-ccncluster:48937:48937 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ff000fff
node05-ccncluster:48937:48937 [8] NCCL INFO Setting affinity for GPU 8 to 0f,ff000fff
node05-ccncluster:48937:48937 [9] NCCL INFO Setting affinity for GPU 9 to 0f,ff000fff
node05-ccncluster:48937:48937 [9] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
node05-ccncluster:48937:48937 [9] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7 8 9
node05-ccncluster:48937:48937 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
node05-ccncluster:48937:48937 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/direct pointer
node05-ccncluster:48937:48937 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
node05-ccncluster:48937:48937 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/direct pointer
node05-ccncluster:48937:48937 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via direct shared memory
node05-ccncluster:48937:48937 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/direct pointer
node05-ccncluster:48937:48937 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/direct pointer
node05-ccncluster:48937:48937 [7] NCCL INFO Ring 00 : 7[7] -> 8[8] via P2P/direct pointer
node05-ccncluster:48937:48937 [8] transport/p2p.cc:487 NCCL WARN failed to peer with device 9(=9): 60 peer mapping resources exhausted
node05-ccncluster:48937:48937 [8] NCCL INFO init.cc:339 -> 3
node05-ccncluster:48937:48937 [8] NCCL INFO init.cc:1072 -> 3
node05-ccncluster:48937:48937 [8] NCCL INFO init.cc:1140 -> 3
node05-ccncluster:48937:48937 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:48937:48937 [8] NCCL INFO transport/shm.cc:237 -> 1
node05-ccncluster:48937:48937 [8] NCCL INFO channel.cc:48 -> 1
node05-ccncluster:48937:48937 [8] NCCL INFO init.cc:203 -> 1
node05-ccncluster:48937:48937 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:48937:48937 [8] NCCL INFO transport/shm.cc:229 -> 1
node05-ccncluster:48937:48937 [8] NCCL INFO channel.cc:47 -> 1
node05-ccncluster:48937:48937 [8] NCCL INFO init.cc:203 -> 1
node05-ccncluster:48937:48937 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:48937:48937 [8] NCCL INFO transport/shm.cc:237 -> 1
node05-ccncluster:48937:48937 [8] NCCL INFO channel.cc:48 -> 1
node05-ccncluster:48937:48937 [8] NCCL INFO init.cc:203 -> 1
Segmentation fault (core dumped)
And the gdb backtrace:
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./build/all_reduce_perf...done.
(gdb) run -b 8 -e 128M -f 2 -g 10
Starting program: /home/mjlbach/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
# nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
[New Thread 0x7fffe8ab4700 (LWP 49712)]
# Rank 0 Pid 49697 on node05-ccncluster device 0 [0x1a] TITAN Xp
# Rank 1 Pid 49697 on node05-ccncluster device 1 [0x1b] TITAN Xp
# Rank 2 Pid 49697 on node05-ccncluster device 2 [0x1c] TITAN Xp
# Rank 3 Pid 49697 on node05-ccncluster device 3 [0x1d] TITAN Xp
# Rank 4 Pid 49697 on node05-ccncluster device 4 [0x1e] TITAN Xp
# Rank 5 Pid 49697 on node05-ccncluster device 5 [0x3d] TITAN Xp
# Rank 6 Pid 49697 on node05-ccncluster device 6 [0x3e] TITAN Xp
# Rank 7 Pid 49697 on node05-ccncluster device 7 [0x3f] TITAN Xp
# Rank 8 Pid 49697 on node05-ccncluster device 8 [0x40] TITAN Xp
# Rank 9 Pid 49697 on node05-ccncluster device 9 [0x41] TITAN Xp
[New Thread 0x7fffe7864700 (LWP 49713)]
[New Thread 0x7fffe7063700 (LWP 49714)]
[New Thread 0x7fffe3fff700 (LWP 49715)]
[New Thread 0x7fffe37fe700 (LWP 49716)]
[New Thread 0x7fffe2ffd700 (LWP 49717)]
[New Thread 0x7fffe27fc700 (LWP 49718)]
[New Thread 0x7fffe1ffb700 (LWP 49720)]
[New Thread 0x7fffe17fa700 (LWP 49721)]
[New Thread 0x7fffe0ff9700 (LWP 49722)]
[New Thread 0x7fffc5fff700 (LWP 49723)]
[New Thread 0x7fffc57fe700 (LWP 49724)]
[New Thread 0x7fffc4ffd700 (LWP 49725)]
[New Thread 0x7fff99fff700 (LWP 49726)]
[New Thread 0x7fff997fe700 (LWP 49727)]
[New Thread 0x7fff98ffd700 (LWP 49728)]
[New Thread 0x7fff6dfff700 (LWP 49729)]
[New Thread 0x7fff6d7fe700 (LWP 49730)]
[New Thread 0x7fff6cffd700 (LWP 49731)]
[New Thread 0x7fff41fff700 (LWP 49732)]
[New Thread 0x7fff417fe700 (LWP 49733)]
[New Thread 0x7fff40ffd700 (LWP 49735)]
Thread 1 "all_reduce_perf" received signal SIGSEGV, Segmentation fault.
freeChannel (channel=channel@entry=0x1e65a780, nRanks=<optimized out>) at channel.cc:47
47 channel.cc: No such file or directory.
(gdb) backtrace
#0 freeChannel (channel=channel@entry=0x1e65a780, nRanks=<optimized out>) at channel.cc:47
#1 0x00007ffff1120944 in commFree (comm=0x1e65a780) at init.cc:203
#2 0x00007ffff11244de in ncclCommInitAll (comms=comms@entry=0x547ebd0, ndev=<optimized out>, devlist=<optimized out>) at init.cc:1154
#3 0x0000000000407f33 in run () at common.cu:775
#4 0x0000000000401c1d in main (argc=9, argv=0x7fffffffbde8) at common.cu:694
(gdb)
Can you try setting NCCL_P2P_LEVEL=2?
NCCL_DEBUG=INFO NCCL_P2P_LEVEL=2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
# nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 115545 on node05-ccncluster device 0 [0x1a] TITAN Xp
# Rank 1 Pid 115545 on node05-ccncluster device 1 [0x1b] TITAN Xp
# Rank 2 Pid 115545 on node05-ccncluster device 2 [0x1c] TITAN Xp
# Rank 3 Pid 115545 on node05-ccncluster device 3 [0x1d] TITAN Xp
# Rank 4 Pid 115545 on node05-ccncluster device 4 [0x1e] TITAN Xp
# Rank 5 Pid 115545 on node05-ccncluster device 5 [0x3d] TITAN Xp
# Rank 6 Pid 115545 on node05-ccncluster device 6 [0x3e] TITAN Xp
# Rank 7 Pid 115545 on node05-ccncluster device 7 [0x3f] TITAN Xp
# Rank 8 Pid 115545 on node05-ccncluster device 8 [0x40] TITAN Xp
# Rank 9 Pid 115545 on node05-ccncluster device 9 [0x41] TITAN Xp
node05-ccncluster:115545:115545 [0] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:115545:115545 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:115545:115545 [0] NCCL INFO NET/IB : No device found.
NCCL version 2.4.7+cuda10.0
node05-ccncluster:115545:115545 [9] NCCL INFO nranks 10
node05-ccncluster:115545:115545 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
node05-ccncluster:115545:115545 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
node05-ccncluster:115545:115545 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ff000fff
node05-ccncluster:115545:115545 [3] NCCL INFO Setting affinity for GPU 3 to 0f,ff000fff
node05-ccncluster:115545:115545 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ff000fff
node05-ccncluster:115545:115545 [5] NCCL INFO Setting affinity for GPU 5 to 0f,ff000fff
node05-ccncluster:115545:115545 [6] NCCL INFO Setting affinity for GPU 6 to 0f,ff000fff
node05-ccncluster:115545:115545 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ff000fff
node05-ccncluster:115545:115545 [8] NCCL INFO Setting affinity for GPU 8 to 0f,ff000fff
node05-ccncluster:115545:115545 [9] NCCL INFO Setting affinity for GPU 9 to 0f,ff000fff
node05-ccncluster:115545:115545 [9] NCCL INFO NCCL_P2P_LEVEL set by environment to 2.
node05-ccncluster:115545:115545 [9] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
node05-ccncluster:115545:115545 [9] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7 8 9
node05-ccncluster:115545:115545 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
node05-ccncluster:115545:115545 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/direct pointer
node05-ccncluster:115545:115545 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
node05-ccncluster:115545:115545 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/direct pointer
node05-ccncluster:115545:115545 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via direct shared memory
node05-ccncluster:115545:115545 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/direct pointer
node05-ccncluster:115545:115545 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/direct pointer
node05-ccncluster:115545:115545 [7] NCCL INFO Ring 00 : 7[7] -> 8[8] via P2P/direct pointer
node05-ccncluster:115545:115545 [8] transport/p2p.cc:487 NCCL WARN failed to peer with device 9(=9): 60 peer mapping resources exhausted
node05-ccncluster:115545:115545 [8] NCCL INFO init.cc:339 -> 3
node05-ccncluster:115545:115545 [8] NCCL INFO init.cc:1072 -> 3
node05-ccncluster:115545:115545 [8] NCCL INFO init.cc:1140 -> 3
node05-ccncluster:115545:115545 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:115545:115545 [8] NCCL INFO transport/shm.cc:237 -> 1
node05-ccncluster:115545:115545 [8] NCCL INFO channel.cc:48 -> 1
node05-ccncluster:115545:115545 [8] NCCL INFO init.cc:203 -> 1
node05-ccncluster:115545:115545 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:115545:115545 [8] NCCL INFO transport/shm.cc:229 -> 1
node05-ccncluster:115545:115545 [8] NCCL INFO channel.cc:47 -> 1
node05-ccncluster:115545:115545 [8] NCCL INFO init.cc:203 -> 1
node05-ccncluster:115545:115545 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:115545:115545 [8] NCCL INFO transport/shm.cc:237 -> 1
node05-ccncluster:115545:115545 [8] NCCL INFO channel.cc:48 -> 1
node05-ccncluster:115545:115545 [8] NCCL INFO init.cc:203 -> 1
Segmentation fault (core dumped)
Can you confirm that NCCL_P2P_LEVEL=0 works? (It should -- just making sure we're not missing something.)
If it does, could you recompile NCCL commenting out that part: https://github.com/NVIDIA/nccl/blob/master/src/transport/p2p.cc#L96-L103 and see if it works without NCCL_P2P_LEVEL being set?
It works with NCCL_P2P_LEVEL=0
NCCL_DEBUG=INFO NCCL_P2P_LEVEL=0 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
# nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 132195 on node05-ccncluster device 0 [0x1a] TITAN Xp
# Rank 1 Pid 132195 on node05-ccncluster device 1 [0x1b] TITAN Xp
# Rank 2 Pid 132195 on node05-ccncluster device 2 [0x1c] TITAN Xp
# Rank 3 Pid 132195 on node05-ccncluster device 3 [0x1d] TITAN Xp
# Rank 4 Pid 132195 on node05-ccncluster device 4 [0x1e] TITAN Xp
# Rank 5 Pid 132195 on node05-ccncluster device 5 [0x3d] TITAN Xp
# Rank 6 Pid 132195 on node05-ccncluster device 6 [0x3e] TITAN Xp
# Rank 7 Pid 132195 on node05-ccncluster device 7 [0x3f] TITAN Xp
# Rank 8 Pid 132195 on node05-ccncluster device 8 [0x40] TITAN Xp
# Rank 9 Pid 132195 on node05-ccncluster device 9 [0x41] TITAN Xp
node05-ccncluster:132195:132195 [0] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:132195:132195 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:132195:132195 [0] NCCL INFO NET/IB : No device found.
NCCL version 2.4.7+cuda10.0
node05-ccncluster:132195:132195 [9] NCCL INFO nranks 10
node05-ccncluster:132195:132195 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
node05-ccncluster:132195:132195 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
node05-ccncluster:132195:132195 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ff000fff
node05-ccncluster:132195:132195 [3] NCCL INFO Setting affinity for GPU 3 to 0f,ff000fff
node05-ccncluster:132195:132195 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ff000fff
node05-ccncluster:132195:132195 [5] NCCL INFO Setting affinity for GPU 5 to 0f,ff000fff
node05-ccncluster:132195:132195 [6] NCCL INFO Setting affinity for GPU 6 to 0f,ff000fff
node05-ccncluster:132195:132195 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ff000fff
node05-ccncluster:132195:132195 [8] NCCL INFO Setting affinity for GPU 8 to 0f,ff000fff
node05-ccncluster:132195:132195 [9] NCCL INFO Setting affinity for GPU 9 to 0f,ff000fff
node05-ccncluster:132195:132195 [9] NCCL INFO NCCL_P2P_LEVEL set by environment to 0.
node05-ccncluster:132195:132195 [9] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
node05-ccncluster:132195:132195 [9] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7 8 9
node05-ccncluster:132195:132195 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
node05-ccncluster:132195:132195 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
node05-ccncluster:132195:132195 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via direct shared memory
node05-ccncluster:132195:132195 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via direct shared memory
node05-ccncluster:132195:132195 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via direct shared memory
node05-ccncluster:132195:132195 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via direct shared memory
node05-ccncluster:132195:132195 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via direct shared memory
node05-ccncluster:132195:132195 [7] NCCL INFO Ring 00 : 7[7] -> 8[8] via direct shared memory
node05-ccncluster:132195:132195 [8] NCCL INFO Ring 00 : 8[8] -> 9[9] via direct shared memory
node05-ccncluster:132195:132195 [9] NCCL INFO Ring 00 : 9[9] -> 0[0] via direct shared memory
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
node05-ccncluster:132195:132195 [0] NCCL INFO Launch mode Group/CGMD
8 2 float sum 66.39 0.00 0.00 1e-07 59.17 0.00 0.00 1e-07
16 4 float sum 48.06 0.00 0.00 1e-07 45.48 0.00 0.00 1e-07
32 8 float sum 48.92 0.00 0.00 6e-08 45.11 0.00 0.00 6e-08
64 16 float sum 47.53 0.00 0.00 6e-08 45.16 0.00 0.00 6e-08
128 32 float sum 47.65 0.00 0.00 6e-08 45.19 0.00 0.01 6e-08
256 64 float sum 47.39 0.01 0.01 3e-08 49.23 0.01 0.01 3e-08
512 128 float sum 48.14 0.01 0.02 3e-08 45.13 0.01 0.02 3e-08
1024 256 float sum 49.41 0.02 0.04 2e-07 45.50 0.02 0.04 2e-07
2048 512 float sum 49.76 0.04 0.07 2e-07 45.77 0.04 0.08 2e-07
4096 1024 float sum 48.36 0.08 0.15 2e-07 46.82 0.09 0.16 2e-07
8192 2048 float sum 49.11 0.17 0.30 2e-07 47.18 0.17 0.31 2e-07
16384 4096 float sum 56.36 0.29 0.52 2e-07 56.69 0.29 0.52 2e-07
32768 8192 float sum 91.89 0.36 0.64 2e-07 91.69 0.36 0.64 2e-07
65536 16384 float sum 162.9 0.40 0.72 2e-07 162.9 0.40 0.72 2e-07
131072 32768 float sum 199.2 0.66 1.18 2e-07 200.8 0.65 1.18 2e-07
262144 65536 float sum 294.5 0.89 1.60 2e-07 297.1 0.88 1.59 2e-07
524288 131072 float sum 514.8 1.02 1.83 2e-07 512.5 1.02 1.84 2e-07
1048576 262144 float sum 964.5 1.09 1.96 2e-07 959.0 1.09 1.97 2e-07
2097152 524288 float sum 1844.9 1.14 2.05 2e-07 1849.4 1.13 2.04 2e-07
4194304 1048576 float sum 3614.1 1.16 2.09 2e-07 3611.0 1.16 2.09 2e-07
8388608 2097152 float sum 7127.9 1.18 2.12 2e-07 7131.3 1.18 2.12 2e-07
16777216 4194304 float sum 14088 1.19 2.14 2e-07 14084 1.19 2.14 2e-07
33554432 8388608 float sum 28241 1.19 2.14 2e-07 28231 1.19 2.14 2e-07
67108864 16777216 float sum 56461 1.19 2.14 2e-07 56525 1.19 2.14 2e-07
134217728 33554432 float sum 112597 1.19 2.15 2e-07 112519 1.19 2.15 2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.955939
Unless I compiled it incorrectly (nccl and nccl-tests), I'm still getting the same error:
NCCL_DEBUG=INFO NCCL_P2P_LEVEL=2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
# nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 146785 on node05-ccncluster device 0 [0x1a] TITAN Xp
# Rank 1 Pid 146785 on node05-ccncluster device 1 [0x1b] TITAN Xp
# Rank 2 Pid 146785 on node05-ccncluster device 2 [0x1c] TITAN Xp
# Rank 3 Pid 146785 on node05-ccncluster device 3 [0x1d] TITAN Xp
# Rank 4 Pid 146785 on node05-ccncluster device 4 [0x1e] TITAN Xp
# Rank 5 Pid 146785 on node05-ccncluster device 5 [0x3d] TITAN Xp
# Rank 6 Pid 146785 on node05-ccncluster device 6 [0x3e] TITAN Xp
# Rank 7 Pid 146785 on node05-ccncluster device 7 [0x3f] TITAN Xp
# Rank 8 Pid 146785 on node05-ccncluster device 8 [0x40] TITAN Xp
# Rank 9 Pid 146785 on node05-ccncluster device 9 [0x41] TITAN Xp
node05-ccncluster:146785:146785 [0] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:146785:146785 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:146785:146785 [0] NCCL INFO NET/IB : No device found.
NCCL version 2.4.7+cuda10.0
node05-ccncluster:146785:146785 [9] NCCL INFO nranks 10
node05-ccncluster:146785:146785 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
node05-ccncluster:146785:146785 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
node05-ccncluster:146785:146785 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ff000fff
node05-ccncluster:146785:146785 [3] NCCL INFO Setting affinity for GPU 3 to 0f,ff000fff
node05-ccncluster:146785:146785 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ff000fff
node05-ccncluster:146785:146785 [5] NCCL INFO Setting affinity for GPU 5 to 0f,ff000fff
node05-ccncluster:146785:146785 [6] NCCL INFO Setting affinity for GPU 6 to 0f,ff000fff
node05-ccncluster:146785:146785 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ff000fff
node05-ccncluster:146785:146785 [8] NCCL INFO Setting affinity for GPU 8 to 0f,ff000fff
node05-ccncluster:146785:146785 [9] NCCL INFO Setting affinity for GPU 9 to 0f,ff000fff
node05-ccncluster:146785:146785 [9] NCCL INFO NCCL_P2P_LEVEL set by environment to 2.
node05-ccncluster:146785:146785 [9] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
node05-ccncluster:146785:146785 [9] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7 8 9
node05-ccncluster:146785:146785 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
node05-ccncluster:146785:146785 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/direct pointer
node05-ccncluster:146785:146785 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
node05-ccncluster:146785:146785 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/direct pointer
node05-ccncluster:146785:146785 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via direct shared memory
node05-ccncluster:146785:146785 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/direct pointer
node05-ccncluster:146785:146785 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/direct pointer
node05-ccncluster:146785:146785 [7] NCCL INFO Ring 00 : 7[7] -> 8[8] via P2P/direct pointer
node05-ccncluster:146785:146785 [8] transport/p2p.cc:487 NCCL WARN failed to peer with device 9(=9): 60 peer mapping resources exhausted
node05-ccncluster:146785:146785 [8] NCCL INFO init.cc:339 -> 3
node05-ccncluster:146785:146785 [8] NCCL INFO init.cc:1072 -> 3
node05-ccncluster:146785:146785 [8] NCCL INFO init.cc:1140 -> 3
node05-ccncluster:146785:146785 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:146785:146785 [8] NCCL INFO transport/shm.cc:237 -> 1
node05-ccncluster:146785:146785 [8] NCCL INFO channel.cc:48 -> 1
node05-ccncluster:146785:146785 [8] NCCL INFO init.cc:203 -> 1
node05-ccncluster:146785:146785 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:146785:146785 [8] NCCL INFO transport/shm.cc:229 -> 1
node05-ccncluster:146785:146785 [8] NCCL INFO channel.cc:47 -> 1
node05-ccncluster:146785:146785 [8] NCCL INFO init.cc:203 -> 1
node05-ccncluster:146785:146785 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:146785:146785 [8] NCCL INFO transport/shm.cc:237 -> 1
node05-ccncluster:146785:146785 [8] NCCL INFO channel.cc:48 -> 1
node05-ccncluster:146785:146785 [8] NCCL INFO init.cc:203 -> 1
Segmentation fault (core dumped)
Here are my compile commands to be sure:
mjlbach@node05-ccncluster:~/nccl$ sed -n 95,104p src/transport/p2p.cc
// See if CUDA can do P2P
/* int p2p; */
/* if (cudaDeviceCanAccessPeer(&p2p, myInfo->cudaDev, peerCudaDev) != cudaSuccess) { */
/* INFO(NCCL_INIT|NCCL_P2P,"peer query failed between dev %d(=%d) and dev %d(=%d)", */
/* myInfo->cudaDev, myInfo->nvmlDev, peerCudaDev, peerInfo->nvmlDev); */
/* return ncclSuccess; */
/* } */
/* if (p2p == 0) return ncclSuccess; */
mjlbach@node05-ccncluster:~/nccl$ make -j src.build
...
Archiving objects > /home/mjlbach/nccl/build/obj/collectives/device/colldevice.a
make[2]: Leaving directory '/home/mjlbach/nccl/src/collectives/device'
Linking libnccl.so.2.4.8 > /home/mjlbach/nccl/build/lib/libnccl.so.2.4.8
Archiving libnccl_static.a > /home/mjlbach/nccl/build/lib/libnccl_static.a
/home/mjlbach/nccl/src
make[1]: Leaving directory '/home/mjlbach/nccl/src'
mjlbach@node05-ccncluster:~/nccl$ pwd
/home/mjlbach/nccl
mjlbach@node05-ccncluster:~/nccl-tests$ make NCCL_HOME=/home/mjlbach/nccl/build
make -C src build
make[1]: Entering directory '/home/mjlbach/nccl-tests/src'
Compiling all_reduce.cu > ../build/all_reduce.o
Compiling common.cu > ../build/common.o
Linking ../build/all_reduce.o > ../build/all_reduce_perf
Compiling all_gather.cu > ../build/all_gather.o
Linking ../build/all_gather.o > ../build/all_gather_perf
Compiling broadcast.cu > ../build/broadcast.o
Linking ../build/broadcast.o > ../build/broadcast_perf
Compiling reduce_scatter.cu > ../build/reduce_scatter.o
Linking ../build/reduce_scatter.o > ../build/reduce_scatter_perf
Compiling reduce.cu > ../build/reduce.o
Linking ../build/reduce.o > ../build/reduce_perf
make[1]: Leaving directory '/home/mjlbach/nccl-tests/src'
mjlbach@node05-ccncluster:~/nccl-tests$ NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
# nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 160432 on node05-ccncluster device 0 [0x1a] TITAN Xp
# Rank 1 Pid 160432 on node05-ccncluster device 1 [0x1b] TITAN Xp
# Rank 2 Pid 160432 on node05-ccncluster device 2 [0x1c] TITAN Xp
# Rank 3 Pid 160432 on node05-ccncluster device 3 [0x1d] TITAN Xp
# Rank 4 Pid 160432 on node05-ccncluster device 4 [0x1e] TITAN Xp
# Rank 5 Pid 160432 on node05-ccncluster device 5 [0x3d] TITAN Xp
# Rank 6 Pid 160432 on node05-ccncluster device 6 [0x3e] TITAN Xp
# Rank 7 Pid 160432 on node05-ccncluster device 7 [0x3f] TITAN Xp
# Rank 8 Pid 160432 on node05-ccncluster device 8 [0x40] TITAN Xp
# Rank 9 Pid 160432 on node05-ccncluster device 9 [0x41] TITAN Xp
node05-ccncluster:160432:160432 [0] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:160432:160432 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:160432:160432 [0] NCCL INFO NET/IB : No device found.
NCCL version 2.4.7+cuda10.0
node05-ccncluster:160432:160432 [9] NCCL INFO nranks 10
node05-ccncluster:160432:160432 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
node05-ccncluster:160432:160432 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
node05-ccncluster:160432:160432 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ff000fff
node05-ccncluster:160432:160432 [3] NCCL INFO Setting affinity for GPU 3 to 0f,ff000fff
node05-ccncluster:160432:160432 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ff000fff
node05-ccncluster:160432:160432 [5] NCCL INFO Setting affinity for GPU 5 to 0f,ff000fff
node05-ccncluster:160432:160432 [6] NCCL INFO Setting affinity for GPU 6 to 0f,ff000fff
node05-ccncluster:160432:160432 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ff000fff
node05-ccncluster:160432:160432 [8] NCCL INFO Setting affinity for GPU 8 to 0f,ff000fff
node05-ccncluster:160432:160432 [9] NCCL INFO Setting affinity for GPU 9 to 0f,ff000fff
node05-ccncluster:160432:160432 [9] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
node05-ccncluster:160432:160432 [9] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7 8 9
node05-ccncluster:160432:160432 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
node05-ccncluster:160432:160432 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/direct pointer
node05-ccncluster:160432:160432 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
node05-ccncluster:160432:160432 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/direct pointer
node05-ccncluster:160432:160432 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via direct shared memory
node05-ccncluster:160432:160432 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/direct pointer
node05-ccncluster:160432:160432 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/direct pointer
node05-ccncluster:160432:160432 [7] NCCL INFO Ring 00 : 7[7] -> 8[8] via P2P/direct pointer
node05-ccncluster:160432:160432 [8] transport/p2p.cc:487 NCCL WARN failed to peer with device 9(=9): 60 peer mapping resources exhausted
node05-ccncluster:160432:160432 [8] NCCL INFO init.cc:339 -> 3
node05-ccncluster:160432:160432 [8] NCCL INFO init.cc:1072 -> 3
node05-ccncluster:160432:160432 [8] NCCL INFO init.cc:1140 -> 3
node05-ccncluster:160432:160432 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:160432:160432 [8] NCCL INFO transport/shm.cc:237 -> 1
node05-ccncluster:160432:160432 [8] NCCL INFO channel.cc:48 -> 1
node05-ccncluster:160432:160432 [8] NCCL INFO init.cc:203 -> 1
node05-ccncluster:160432:160432 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:160432:160432 [8] NCCL INFO transport/shm.cc:229 -> 1
node05-ccncluster:160432:160432 [8] NCCL INFO channel.cc:47 -> 1
node05-ccncluster:160432:160432 [8] NCCL INFO init.cc:203 -> 1
node05-ccncluster:160432:160432 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:160432:160432 [8] NCCL INFO transport/shm.cc:237 -> 1
node05-ccncluster:160432:160432 [8] NCCL INFO channel.cc:48 -> 1
node05-ccncluster:160432:160432 [8] NCCL INFO init.cc:203 -> 1
Segmentation fault (core dumped)
Do you have $HOME/nccl/build in your LD_LIBRARY_PATH? Could it be you're not using the version of NCCL you just recompiled?
Can you ldd ./build/all_reduce_perf to make sure we're using the correct libnccl.so?
Actually you are not using it. The new version you recompiled was 2.4.8 and the log shows 2.4.7.
Yeah, I realize that now. Is there a reason why passing NCCL_HOME=/home/mjlbach/nccl/build to make doesn't link correctly? My LD_LIBRARY_PATH is also set to /home/mjlbach/nccl/build
Ok, got it linked correctly:
linux-vdso.so.1 => (0x00007ffed8728000)
libcudart.so.10.0 => /usr/local/cuda-10.0/lib64/libcudart.so.10.0 (0x00007fb35d21e000)
libnccl.so.2 => /home/mjlbach/nccl/build/lib/libnccl.so.2 (0x00007fb3567f5000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb3565d8000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb356256000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb355e8c000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb355c88000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb355a80000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb355777000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb355561000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb35d498000)
# nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 124690 on node05-ccncluster device 0 [0x1a] TITAN Xp
# Rank 1 Pid 124690 on node05-ccncluster device 1 [0x1b] TITAN Xp
# Rank 2 Pid 124690 on node05-ccncluster device 2 [0x1c] TITAN Xp
# Rank 3 Pid 124690 on node05-ccncluster device 3 [0x1d] TITAN Xp
# Rank 4 Pid 124690 on node05-ccncluster device 4 [0x1e] TITAN Xp
# Rank 5 Pid 124690 on node05-ccncluster device 5 [0x3d] TITAN Xp
# Rank 6 Pid 124690 on node05-ccncluster device 6 [0x3e] TITAN Xp
# Rank 7 Pid 124690 on node05-ccncluster device 7 [0x3f] TITAN Xp
# Rank 8 Pid 124690 on node05-ccncluster device 8 [0x40] TITAN Xp
# Rank 9 Pid 124690 on node05-ccncluster device 9 [0x41] TITAN Xp
node05-ccncluster:124690:124690 [0] NCCL INFO Bootstrap : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:124690:124690 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:124690:124690 [0] NCCL INFO NET/IB : No device found.
node05-ccncluster:124690:124690 [0] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
NCCL version 2.4.8+cuda10.0
node05-ccncluster:124690:124690 [9] NCCL INFO nranks 10
node05-ccncluster:124690:124690 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
node05-ccncluster:124690:124690 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
node05-ccncluster:124690:124690 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ff000fff
node05-ccncluster:124690:124690 [3] NCCL INFO Setting affinity for GPU 3 to 0f,ff000fff
node05-ccncluster:124690:124690 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ff000fff
node05-ccncluster:124690:124690 [5] NCCL INFO Setting affinity for GPU 5 to 0f,ff000fff
node05-ccncluster:124690:124690 [6] NCCL INFO Setting affinity for GPU 6 to 0f,ff000fff
node05-ccncluster:124690:124690 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ff000fff
node05-ccncluster:124690:124690 [8] NCCL INFO Setting affinity for GPU 8 to 0f,ff000fff
node05-ccncluster:124690:124690 [9] NCCL INFO Setting affinity for GPU 9 to 0f,ff000fff
node05-ccncluster:124690:124690 [9] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
node05-ccncluster:124690:124690 [9] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7 8 9
node05-ccncluster:124690:124690 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
node05-ccncluster:124690:124690 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/direct pointer
node05-ccncluster:124690:124690 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
node05-ccncluster:124690:124690 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/direct pointer
node05-ccncluster:124690:124690 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via direct shared memory
node05-ccncluster:124690:124690 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/direct pointer
node05-ccncluster:124690:124690 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/direct pointer
node05-ccncluster:124690:124690 [7] NCCL INFO Ring 00 : 7[7] -> 8[8] via P2P/direct pointer
node05-ccncluster:124690:124690 [8] transport/p2p.cc:488 NCCL WARN failed to peer with device 9(=9): 60 peer mapping resources exhausted
node05-ccncluster:124690:124690 [8] NCCL INFO init.cc:340 -> 3
node05-ccncluster:124690:124690 [8] NCCL INFO init.cc:1075 -> 3
node05-ccncluster:124690:124690 [8] NCCL INFO init.cc:1143 -> 3
node05-ccncluster:124690:124690 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:124690:124690 [8] NCCL INFO transport/shm.cc:237 -> 1
node05-ccncluster:124690:124690 [8] NCCL INFO channel.cc:48 -> 1
node05-ccncluster:124690:124690 [8] NCCL INFO init.cc:204 -> 1
node05-ccncluster:124690:124690 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:124690:124690 [8] NCCL INFO transport/shm.cc:229 -> 1
node05-ccncluster:124690:124690 [8] NCCL INFO channel.cc:47 -> 1
node05-ccncluster:124690:124690 [8] NCCL INFO init.cc:204 -> 1
node05-ccncluster:124690:124690 [8] include/shm.h:63 NCCL WARN Cuda failure 'invalid argument'
node05-ccncluster:124690:124690 [8] NCCL INFO transport/shm.cc:237 -> 1
node05-ccncluster:124690:124690 [8] NCCL INFO channel.cc:48 -> 1
node05-ccncluster:124690:124690 [8] NCCL INFO init.cc:204 -> 1
Segmentation fault (core dumped)
Using 2.4.8
Thanks for confirming.
Could you check if you can reproduce the issue with a simple program calling cudaDeviceEnablePeerAccess() for each device on GPUs n-1 and n+1?
#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>

#define CUDACHECK(cmd) do {                         \
  cudaError_t e = cmd;                              \
  if( e != cudaSuccess ) {                          \
    printf("CUDA failure %s:%d '%s'\n",             \
        __FILE__,__LINE__,cudaGetErrorString(e));   \
    exit(1);                                        \
  }                                                 \
} while(0)

int main() {
  int ngpus;
  CUDACHECK(cudaGetDeviceCount(&ngpus));
  for (int g=0; g<ngpus; g++) {
    printf("Enabling P2P for GPU %d\n", g);
    CUDACHECK(cudaSetDevice(g));
    CUDACHECK(cudaDeviceEnablePeerAccess((g-1+ngpus)%ngpus, 0));
    CUDACHECK(cudaDeviceEnablePeerAccess((g+1)%ngpus, 0));
  }
  return 0;
}
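(Assuming the file is saved as test.cu, it should build with just nvcc test.cu; the ./a.out runs below presumably come from that.)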
I'm getting:
mjlbach@node05-ccncluster:~/cuda_test$ ./a.out
Enabling P2P for GPU 0
CUDA failure test.cu:20 'peer mapping resources exhausted'
It works for 9 GPUs, but not for 10:
mjlbach@node05-ccncluster:~/cuda_test$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 ./a.out
Enabling P2P for GPU 0
Enabling P2P for GPU 1
Enabling P2P for GPU 2
Enabling P2P for GPU 3
Enabling P2P for GPU 4
Enabling P2P for GPU 5
Enabling P2P for GPU 6
Enabling P2P for GPU 7
Enabling P2P for GPU 8
mjlbach@node05-ccncluster:~/cuda_test$ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 ./a.out
Enabling P2P for GPU 0
CUDA failure test.cu:20 'peer mapping resources exhausted'
mjlbach@node05-ccncluster:~/cuda_test$
OK. Can you confirm your CPU model? In particular, is it Skylake?
Other than that, can you check that things work correctly when using a separate process per GPU (i.e. compiling the tests with MPI, launching 10 MPI tasks, and not using -g)?
Other idea: could it be you have an old process remaining which did enable P2P access between the first 9 GPUs, so that they are already out of peer mapping resources? The CUDA documentation mentions «each device can support a system-wide maximum of eight peer connections»: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#peer-to-peer-memory-access
nvidia-smi should tell you what is running on the GPUs.
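If it helps narrow that down, here's a rough, untested sketch (not from this thread; it just illustrates the limit) that counts how many peer connections GPU 0 can still open before hitting that eight-connection cap:

#include <cuda_runtime.h>
#include <stdio.h>

int main() {
  int ngpus;
  if (cudaGetDeviceCount(&ngpus) != cudaSuccess) return 1;
  if (cudaSetDevice(0) != cudaSuccess) return 1;   // probe from GPU 0 only
  int enabled = 0;
  for (int peer = 1; peer < ngpus; peer++) {
    int can = 0;
    cudaDeviceCanAccessPeer(&can, 0, peer);
    if (!can) continue;                            // topology doesn't allow P2P to this peer
    cudaError_t e = cudaDeviceEnablePeerAccess(peer, 0);
    if (e != cudaSuccess) {
      // With the documented 8-peer limit (plus anything Xorg or a stale process
      // already mapped), this is where 'peer mapping resources exhausted' should appear.
      printf("GPU 0 -> GPU %d failed: %s (after %d peers enabled)\n",
             peer, cudaGetErrorString(e), enabled);
      break;
    }
    enabled++;
  }
  printf("GPU 0 enabled peer access to %d device(s)\n", enabled);
  return 0;
}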
It's Skylake (Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz). The only thing I have running is xorg.
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2706 G /usr/lib/xorg/Xorg 26MiB |
| 1 2706 G /usr/lib/xorg/Xorg 11MiB |
| 2 2706 G /usr/lib/xorg/Xorg 11MiB |
| 3 2706 G /usr/lib/xorg/Xorg 11MiB |
| 4 2706 G /usr/lib/xorg/Xorg 11MiB |
| 5 2706 G /usr/lib/xorg/Xorg 11MiB |
| 6 2706 G /usr/lib/xorg/Xorg 11MiB |
| 7 2706 G /usr/lib/xorg/Xorg 11MiB |
| 8 2706 G /usr/lib/xorg/Xorg 11MiB |
| 9 2706 G /usr/lib/xorg/Xorg 11MiB |
+-----------------------------------------------------------------------------+
I'm still getting the error with MPI:
(physics3)mjlbach@node05-ccncluster:~/nccl-tests$ bash compile_nccl_test.sh
make -C src build
linux-vdso.so.1 => (0x00007fffa23df000)
libcudart.so.10.0 => /usr/local/cuda-10.0/lib64/libcudart.so.10.0 (0x00007fdce3b7c000)
libmpi.so.12 => /usr/lib/libmpi.so.12 (0x00007fdce38a6000)
libnccl.so.2 => /home/mjlbach/nccl/build/lib/libnccl.so.2 (0x00007fdcdce7d000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fdcdcc60000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fdcdc8de000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fdcdc514000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fdcdc310000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fdcdc108000)
libibverbs.so.1 => /usr/lib/libibverbs.so.1 (0x00007fdcdbef9000)
libopen-rte.so.12 => /usr/lib/libopen-rte.so.12 (0x00007fdcdbc7f000)
libopen-pal.so.13 => /usr/lib/libopen-pal.so.13 (0x00007fdcdb9e2000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fdcdb6d9000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fdcdb4c3000)
/lib64/ld-linux-x86-64.so.2 (0x00007fdce3df6000)
libhwloc.so.5 => /usr/lib/x86_64-linux-gnu/libhwloc.so.5 (0x00007fdcdb289000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fdcdb086000)
libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007fdcdae7b000)
libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007fdcdac71000)
--------------------------------------------------------------------------
[[11192,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: node05-ccncluster
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 44741 on node05-ccncluster device 0 [0x1a] TITAN Xp
# Rank 1 Pid 44742 on node05-ccncluster device 1 [0x1b] TITAN Xp
# Rank 2 Pid 44743 on node05-ccncluster device 2 [0x1c] TITAN Xp
# Rank 3 Pid 44744 on node05-ccncluster device 3 [0x1d] TITAN Xp
# Rank 4 Pid 44746 on node05-ccncluster device 4 [0x1e] TITAN Xp
# Rank 5 Pid 44747 on node05-ccncluster device 5 [0x3d] TITAN Xp
# Rank 6 Pid 44750 on node05-ccncluster device 6 [0x3e] TITAN Xp
# Rank 7 Pid 44753 on node05-ccncluster device 7 [0x3f] TITAN Xp
# Rank 8 Pid 44754 on node05-ccncluster device 8 [0x40] TITAN Xp
# Rank 9 Pid 44755 on node05-ccncluster device 9 [0x41] TITAN Xp
node05-ccncluster:44741:44741 [0] NCCL INFO Bootstrap : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44741:44741 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:44741:44741 [0] NCCL INFO NET/IB : No device found.
node05-ccncluster:44741:44741 [0] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
[node05-ccncluster:44728] 9 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[node05-ccncluster:44728] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
NCCL version 2.4.8+cuda10.0
node05-ccncluster:44744:44744 [3] NCCL INFO Bootstrap : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44744:44744 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:44744:44744 [3] NCCL INFO NET/IB : No device found.
node05-ccncluster:44744:44744 [3] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44742:44742 [1] NCCL INFO Bootstrap : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44742:44742 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:44742:44742 [1] NCCL INFO NET/IB : No device found.
node05-ccncluster:44742:44742 [1] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44753:44753 [7] NCCL INFO Bootstrap : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44753:44753 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:44753:44753 [7] NCCL INFO NET/IB : No device found.
node05-ccncluster:44753:44753 [7] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44754:44754 [8] NCCL INFO Bootstrap : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44754:44754 [8] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:44754:44754 [8] NCCL INFO NET/IB : No device found.
node05-ccncluster:44754:44754 [8] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44755:44755 [9] NCCL INFO Bootstrap : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44755:44755 [9] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:44747:44747 [5] NCCL INFO Bootstrap : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44747:44747 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:44755:44755 [9] NCCL INFO NET/IB : No device found.
node05-ccncluster:44747:44747 [5] NCCL INFO NET/IB : No device found.
node05-ccncluster:44755:44755 [9] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44747:44747 [5] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44750:44750 [6] NCCL INFO Bootstrap : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44750:44750 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:44746:44746 [4] NCCL INFO Bootstrap : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44743:44743 [2] NCCL INFO NET/IB : No device found. [15/6933]
node05-ccncluster:44746:44746 [4] NCCL INFO NET/IB : No device found.
node05-ccncluster:44750:44750 [6] NCCL INFO NET/IB : No device found.
node05-ccncluster:44746:44746 [4] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44750:44750 [6] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44743:44743 [2] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:44741:44805 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
node05-ccncluster:44754:44808 [8] NCCL INFO Setting affinity for GPU 8 to 0f,ff000fff
node05-ccncluster:44750:44811 [6] NCCL INFO Setting affinity for GPU 6 to 0f,ff000fff
node05-ccncluster:44746:44812 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ff000fff
node05-ccncluster:44743:44813 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ff000fff
node05-ccncluster:44741:44805 [0] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7 8 9
node05-ccncluster:44753:44810 [7] NCCL INFO Ring 00 : 7[7] -> 8[8] via P2P/IPC
node05-ccncluster:44754:44808 [8] NCCL INFO Ring 00 : 8[8] -> 9[9] via P2P/IPC
node05-ccncluster:44742:44807 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
node05-ccncluster:44743:44813 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
node05-ccncluster:44750:44811 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/IPC
node05-ccncluster:44744:44806 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
node05-ccncluster:44755:44814 [9] NCCL INFO Ring 00 : 9[9] -> 0[0] via direct shared memory
node05-ccncluster:44746:44812 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via direct shared memory
node05-ccncluster:44754:44808 [8] transport/p2p.cc:576 NCCL WARN failed to open CUDA IPC handle : 60 peer mapping resources exhausted
node05-ccncluster:44754:44808 [8] NCCL INFO init.cc:669 -> 1
node05-ccncluster:44754:44808 [8] NCCL INFO init.cc:815 -> 1
node05-ccncluster:44754:44808 [8] NCCL INFO init.cc:953 -> 1
node05-ccncluster:44754:44808 [8] NCCL INFO misc/group.cc:69 -> 1 [Async thread]
node05-ccncluster: Test NCCL failure common.cu:782 'unhandled cuda error'
node05-ccncluster:44747:44809 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/IPC
node05-ccncluster:44741:44805 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
node05-ccncluster:44755:44814 [9] transport/p2p.cc:606 NCCL WARN failed to open CUDA IPC handle : 60 peer mapping resources exhausted
node05-ccncluster:44755:44814 [9] NCCL INFO init.cc:679 -> 1
node05-ccncluster:44755:44814 [9] NCCL INFO init.cc:815 -> 1
node05-ccncluster:44755:44814 [9] NCCL INFO init.cc:953 -> 1
node05-ccncluster:44755:44814 [9] NCCL INFO misc/group.cc:69 -> 1 [Async thread]
node05-ccncluster: Test NCCL failure common.cu:782 'unhandled cuda error'
node05-ccncluster:44741:44805 [0] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
node05-ccncluster:44747:44809 [5] NCCL INFO comm 0x2c00870 rank 5 nranks 10 cudaDev 5 nvmlDev 5 - Init COMPLETE
node05-ccncluster:44753:44810 [7] NCCL INFO comm 0xd085f0 rank 7 nranks 10 cudaDev 7 nvmlDev 7 - Init COMPLETE
node05-ccncluster:44746:44812 [4] NCCL INFO comm 0x210bff0 rank 4 nranks 10 cudaDev 4 nvmlDev 4 - Init COMPLETE
node05-ccncluster:44743:44813 [2] NCCL INFO comm 0x182a1f0 rank 2 nranks 10 cudaDev 2 nvmlDev 2 - Init COMPLETE
node05-ccncluster:44750:44811 [6] NCCL INFO comm 0x108f790 rank 6 nranks 10 cudaDev 6 nvmlDev 6 - Init COMPLETE
node05-ccncluster:44742:44807 [1] NCCL INFO comm 0x2247970 rank 1 nranks 10 cudaDev 1 nvmlDev 1 - Init COMPLETE
node05-ccncluster:44744:44806 [3] NCCL INFO comm 0x2a92ad0 rank 3 nranks 10 cudaDev 3 nvmlDev 3 - Init COMPLETE
node05-ccncluster:44741:44805 [0] NCCL INFO comm 0x16f8540 rank 0 nranks 10 cudaDev 0 nvmlDev 0 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
node05-ccncluster:44741:44741 [0] NCCL INFO Launch mode Parallel
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[11192,1],8]
Exit code: 3
--------------------------------------------------------------------------
Would it work better if you stopped Xorg (or only ran Xorg on a single GPU)?
It appears running X is the issue. It works now.
OK so it would seem Xorg was indeed enabling p2p between GPUs (even if it probably silently failed to enable it for GPU 9), causing GPUs 0-8 to have all their connections used already and making GPU 8 unable to connect to GPU 9.
Note that by default NCCL does not use p2p across PCI root complexes on Skylake (so it uses CPU memory for 4->5 and 9->0). You might want to set NCCL_P2P_LEVEL=5 to force the use of p2p across root complexes and see if it improves performance (it depends on your BIOS and your system).
It seems like NCCL_P2P_LEVEL=5 reduces performance:
(physics3)mjlbach@node05-ccncluster:~/nccl-tests$ NCCL_P2P_LEVEL=2 NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
# nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 192563 on node05-ccncluster device 0 [0x1a] TITAN Xp
# Rank 1 Pid 192563 on node05-ccncluster device 1 [0x1b] TITAN Xp
# Rank 2 Pid 192563 on node05-ccncluster device 2 [0x1c] TITAN Xp
# Rank 3 Pid 192563 on node05-ccncluster device 3 [0x1d] TITAN Xp
# Rank 4 Pid 192563 on node05-ccncluster device 4 [0x1e] TITAN Xp
# Rank 5 Pid 192563 on node05-ccncluster device 5 [0x3d] TITAN Xp
# Rank 6 Pid 192563 on node05-ccncluster device 6 [0x3e] TITAN Xp
# Rank 7 Pid 192563 on node05-ccncluster device 7 [0x3f] TITAN Xp
# Rank 8 Pid 192563 on node05-ccncluster device 8 [0x40] TITAN Xp
# Rank 9 Pid 192563 on node05-ccncluster device 9 [0x41] TITAN Xp
node05-ccncluster:192563:192563 [0] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:192563:192563 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:192563:192563 [0] NCCL INFO NET/IB : No device found.
NCCL version 2.4.7+cuda10.0
node05-ccncluster:192563:192563 [9] NCCL INFO nranks 10
node05-ccncluster:192563:192563 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
node05-ccncluster:192563:192563 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
node05-ccncluster:192563:192563 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ff000fff
node05-ccncluster:192563:192563 [3] NCCL INFO Setting affinity for GPU 3 to 0f,ff000fff
node05-ccncluster:192563:192563 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ff000fff
node05-ccncluster:192563:192563 [5] NCCL INFO Setting affinity for GPU 5 to 0f,ff000fff
node05-ccncluster:192563:192563 [6] NCCL INFO Setting affinity for GPU 6 to 0f,ff000fff
node05-ccncluster:192563:192563 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ff000fff
node05-ccncluster:192563:192563 [8] NCCL INFO Setting affinity for GPU 8 to 0f,ff000fff
node05-ccncluster:192563:192563 [9] NCCL INFO Setting affinity for GPU 9 to 0f,ff000fff
node05-ccncluster:192563:192563 [9] NCCL INFO NCCL_P2P_LEVEL set by environment to 2.
node05-ccncluster:192563:192563 [9] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
node05-ccncluster:192563:192563 [9] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7 8 9
node05-ccncluster:192563:192563 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
node05-ccncluster:192563:192563 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/direct pointer
node05-ccncluster:192563:192563 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
node05-ccncluster:192563:192563 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/direct pointer
node05-ccncluster:192563:192563 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via direct shared memory
node05-ccncluster:192563:192563 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/direct pointer
node05-ccncluster:192563:192563 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/direct pointer
node05-ccncluster:192563:192563 [7] NCCL INFO Ring 00 : 7[7] -> 8[8] via P2P/direct pointer
node05-ccncluster:192563:192563 [8] NCCL INFO Ring 00 : 8[8] -> 9[9] via P2P/direct pointer
node05-ccncluster:192563:192563 [9] NCCL INFO Ring 00 : 9[9] -> 0[0] via direct shared memory
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
node05-ccncluster:192563:192563 [0] NCCL INFO Launch mode Group/CGMD
8 2 float sum 48.39 0.00 0.00 1e-07 47.03 0.00 0.00 1e-07
16 4 float sum 49.03 0.00 0.00 1e-07 44.93 0.00 0.00 1e-07
32 8 float sum 48.37 0.00 0.00 6e-08 44.79 0.00 0.00 6e-08
64 16 float sum 48.10 0.00 0.00 6e-08 47.18 0.00 0.00 6e-08
128 32 float sum 48.28 0.00 0.00 6e-08 46.70 0.00 0.00 6e-08
256 64 float sum 48.27 0.01 0.01 3e-08 46.44 0.01 0.01 3e-08
512 128 float sum 48.13 0.01 0.02 3e-08 44.97 0.01 0.02 3e-08
1024 256 float sum 48.26 0.02 0.04 2e-07 49.24 0.02 0.04 2e-07
2048 512 float sum 49.33 0.04 0.07 2e-07 48.88 0.04 0.08 2e-07
4096 1024 float sum 49.18 0.08 0.15 2e-07 46.60 0.09 0.16 2e-07
8192 2048 float sum 49.86 0.16 0.30 2e-07 47.65 0.17 0.31 2e-07
16384 4096 float sum 52.40 0.31 0.56 2e-07 52.17 0.31 0.57 2e-07
32768 8192 float sum 75.59 0.43 0.78 2e-07 75.25 0.44 0.78 2e-07
65536 16384 float sum 124.2 0.53 0.95 2e-07 126.3 0.52 0.93 2e-07
131072 32768 float sum 146.7 0.89 1.61 2e-07 146.5 0.89 1.61 2e-07
262144 65536 float sum 181.5 1.44 2.60 2e-07 180.8 1.45 2.61 2e-07
524288 131072 float sum 232.6 2.25 4.06 2e-07 232.5 2.26 4.06 2e-07
1048576 262144 float sum 327.4 3.20 5.76 2e-07 327.3 3.20 5.77 2e-07
2097152 524288 float sum 523.0 4.01 7.22 2e-07 520.5 4.03 7.25 2e-07
4194304 1048576 float sum 932.7 4.50 8.09 2e-07 931.0 4.51 8.11 2e-07
8388608 2097152 float sum 1759.9 4.77 8.58 2e-07 1756.9 4.77 8.59 2e-07
16777216 4194304 float sum 3472.6 4.83 8.70 2e-07 3471.5 4.83 8.70 2e-07
33554432 8388608 float sum 6906.9 4.86 8.74 2e-07 6908.8 4.86 8.74 2e-07
67108864 16777216 float sum 13816 4.86 8.74 2e-07 13785 4.87 8.76 2e-07
134217728 33554432 float sum 27501 4.88 8.78 2e-07 27527 4.88 8.78 2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 3.03331
vs.
(physics3)mjlbach@node05-ccncluster:~/nccl-tests$ NCCL_P2P_LEVEL=5 NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 10
# nThread 1 nGpus 10 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 192642 on node05-ccncluster device 0 [0x1a] TITAN Xp
# Rank 1 Pid 192642 on node05-ccncluster device 1 [0x1b] TITAN Xp
# Rank 2 Pid 192642 on node05-ccncluster device 2 [0x1c] TITAN Xp
# Rank 3 Pid 192642 on node05-ccncluster device 3 [0x1d] TITAN Xp
# Rank 4 Pid 192642 on node05-ccncluster device 4 [0x1e] TITAN Xp
# Rank 5 Pid 192642 on node05-ccncluster device 5 [0x3d] TITAN Xp
# Rank 6 Pid 192642 on node05-ccncluster device 6 [0x3e] TITAN Xp
# Rank 7 Pid 192642 on node05-ccncluster device 7 [0x3f] TITAN Xp
# Rank 8 Pid 192642 on node05-ccncluster device 8 [0x40] TITAN Xp
# Rank 9 Pid 192642 on node05-ccncluster device 9 [0x41] TITAN Xp
node05-ccncluster:192642:192642 [0] NCCL INFO NET/Socket : Using [0]enp96s0f0:10.102.2.200<0> [1]enp134s0:192.168.4.105<0>
node05-ccncluster:192642:192642 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
node05-ccncluster:192642:192642 [0] NCCL INFO NET/IB : No device found.
NCCL version 2.4.7+cuda10.0
node05-ccncluster:192642:192642 [9] NCCL INFO nranks 10
node05-ccncluster:192642:192642 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
node05-ccncluster:192642:192642 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
node05-ccncluster:192642:192642 [2] NCCL INFO Setting affinity for GPU 2 to 0f,ff000fff
node05-ccncluster:192642:192642 [3] NCCL INFO Setting affinity for GPU 3 to 0f,ff000fff
node05-ccncluster:192642:192642 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ff000fff
node05-ccncluster:192642:192642 [5] NCCL INFO Setting affinity for GPU 5 to 0f,ff000fff
node05-ccncluster:192642:192642 [6] NCCL INFO Setting affinity for GPU 6 to 0f,ff000fff
node05-ccncluster:192642:192642 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ff000fff
node05-ccncluster:192642:192642 [8] NCCL INFO Setting affinity for GPU 8 to 0f,ff000fff
node05-ccncluster:192642:192642 [9] NCCL INFO Setting affinity for GPU 9 to 0f,ff000fff
node05-ccncluster:192642:192642 [9] NCCL INFO NCCL_P2P_LEVEL set by environment to 5.
node05-ccncluster:192642:192642 [9] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
node05-ccncluster:192642:192642 [9] NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7 8 9
node05-ccncluster:192642:192642 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
node05-ccncluster:192642:192642 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/direct pointer
node05-ccncluster:192642:192642 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
node05-ccncluster:192642:192642 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/direct pointer
node05-ccncluster:192642:192642 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/direct pointer
node05-ccncluster:192642:192642 [5] NCCL INFO Ring 00 : 5[5] -> 6[6] via P2P/direct pointer
node05-ccncluster:192642:192642 [6] NCCL INFO Ring 00 : 6[6] -> 7[7] via P2P/direct pointer
node05-ccncluster:192642:192642 [7] NCCL INFO Ring 00 : 7[7] -> 8[8] via P2P/direct pointer
node05-ccncluster:192642:192642 [8] NCCL INFO Ring 00 : 8[8] -> 9[9] via P2P/direct pointer
node05-ccncluster:192642:192642 [9] NCCL INFO Ring 00 : 9[9] -> 0[0] via P2P/direct pointer
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
node05-ccncluster:192642:192642 [0] NCCL INFO Launch mode Group/CGMD
8 2 float sum 47.28 0.00 0.00 1e-07 44.82 0.00 0.00 1e-07
16 4 float sum 47.89 0.00 0.00 1e-07 44.65 0.00 0.00 1e-07
32 8 float sum 48.42 0.00 0.00 6e-08 44.46 0.00 0.00 6e-08
64 16 float sum 48.55 0.00 0.00 6e-08 44.71 0.00 0.00 6e-08
128 32 float sum 47.81 0.00 0.00 6e-08 44.65 0.00 0.01 6e-08
256 64 float sum 48.22 0.01 0.01 3e-08 45.00 0.01 0.01 3e-08
512 128 float sum 48.17 0.01 0.02 3e-08 44.84 0.01 0.02 3e-08
1024 256 float sum 48.51 0.02 0.04 2e-07 44.84 0.02 0.04 2e-07
2048 512 float sum 49.02 0.04 0.08 2e-07 44.77 0.05 0.08 2e-07
4096 1024 float sum 48.88 0.08 0.15 2e-07 45.29 0.09 0.16 2e-07
8192 2048 float sum 49.55 0.17 0.30 2e-07 46.29 0.18 0.32 2e-07
16384 4096 float sum 47.99 0.34 0.61 2e-07 45.96 0.36 0.64 2e-07
32768 8192 float sum 48.44 0.68 1.22 2e-07 46.11 0.71 1.28 2e-07
65536 16384 float sum 58.49 1.12 2.02 2e-07 56.48 1.16 2.09 2e-07
131072 32768 float sum 145.6 0.90 1.62 2e-07 145.2 0.90 1.62 2e-07
262144 65536 float sum 218.6 1.20 2.16 2e-07 218.4 1.20 2.16 2e-07
524288 131072 float sum 371.8 1.41 2.54 2e-07 366.9 1.43 2.57 2e-07
1048576 262144 float sum 838.1 1.25 2.25 2e-07 855.9 1.23 2.21 2e-07
2097152 524288 float sum 1775.3 1.18 2.13 2e-07 1756.3 1.19 2.15 2e-07
4194304 1048576 float sum 3672.4 1.14 2.06 2e-07 3694.9 1.14 2.04 2e-07
8388608 2097152 float sum 8279.1 1.01 1.82 2e-07 8578.3 0.98 1.76 2e-07
16777216 4194304 float sum 16272 1.03 1.86 2e-07 16393 1.02 1.84 2e-07
33554432 8388608 float sum 33538 1.00 1.80 2e-07 33174 1.01 1.82 2e-07
67108864 16777216 float sum 68369 0.98 1.77 2e-07 68442 0.98 1.76 2e-07
134217728 33554432 float sum 137732 0.97 1.75 2e-07 138225 0.97 1.75 2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 1.05095
#
Also, strangely, TensorFlow is still throwing an NCCL error:
2019-07-25 09:33:34.093469: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 0 and 9, status: Internal: failed to enable
peer access from 0x7f76d46da2c0 to 0x7f76b4713830: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.105255: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 1 and 9, status: Internal: failed to enable
peer access from 0x7f76cc74aba0 to 0x7f76b4713830: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.115187: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 2 and 9, status: Internal: failed to enable
peer access from 0x7f76b071ae60 to 0x7f76b4713830: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.123710: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 3 and 9, status: Internal: failed to enable
peer access from 0x7f76c870a380 to 0x7f76b4713830: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.134640: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 4 and 9, status: Internal: failed to enable
peer access from 0x7f76c47226a0 to 0x7f76b4713830: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.145270: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 5 and 9, status: Internal: failed to enable
peer access from 0x7f76c0740bc0 to 0x7f76b4713830: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.149014: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 6 and 9, status: Internal: failed to enable
peer access from 0x7f76d072a8e0 to 0x7f76b4713830: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.151168: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 7 and 9, status: Internal: failed to enable
peer access from 0x7f76bc731660 to 0x7f76b4713830: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.151761: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 8 and 9, status: Internal: failed to enable
peer access from 0x7f76b87398c0 to 0x7f76b4713830: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.151981: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 9 and 0, status: Internal: failed to enable
peer access from 0x7f76b4713830 to 0x7f76d46da2c0: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.152197: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 9 and 1, status: Internal: failed to enable
peer access from 0x7f76b4713830 to 0x7f76cc74aba0: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.152413: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 9 and 2, status: Internal: failed to enable
peer access from 0x7f76b4713830 to 0x7f76b071ae60: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.152628: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 9 and 3, status: Internal: failed to enable
peer access from 0x7f76b4713830 to 0x7f76c870a380: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.152846: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 9 and 4, status: Internal: failed to enable
peer access from 0x7f76b4713830 to 0x7f76c47226a0: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.153051: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 9 and 5, status: Internal: failed to enable
peer access from 0x7f76b4713830 to 0x7f76c0740bc0: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.153254: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 9 and 6, status: Internal: failed to enable
peer access from 0x7f76b4713830 to 0x7f76d072a8e0: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.153459: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 9 and 7, status: Internal: failed to enable
peer access from 0x7f76b4713830 to 0x7f76bc731660: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.153664: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1597] Unable to enable peer access between device ordinals 9 and 8, status: Internal: failed to enable
peer access from 0x7f76b4713830 to 0x7f76b87398c0: CUDA_ERROR_TOO_MANY_PEERS: peer mapping resources exhausted
2019-07-25 09:33:34.156023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:1a:00.0
The performance you see with NCCL_P2P_LEVEL=5 is the reason it is not the default: your BIOS does not appear to implement the proper fix for P2P through Skylake root complexes, hence you are getting ~1.5 GB/s instead of ~9 GB/s.
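For reference, here is a minimal sketch (not from the original thread) of setting the same NCCL variables from inside a Python script instead of on the command line; it assumes the variables are set before NCCL is first initialized in the process, and the values simply mirror the `all_reduce_perf` run above:

```python
# Sketch: export the NCCL settings used above before any NCCL initialization.
import os

os.environ["NCCL_DEBUG"] = "INFO"    # print NCCL topology/transport decisions
os.environ["NCCL_P2P_LEVEL"] = "5"   # force P2P even across root complexes (slow here)

# Import TensorFlow only after the environment is set, so NCCL sees the values
# when the first communicator is created.
import tensorflow as tf
print(tf.__version__)
```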
As for TensorFlow, this looks like an issue with the TensorFlow multi-GPU code (not NCCL). Using TensorFlow + Horovod should work at this point.
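A rough illustration of the suggested TensorFlow + Horovod path (a sketch, not taken from the thread; it assumes Horovod was built with NCCL support, TF 1.x-style APIs, and one process per GPU launched via `horovodrun -np 10 python train.py`):

```python
# Hedged sketch of a TensorFlow + Horovod setup; assumes Horovod built against NCCL.
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()  # one process per GPU; NCCL performs the cross-GPU allreduce

# Pin each process to a single local GPU (ordinal = local rank);
# pass `config` to tf.Session / MonitoredTrainingSession when building the graph.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across all 10 GPUs via NCCL.
opt = tf.train.AdamOptimizer(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
```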
Closing this issue as Xorg was found to be the culprit
I can't believe it was Xorg.
I was attempting to use distributed TensorFlow when I noticed I could not add the 10th GPU on my node to a distributed strategy. After running nccl-tests, it appears to be an issue with NCCL.
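For context, a hypothetical sketch of the kind of setup that triggers this (the device names and strategy usage are illustrative, not copied from the reporter's code); creating a distributed strategy over all ten GPUs is when TensorFlow's gpu_device.cc tries to enable peer access between every pair and emits the CUDA_ERROR_TOO_MANY_PEERS warnings shown above:

```python
# Hypothetical reproduction sketch: a MirroredStrategy spanning all ten GPUs.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:%d" % i for i in range(10)]  # GPUs 0-9, including the 10th
)
print("replicas:", strategy.num_replicas_in_sync)
```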