NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

what is cu:990 error? how to solve this problem? #230

Open MAKER-park opened 4 months ago

MAKER-park commented 4 months ago

Thank you for looking into this problem. My workstation spec is:

RTX A4000 x2
WSL2 (Ubuntu 22.04)
cuDNN 8.9

(base) heartlab@DESKTOP-GGBQPHK:~/nccl-tests$ nvidia-smi
Fri Jun 28 05:15:17 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 551.61       CUDA Version: 12.4     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               On  | 00000000:65:00.0 Off |                  Off |
| 41%   37C    P8               6W / 140W |     17MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000               On  | 00000000:B3:00.0  On |                  Off |
| 41%   37C    P8               7W / 140W |    571MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        31      G   /Xwayland                                 N/A      |
|    1   N/A  N/A        31      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

After running make, I ran this command:

(base) heartlab@DESKTOP-GGBQPHK:~/nccl-tests$ mpirun -np 2 --allow-run-as-root -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 64755 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000
# Rank 0 Group 0 Pid 64756 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000

DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO Bootstrap : Using eth0:172.30.81.89<0>
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO Bootstrap : Using eth0:172.30.81.89<0>
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO cudaDriverVersion 12040
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO NCCL version 2.22.3+cuda12.0
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO cudaDriverVersion 12040
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO NCCL version 2.22.3+cuda12.0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO NET/IB : No device found.
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO NET/Socket : Using [0]eth0:172.30.81.89<0> [1]veth9d2d103:fe80::b051:e3ff:febe:607b%veth9d2d103<0>
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Using network Socket
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO NET/IB : No device found.
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO NET/Socket : Using [0]eth0:172.30.81.89<0> [1]veth9d2d103:fe80::b051:e3ff:febe:607b%veth9d2d103<0>
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Using network Socket
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO ncclCommInitRank comm 0x5567a8af3a00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 65000 commId 0x1e41f00635db9132 - Init START
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO comm 0x5567a8af3a00 rank 0 nRanks 1 nNodes 1 localRanks 1 localRank 0 MNNVL 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 00/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 01/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 02/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 03/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 04/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 05/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 06/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 07/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 08/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 09/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 10/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 11/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 12/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 13/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 14/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 15/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 16/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 17/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 18/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 19/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 20/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 21/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 22/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 23/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 24/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 25/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 26/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 27/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 28/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 29/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 30/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Channel 31/32 : 0
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO P2P Chunksize set to 131072

DESKTOP-GGBQPHK:64756:64771 [0] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO include/alloc.h:215 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO channel.cc:42 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO init.cc:544 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO init.cc:1156 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO init.cc:1408 -> 1
DESKTOP-GGBQPHK:64756:64771 [0] NCCL INFO group.cc:70 -> 1 [Async thread]
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO group.cc:420 -> 1
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO group.cc:546 -> 1
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO group.cc:101 -> 1
DESKTOP-GGBQPHK:64756:64756 [0] NCCL INFO init.cc:1761 -> 1
DESKTOP-GGBQPHK: Test NCCL failure common.cu:990 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. DESKTOP-GGBQPHK pid 64756: Test failure common.cu:876
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO ncclCommInitRank comm 0x5567734f9a00 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 65000 commId 0xd904a9f238296abf - Init START
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO comm 0x5567734f9a00 rank 0 nRanks 1 nNodes 1 localRanks 1 localRank 0 MNNVL 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 00/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 01/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 02/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 03/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 04/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 05/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 06/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 07/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 08/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 09/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 10/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 11/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 12/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 13/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 14/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 15/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 16/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 17/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 18/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 19/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 20/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 21/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 22/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 23/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 24/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 25/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 26/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 27/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 28/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 29/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 30/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Channel 31/32 : 0
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO P2P Chunksize set to 131072

DESKTOP-GGBQPHK:64755:64773 [0] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO include/alloc.h:215 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO channel.cc:42 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO init.cc:544 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO init.cc:1156 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO init.cc:1408 -> 1
DESKTOP-GGBQPHK:64755:64773 [0] NCCL INFO group.cc:70 -> 1 [Async thread]
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO group.cc:420 -> 1
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO group.cc:546 -> 1
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO group.cc:101 -> 1
DESKTOP-GGBQPHK:64755:64755 [0] NCCL INFO init.cc:1761 -> 1
DESKTOP-GGBQPHK: Test NCCL failure common.cu:990 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. DESKTOP-GGBQPHK pid 64755: Test failure common.cu:876

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[26406,1],1] Exit code: 3

What is causing this problem? I just want to use TensorFlow with multiple GPUs to reduce VRAM pressure...

AddyLaddy commented 4 months ago

It looks like you're running a single-process test twice, and both processes are using the same device. You need to compile nccl-tests with MPI=1 for this to work.
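The diagnosis above can be read off the log: both processes print `Rank 0`, and both `ncclCommInitRank` lines report `nranks 1`, so each copy of the binary ran alone instead of joining a two-rank job. Below is a minimal sketch of that check as a shell function, assuming the run's output was saved to a file. The name `check_ranks` and its messages are made up for illustration and are not part of nccl-tests.

```shell
# check_ranks LOGFILE
# Counts how many distinct PIDs claim to be rank 0 in nccl-tests output.
# nccl-tests prints one "Rank N Group G Pid P on HOST device D ..." line per rank.
check_ranks() {
  count=$(grep -Eo 'Rank +[0-9]+ Group +[0-9]+ Pid +[0-9]+' "$1" \
    | awk '$2 == 0 { print $6 }' \
    | sort -u \
    | awk 'END { print NR }')
  if [ "$count" -gt 1 ]; then
    echo "non-MPI build suspected: $count processes report Rank 0"
  else
    echo "ranks look distinct"
  fi
}

# Example (not executed here):
#   mpirun -np 2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 > run.log 2>&1
#   check_ranks run.log
```

With an MPI-enabled build, the same `mpirun -np 2 ... -g 1` invocation should instead list Rank 0 and Rank 1 on different devices.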

MAKER-park commented 4 months ago

@AddyLaddy Thank you for the reply!

So you mean that instead of 'make' I should use 'make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl'?

But this command does not work on my system...

Here is my result:

(TF) heartlab@DESKTOP-GGBQPHK:~/nccl-tests$ make MPI=1 MPI_HOME=/usr CUDA_HOME=/usr/local/cuda NCCL_HOME=/home/heartlab/anaconda3/envs/TF
make -C src build BUILDDIR=/home/heartlab/nccl-tests/build
make[1]: Entering directory '/home/heartlab/nccl-tests/src'
Compiling timer.cc > /home/heartlab/nccl-tests/build/timer.o
Compiling /home/heartlab/nccl-tests/build/verifiable/verifiable.o
Compiling all_reduce.cu > /home/heartlab/nccl-tests/build/all_reduce.o
In file included from all_reduce.cu:8:
common.h:14:10: fatal error: mpi.h: No such file or directory
14 | #include "mpi.h"
| ^~~
compilation terminated.
make[1]: [Makefile:94: /home/heartlab/nccl-tests/build/all_reduce.o] Error 1
make[1]: Leaving directory '/home/heartlab/nccl-tests/src'
make: [Makefile:20: src.build] Error 2

I have definitely installed MPI and set up my .bashrc file:

export PATH=/usr/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/openmpi/lib:$LD_LIBRARY_PATH
export C_INCLUDE_PATH=/usr/lib/x86_64-linux-gnu/openmpi/include:/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi:$C_INCLUDE_PATH

But the plain 'make' command works fine. What is going on in my case? Haha...

MAKER-park commented 4 months ago

@AddyLaddy
I found the MPI library location:

make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/ CUDA_HOME=/usr/local/cuda NCCL_HOME=/home/heartlab/anaconda3/envs/TF

make -C src build BUILDDIR=/home/heartlab/nccl-tests/build
make[1]: Entering directory '/home/heartlab/nccl-tests/src'
Compiling timer.cc > /home/heartlab/nccl-tests/build/timer.o
Compiling /home/heartlab/nccl-tests/build/verifiable/verifiable.o
Compiling all_reduce.cu > /home/heartlab/nccl-tests/build/all_reduce.o
Compiling common.cu > /home/heartlab/nccl-tests/build/common.o
Linking /home/heartlab/nccl-tests/build/all_reduce.o > /home/heartlab/nccl-tests/build/all_reduce_perf
Compiling all_gather.cu > /home/heartlab/nccl-tests/build/all_gather.o
Linking /home/heartlab/nccl-tests/build/all_gather.o > /home/heartlab/nccl-tests/build/all_gather_perf
Compiling broadcast.cu > /home/heartlab/nccl-tests/build/broadcast.o
Linking /home/heartlab/nccl-tests/build/broadcast.o > /home/heartlab/nccl-tests/build/broadcast_perf
Compiling reduce_scatter.cu > /home/heartlab/nccl-tests/build/reduce_scatter.o
Linking /home/heartlab/nccl-tests/build/reduce_scatter.o > /home/heartlab/nccl-tests/build/reduce_scatter_perf
Compiling reduce.cu > /home/heartlab/nccl-tests/build/reduce.o
Linking /home/heartlab/nccl-tests/build/reduce.o > /home/heartlab/nccl-tests/build/reduce_perf
Compiling alltoall.cu > /home/heartlab/nccl-tests/build/alltoall.o
Linking /home/heartlab/nccl-tests/build/alltoall.o > /home/heartlab/nccl-tests/build/alltoall_perf
Compiling scatter.cu > /home/heartlab/nccl-tests/build/scatter.o
Linking /home/heartlab/nccl-tests/build/scatter.o > /home/heartlab/nccl-tests/build/scatter_perf
Compiling gather.cu > /home/heartlab/nccl-tests/build/gather.o
Linking /home/heartlab/nccl-tests/build/gather.o > /home/heartlab/nccl-tests/build/gather_perf
Compiling sendrecv.cu > /home/heartlab/nccl-tests/build/sendrecv.o
Linking /home/heartlab/nccl-tests/build/sendrecv.o > /home/heartlab/nccl-tests/build/sendrecv_perf
Compiling hypercube.cu > /home/heartlab/nccl-tests/build/hypercube.o
Linking /home/heartlab/nccl-tests/build/hypercube.o > /home/heartlab/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/home/heartlab/nccl-tests/src'

The compile seems to be done.
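The two builds differ only in `MPI_HOME`: with `/usr` the compiler could not find `mpi.h`, and with `/usr/lib/x86_64-linux-gnu/openmpi/` it could. That is consistent with the Makefile looking for headers under `$MPI_HOME/include`, though that layout is an assumption here rather than something the thread states. Under that assumption, a small helper can pick a workable `MPI_HOME` from a list of candidates. The name `find_mpi_home` is hypothetical.

```shell
# find_mpi_home DIR...
# Prints the first directory that contains include/mpi.h and returns 0.
# Returns 1 if none of the candidates qualifies.
find_mpi_home() {
  for dir in "$@"; do
    if [ -f "$dir/include/mpi.h" ]; then
      echo "$dir"
      return 0
    fi
  done
  return 1
}

# Example (not executed here), using the two paths tried in this thread:
#   MPI_HOME=$(find_mpi_home /usr /usr/lib/x86_64-linux-gnu/openmpi) \
#     && make MPI=1 MPI_HOME="$MPI_HOME" CUDA_HOME=/usr/local/cuda
```

The `C_INCLUDE_PATH` export in `.bashrc` did not help the first build, which suggests the compile step for `.cu` files does not consult it; passing the right `MPI_HOME` to `make` is the more direct route.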

Then I ran this command to test:

mpirun -np 1 ./build/all_reduce_perf -b 8 -e 64M -f 2 -g 2

`# nThread 1 nGpus 2 minBytes 8 maxBytes 67108864 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 137045 on DESKTOP-GGBQPHK device 0 [0x65] NVIDIA RTX A4000
# Rank 1 Group 0 Pid 137045 on DESKTOP-GGBQPHK device 1 [0xb3] NVIDIA RTX A4000

DESKTOP-GGBQPHK:137045:137045 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
DESKTOP-GGBQPHK:137045:137045 [0] NCCL INFO Bootstrap : Using eth0:172.30.81.89<0>
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO cudaDriverVersion 12040
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO NCCL version 2.22.3+cuda12.0
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NET/Socket : Using [0]eth0:172.30.81.89<0>
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO Using network Socket
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO Using network Socket
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO ncclCommInitRank comm 0x563ca4501b40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 65000 commId 0xc573bf65f65fb713 - Init START
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO ncclCommInitRank comm 0x563ca4540ec0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId b3000 commId 0xc573bf65f65fb713 - Init START
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /dev/null
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO NCCL_TOPO_FILE set by environment to /dev/null
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to NVL
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO NCCL_SHM_DISABLE set by environment to 1.
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO comm 0x563ca4501b40 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO Channel 00/02 : 0 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO Channel 01/02 : 0 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO NCCL_BUFFSIZE set by environment to 1048576.
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO P2P Chunksize set to 131072
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO comm 0x563ca4540ec0 rank 1 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO P2P Chunksize set to 131072

DESKTOP-GGBQPHK:137045:137059 [0] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO include/alloc.h:215 -> 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO channel.cc:42 -> 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO init.cc:544 -> 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO init.cc:1156 -> 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO init.cc:1408 -> 1
DESKTOP-GGBQPHK:137045:137059 [0] NCCL INFO group.cc:70 -> 1 [Async thread]

DESKTOP-GGBQPHK:137045:137060 [1] include/alloc.h:123 NCCL WARN Cuda failure 999 'unknown error'
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO include/alloc.h:215 -> 1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO channel.cc:42 -> 1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO init.cc:544 -> 1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO init.cc:1156 -> 1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO init.cc:1408 -> 1
DESKTOP-GGBQPHK:137045:137060 [1] NCCL INFO group.cc:70 -> 1 [Async thread]
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO group.cc:420 -> 1
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO group.cc:546 -> 1
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO group.cc:101 -> 1
DESKTOP-GGBQPHK:137045:137045 [1] NCCL INFO init.cc:1761 -> 1
DESKTOP-GGBQPHK: Test NCCL failure common.cu:990 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. DESKTOP-GGBQPHK pid 137045: Test failure common.cu:876

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[35994,1],0] Exit code: 3

`

The same result came out.

Is this a WSL problem? Or is the RTX A4000 not suitable for multi-GPU training?

AddyLaddy commented 4 months ago

OK, that looks better, but it is still the same CUDA error. I don't know which RTX parts currently support multi-GPU communication. There is also the nvbandwidth tool, which can check CUDA P2P transfers.
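nvbandwidth has to be built from source, so a quicker first look is sketched here with a different tool: `nvidia-smi topo -m`, which prints the connection matrix between GPUs (entries starting with `NV` indicate NVLink; entries such as `PIX`, `PHB`, or `SYS` indicate PCIe paths). This is a substitute for the tool named above, not an equivalent, and whether it reports anything useful under WSL2 is an assumption that has not been verified. The wrapper name `gpu_topology_report` is hypothetical.

```shell
# gpu_topology_report
# Prints the GPU connection matrix if nvidia-smi is available,
# otherwise prints a short notice and returns successfully.
gpu_topology_report() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; skipping topology report"
    return 0
  fi
  nvidia-smi topo -m
}

# Example (not executed here):
#   gpu_topology_report
```

The matrix only describes how the GPUs are wired; it does not show whether peer-to-peer copies actually succeed, which is what nvbandwidth measures.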

kiskra-nvidia commented 4 months ago

NCCL INFO NCCL_P2P_LEVEL set by environment to NVL

The above looks suspect -- as far as I can tell, A4000 does not support NVLink?!

Perhaps the following link is relevant, as it references A4000 and the same error code 999 from cuMemSetAccess: https://forums.developer.nvidia.com/t/rivermax-sdk-example-code-run-failed/255548

Finally, you can probably work around this issue by running with NCCL_CUMEM_ENABLE=0
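The single-process log shows several NCCL settings coming from the environment (`NCCL_P2P_LEVEL`, `NCCL_SHM_DISABLE`, `NCCL_TOPO_FILE`, `NCCL_IB_DISABLE`, `NCCL_BUFFSIZE`, `NCCL_SOCKET_IFNAME`). A minimal sketch for trying the suggested workaround follows: one function lists whatever `NCCL_*` variables are exported, and another runs a command with `NCCL_CUMEM_ENABLE=0` while leaving out `NCCL_P2P_LEVEL`. Both function names are hypothetical, and dropping `NCCL_P2P_LEVEL` is a guess based on the comment above, not a confirmed fix.

```shell
# List every exported NCCL_* variable, one per line, sorted by name.
list_nccl_overrides() {
  env | grep '^NCCL_' | sort
}

# Run a command with NCCL_CUMEM_ENABLE=0 and without NCCL_P2P_LEVEL.
# The subshell keeps the caller's environment untouched.
run_without_cumem() {
  (
    unset NCCL_P2P_LEVEL
    NCCL_CUMEM_ENABLE=0 "$@"
  )
}

# Example (not executed here):
#   list_nccl_overrides
#   run_without_cumem ./build/all_reduce_perf -b 8 -e 64M -f 2 -g 2
```

If the test passes this way, re-adding the other overrides one at a time should show which of them, if any, contributes to the failure.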