NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

unhandled cuda error during test #170

Closed mlinmg closed 11 months ago

mlinmg commented 11 months ago

When I run the test with ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2, it gives:

# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   8347 on DESKTOP-VMBL43V device  0 [0x01] NVIDIA GeForce RTX 3090 Ti
#  Rank  1 Group  0 Pid   8347 on DESKTOP-VMBL43V device  1 [0x04] NVIDIA GeForce RTX 3090 Ti
DESKTOP-VMBL43V:8347:8347 [0] NCCL INFO Bootstrap : Using eth0:172.23.125.43<0>
DESKTOP-VMBL43V:8347:8347 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
DESKTOP-VMBL43V:8347:8347 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
DESKTOP-VMBL43V:8347:8347 [1] NCCL INFO cudaDriverVersion 12010
NCCL version 2.18.3+cuda12.1
DESKTOP-VMBL43V:8347:8354 [1] NCCL INFO NET/IB : No device found.
DESKTOP-VMBL43V:8347:8354 [1] NCCL INFO NET/Socket : Using [0]eth0:172.23.125.43<0>
DESKTOP-VMBL43V:8347:8354 [1] NCCL INFO Using network Socket
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO Using network Socket
DESKTOP-VMBL43V:8347:8354 [1] NCCL INFO comm 0x55f193bd70e0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 4000 commId 0x35877405ed0e9b4f - Init START
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO comm 0x55f193b8c210 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x35877405ed0e9b4f - Init START
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO NCCL_IGNORE_DISABLED_P2P set by environment to 1.
DESKTOP-VMBL43V:8347:8354 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
DESKTOP-VMBL43V:8347:8354 [1] NCCL INFO P2P Chunksize set to 131072
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO Channel 00/02 :    0   1
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO Channel 01/02 :    0   1
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO P2P Chunksize set to 131072
DESKTOP-VMBL43V:8347:8354 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
DESKTOP-VMBL43V:8347:8354 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct

DESKTOP-VMBL43V:8347:8353 [0] transport.cc:154 NCCL WARN Cuda failure 'invalid argument'
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO init.cc:1079 -> 1
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO init.cc:1358 -> 1
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO group.cc:65 -> 1 [Async thread]

DESKTOP-VMBL43V:8347:8354 [1] transport.cc:154 NCCL WARN Cuda failure 'invalid argument'
DESKTOP-VMBL43V:8347:8354 [1] NCCL INFO init.cc:1079 -> 1
DESKTOP-VMBL43V:8347:8354 [1] NCCL INFO init.cc:1358 -> 1
DESKTOP-VMBL43V:8347:8354 [1] NCCL INFO group.cc:65 -> 1 [Async thread]
DESKTOP-VMBL43V:8347:8347 [1] NCCL INFO group.cc:406 -> 1
DESKTOP-VMBL43V:8347:8347 [1] NCCL INFO group.cc:96 -> 1
DESKTOP-VMBL43V:8347:8347 [1] NCCL INFO init.cc:1691 -> 1
DESKTOP-VMBL43V: Test NCCL failure common.cu:953 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. DESKTOP-VMBL43V pid 8347: Test failure common.cu:844

Does anyone know why? I'm on WSL 2.

nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.46                 Driver Version: 531.61       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti      On | 00000000:01:00.0  On |                  Off |
| 30%   37C    P0               81W / 480W|    574MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090 Ti      On | 00000000:04:00.0 Off |                  Off |
| 46%   28C    P8               13W / 450W|      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        23      G   /Xwayland                                 N/A      |
|    1   N/A  N/A        23      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

nvcc -V:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
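For readers triaging a similar failure, the key facts in the log above (driver version, NCCL version, transport in use, and the first WARN lines) can be pulled out mechanically. The following is only a rough sketch, not part of nccl-tests or NCCL: the function name summarize and the LOG excerpt are made up for illustration, and the regular expressions assume the line layout seen in this particular log. The one piece of arithmetic relied on is the CUDA version encoding (1000 * major + 10 * minor), so cudaDriverVersion 12010 reads as 12.1.

```python
import re

# Excerpt copied from the log in this report; a real run would read the full output.
LOG = """\
DESKTOP-VMBL43V:8347:8347 [1] NCCL INFO cudaDriverVersion 12010
NCCL version 2.18.3+cuda12.1
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
DESKTOP-VMBL43V:8347:8353 [0] transport.cc:154 NCCL WARN Cuda failure 'invalid argument'
DESKTOP-VMBL43V:8347:8353 [0] NCCL INFO init.cc:1079 -> 1
DESKTOP-VMBL43V:8347:8354 [1] transport.cc:154 NCCL WARN Cuda failure 'invalid argument'
"""

# Patterns are assumptions based on the layout of the lines above.
WARN_RE = re.compile(r"\[(\d+)\] (\S+):(\d+) NCCL WARN (.*)$")
DRIVER_RE = re.compile(r"cudaDriverVersion (\d+)")
NCCL_RE = re.compile(r"^NCCL version (\S+)")
TRANSPORT_RE = re.compile(r" via (\S+)$")


def summarize(log: str) -> dict:
    """Collect versions, transports, and WARN lines from NCCL_DEBUG=INFO output."""
    summary = {"driver": None, "nccl": None, "transports": set(), "warnings": []}
    for line in log.splitlines():
        if m := DRIVER_RE.search(line):
            v = int(m.group(1))
            # CUDA encodes versions as 1000 * major + 10 * minor.
            summary["driver"] = f"{v // 1000}.{(v % 1000) // 10}"
        elif m := NCCL_RE.match(line):
            summary["nccl"] = m.group(1)
        elif m := WARN_RE.search(line):
            index, source_file, source_line, message = m.groups()
            summary["warnings"].append((index, source_file, int(source_line), message))
        elif m := TRANSPORT_RE.search(line):
            summary["transports"].add(m.group(1))
    return summary


if __name__ == "__main__":
    result = summarize(LOG)
    print("driver CUDA version:", result["driver"])
    print("NCCL version:", result["nccl"])
    print("transports:", sorted(result["transports"]))
    for index, source_file, source_line, message in result["warnings"]:
        print(f"[{index}] {source_file}:{source_line}: {message}")
```

In this report, both WARN lines point at transport.cc:154 right after the channels are set up over SHM/direct/direct, which is the detail worth including when asking for help.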

mlinmg commented 11 months ago

Compiling my own build of NCCL fixed the error.