NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Test NCCL failure common.cu:954 'unhandled cuda error' #208

Closed. YingYellow closed this issue 3 months ago

YingYellow commented 3 months ago

Hello, I get this error when I run the following command. Can you help me with it? Thanks!

[Screenshot 2024-04-20 08 40 14]

Here are some log messages:

nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices
Rank 0 Group 0 Pid 1896106 on noodle-4090-0 device 0 [0x16] NVIDIA GeForce RTX 4090
Rank 1 Group 0 Pid 1896106 on noodle-4090-0 device 1 [0x34] NVIDIA GeForce RTX 4090

noodle-4090-0:1896106:1896106 [0] NCCL INFO Bootstrap : Using eno2:10.33.48.72<0>
noodle-4090-0:1896106:1896106 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
noodle-4090-0:1896106:1896106 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
noodle-4090-0:1896106:1896106 [0] NCCL INFO NET/Plugin: Using internal network plugin.
noodle-4090-0:1896106:1896106 [1] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda11.8

noodle-4090-0:1896106:1896121 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

noodle-4090-0:1896106:1896121 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

noodle-4090-0:1896106:1896121 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'

noodle-4090-0:1896106:1896121 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'

noodle-4090-0:1896106:1896121 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
....
noodle-4090-0:1896106:1896120 [0] NCCL INFO init.cc:1516 -> 1
noodle-4090-0:1896106:1896120 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
noodle-4090-0:1896106:1896106 [1] NCCL INFO group.cc:418 -> 1
noodle-4090-0:1896106:1896106 [1] NCCL INFO group.cc:95 -> 1
noodle-4090-0:1896106:1896106 [1] NCCL INFO init.cc:1892 -> 1
noodle-4090-0: Test NCCL failure common.cu:954 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
..
noodle-4090-0 pid 1896106: Test failure common.cu:844

YingYellow commented 3 months ago

Please ignore the comment above. I solved this problem by updating my CUDA version.
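
For anyone who hits the same error: the log above reports both the CUDA version supported by the driver (cudaDriverVersion 12020) and the CUDA version NCCL was built against (2.21.5+cuda11.8). The thread does not say which CUDA version was installed before or after the update, so the following is only a minimal sketch for reading those two numbers out of an NCCL_DEBUG=INFO log so they can be compared with the local toolkit. The helper names decode_cuda_version and parse_nccl_versions are hypothetical, not part of NCCL or nccl-tests. The only outside fact it relies on is that CUDA encodes versions as 1000*major + 10*minor.

```python
import re


def decode_cuda_version(encoded):
    """Decode CUDA's integer version encoding (1000*major + 10*minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def parse_nccl_versions(log_text):
    """Extract driver and NCCL build versions from NCCL_DEBUG=INFO output.

    Hypothetical helper: returns None if either line is missing.
    """
    driver = re.search(r"cudaDriverVersion (\d+)", log_text)
    build = re.search(r"NCCL version ([\d.]+)\+cuda(\d+)\.(\d+)", log_text)
    if not driver or not build:
        return None
    return {
        "driver_cuda": decode_cuda_version(int(driver.group(1))),
        "nccl": build.group(1),
        "nccl_built_for_cuda": (int(build.group(2)), int(build.group(3))),
    }


# Log text copied from the report above.
LOG = (
    "noodle-4090-0:1896106:1896106 [1] NCCL INFO cudaDriverVersion 12020\n"
    "NCCL version 2.21.5+cuda11.8\n"
)

info = parse_nccl_versions(LOG)
print(info)
```

For the log in this issue, the sketch reports a driver supporting CUDA 12.2 and an NCCL build for CUDA 11.8. Whether that difference caused the 'named symbol not found' failures is not established in the thread; the only confirmed fix is that updating CUDA resolved it.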