Hello, I get this error when I run the following command. Could you help me with it? Thanks!
Here are some log messages:
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
Rank 0 Group 0 Pid 1896106 on noodle-4090-0 device 0 [0x16] NVIDIA GeForce RTX 4090
Rank 1 Group 0 Pid 1896106 on noodle-4090-0 device 1 [0x34] NVIDIA GeForce RTX 4090
noodle-4090-0:1896106:1896106 [0] NCCL INFO Bootstrap : Using eno2:10.33.48.72<0>
noodle-4090-0:1896106:1896106 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
noodle-4090-0:1896106:1896106 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
noodle-4090-0:1896106:1896106 [0] NCCL INFO NET/Plugin: Using internal network plugin.
noodle-4090-0:1896106:1896106 [1] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda11.8
noodle-4090-0:1896106:1896121 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
noodle-4090-0:1896106:1896121 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
noodle-4090-0:1896106:1896121 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
noodle-4090-0:1896106:1896121 [1] enqueue.cc:58 NCCL WARN Cuda failure 'named symbol not found'
noodle-4090-0:1896106:1896121 [1] enqueue.cc:47 NCCL WARN Cuda failure 'named symbol not found'
....
noodle-4090-0:1896106:1896120 [0] NCCL INFO init.cc:1516 -> 1
noodle-4090-0:1896106:1896120 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
noodle-4090-0:1896106:1896106 [1] NCCL INFO group.cc:418 -> 1
noodle-4090-0:1896106:1896106 [1] NCCL INFO group.cc:95 -> 1
noodle-4090-0:1896106:1896106 [1] NCCL INFO init.cc:1892 -> 1
noodle-4090-0: Test NCCL failure common.cu:954 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. noodle-4090-0 pid 1896106: Test failure common.cu:844