Closed findmyway closed 1 month ago
After setting NCCL_DEBUG=INFO, I got the following error message:
NCCL WARN Cuda failure 'initialization error'
It seems the original pipeline uses some extra configuration.
I changed LocalPreferences.toml to:
[CUDA_Runtime_jll]
version = "12.3"
[CUDA_Driver_jll]
compat = "false"
Now I can at least initialize NCCL.Communicators, but all collective operations hit the following error:
sum: Error During Test at NCCL.jl/test/runtests.jl:29
Got exception outside of a @test
ArgumentError: cannot take the GPU address of inaccessible device memory.
You are trying to use memory from GPU 0 on GPU 7.
P2P access between these devices is not possible; either switch to GPU 0
by calling `CUDA.device!(0)`, or copy the data to an array allocated on device 7.
The error is triggered when converting a CuArray into a CuPtr. The root cause is that P2P access is not enabled, as checked here:
https://github.com/JuliaGPU/CUDA.jl/blob/d7077da2b7df32f9d0a2bced56511cdd778ab4ed/src/memory.jl#L549
julia> CUDA.peer_access[]
8×8 Matrix{Int64}:
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
-1 0 0 0 0 0 0 0
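To read the matrix above: as far as I can tell from the linked memory.jl, each entry caches the result of trying to enable peer access, with 0 meaning not attempted yet, 1 meaning enabled, and -1 meaning unavailable or failed. That encoding, the row/column orientation, and the helper name `failed_pairs` are my assumptions, not CUDA.jl API. Under those assumptions, a minimal sketch that pulls the failed pairs out of the printed matrix:

```julia
# Peer-access cache as printed in the post.
# Assumption: row = current device id + 1, column = id of the device owning the memory + 1.
peer_access = zeros(Int, 8, 8)
peer_access[8, 1] = -1

# Hypothetical helper: zero-based (current, owner) device pairs marked as failed (-1).
failed_pairs(cache) = [(I[1] - 1, I[2] - 1) for I in findall(==(-1), cache)]

println(failed_pairs(peer_access))
```

The only failed pair is (7, 0), which matches the devices named in the error message.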
However, according to nvidia-smi topo -p2p r, P2P access on my node should be OK:
nvidia-smi topo -p2p r
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X OK OK OK OK OK OK OK
GPU1 OK X OK OK OK OK OK OK
GPU2 OK OK X OK OK OK OK OK
GPU3 OK OK OK X OK OK OK OK
GPU4 OK OK OK OK X OK OK OK
GPU5 OK OK OK OK OK X OK OK
GPU6 OK OK OK OK OK OK X OK
GPU7 OK OK OK OK OK OK OK X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
OK, I missed a very important warning:
NCCL version 2.19.4+cuda12.3
┌ Warning: Enabling peer-to-peer access between CuDevice(7) and CuDevice(0) failed; please file an issue.
│ exception =
│ CUDA error: peer access is already enabled (code 704, ERROR_PEER_ACCESS_ALREADY_ENABLED)
│ Stacktrace:
This is reported from https://github.com/JuliaGPU/CUDA.jl/blob/d7077da2b7df32f9d0a2bced56511cdd778ab4ed/lib/cudadrv/context.jl#L404
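One possible reading of this warning is that error 704 says peer access was already on when CUDA.jl tried to enable it, yet the failure still gets cached as -1, which later blocks the CuArray to CuPtr conversion. The sketch below only models that distinction in plain Julia so it runs without a GPU. `FakeCuError`, `try_enable!`, and the idea of treating 704 as success are hypothetical illustrations, not CUDA.jl's actual code or a confirmed fix.

```julia
# Hypothetical stand-in for a driver error; only the numeric code matters here.
struct FakeCuError <: Exception
    code::Int
end

# Error code taken from the warning in the post.
const ERROR_PEER_ACCESS_ALREADY_ENABLED = 704

# Hypothetical model of a cached enable attempt.
# `enable` is any zero-argument function that may throw a FakeCuError.
function try_enable!(cache::Matrix{Int}, src::Int, dst::Int, enable)
    try
        enable()
        cache[src + 1, dst + 1] = 1
    catch err
        if err isa FakeCuError && err.code == ERROR_PEER_ACCESS_ALREADY_ENABLED
            # Access is already on, so record it as enabled instead of failed.
            cache[src + 1, dst + 1] = 1
        else
            cache[src + 1, dst + 1] = -1
        end
    end
    return cache[src + 1, dst + 1]
end

cache = zeros(Int, 8, 8)
state = try_enable!(cache, 7, 0, () -> throw(FakeCuError(704)))
println(state)
```

With this handling the (7, 0) entry ends up as 1 rather than -1. Whether that is the right behaviour for CUDA.jl itself is a question for the maintainers.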
Do I need to configure anything to pass the test?
(This is a fresh installation based on the pytorch:24.01-py3 image.)