JuliaGPU / NCCL.jl

A Julia wrapper for the NVIDIA Collective Communications Library.
MIT License

Tests failed #58

Closed · findmyway closed this issue 1 month ago

findmyway commented 2 months ago

Do I need to configure anything to pass the test?

(This is a fresh installation based on the pytorch:24.01-py3 image.)

     Testing Running tests...
┌ Info: CUDA information:
│ CUDA runtime 12.5, artifact installation
│ CUDA driver 12.5
│ NVIDIA driver 535.161.8, originally for CUDA 12.2
│
│ CUDA libraries:
│ - CUBLAS: 12.5.3
│ - CURAND: 10.3.6
│ - CUFFT: 11.2.3
│ - CUSOLVER: 11.6.3
│ - CUSPARSE: 12.5.1
│ - CUPTI: 2024.2.1 (API 23.0.0)
│ - NVML: 12.0.0+535.161.8
│
│ Julia packages:
│ - CUDA: 5.4.3
│ - CUDA_Driver_jll: 0.9.1+1
│ - CUDA_Runtime_jll: 0.14.1+0
│
│ Toolchain:
│ - Julia: 1.10.4
│ - LLVM: 15.0.7
│
│ 8 devices:
│   0: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   1: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   2: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   3: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   4: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   5: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   6: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
└   7: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
[ Info: NCCL version: 2.19.4
Communicator: Error During Test at /....../NCCL.jl/test/runtests.jl:11
  Got exception outside of a @test
  NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
  Stacktrace:
    [1] check
      @ /....../NCCL.jl/src/libnccl.jl:17 [inlined]
    [2] ncclCommInitAll
      @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:34 [inlined]
    [3] Communicators(deviceids::Vector{Int32})
      @ NCCL /....../NCCL.jl/src/communicator.jl:70
    [4] Communicators(devices::CUDA.DeviceIterator)
      @ NCCL /....../NCCL.jl/src/communicator.jl:80
    [5] macro expansion
      @ /....../NCCL.jl/test/runtests.jl:13 [inlined]
    [6] macro expansion
      @ ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [7] macro expansion
      @ /....../NCCL.jl/test/runtests.jl:13 [inlined]
    [8] macro expansion
      @ ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [9] top-level scope
      @ /....../NCCL.jl/test/runtests.jl:11
   [10] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [11] top-level scope
      @ none:6
   [12] eval
      @ ./boot.jl:385 [inlined]
   [13] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [14] _start()
      @ Base ./client.jl:552
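
For reference, the failing call boils down to roughly the following (a minimal sketch of what the test at runtests.jl:13 does, judging from the stack trace above):

using CUDA, NCCL

# Build one NCCL communicator per visible GPU. NCCL.Communicators ends up
# calling ncclCommInitAll, which is where the ncclUnhandledCudaError above
# is raised.
comms = NCCL.Communicators(CUDA.devices())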
findmyway commented 2 months ago

By setting NCCL_DEBUG=INFO, I got the following error message:

NCCL WARN Cuda failure 'initialization error'
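
For reference, a minimal sketch of how to obtain this output (NCCL_DEBUG is an environment variable read by libnccl itself, so it only needs to be set before the tests run):

julia> ENV["NCCL_DEBUG"] = "INFO"   # ask libnccl for INFO-level diagnostics

julia> using Pkg; Pkg.test("NCCL")  # assumes NCCL.jl is in the active environment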
findmyway commented 2 months ago

It seems the original CI pipeline uses some extra configuration:

https://github.com/JuliaGPU/NCCL.jl/blob/e88e2683334dcb7a7a84064ba4e9e54555dbaf15/.buildkite/pipeline.yml#L55-L57

After setting LocalPreferences.toml to

[CUDA_Runtime_jll]
version = "12.3"
[CUDA_Driver_jll]
compat = "false"
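
The same preferences can presumably also be written programmatically via Preferences.jl (a sketch; it assumes CUDA_Runtime_jll and CUDA_Driver_jll are direct dependencies of the active environment):

using Preferences, CUDA_Runtime_jll, CUDA_Driver_jll

# Mirror the LocalPreferences.toml entries above: pin the runtime to 12.3 and
# disable the forward-compatible driver. force=true overwrites existing entries.
set_preferences!(CUDA_Runtime_jll, "version" => "12.3"; force=true)
set_preferences!(CUDA_Driver_jll, "compat" => "false"; force=true)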

I can now at least initialize NCCL.Communicators, but all collective operations hit the following error:

sum: Error During Test at NCCL.jl/test/runtests.jl:29
  Got exception outside of a @test
  ArgumentError: cannot take the GPU address of inaccessible device memory.

  You are trying to use memory from GPU 0 on GPU 7.
  P2P access between these devices is not possible; either switch to GPU 0
  by calling `CUDA.device!(0)`, or copy the data to an array allocated on device 7.
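
For completeness, the two generic remedies suggested by the error message look roughly like this (a sketch only; as the next comments show, neither addresses the actual root cause here):

using CUDA

# Hypothetical stand-in for a send buffer that lives on GPU 0.
a = CUDA.device!(0) do
    CUDA.rand(Float32, 1024)
end

# Remedy 1: switch back to the buffer's own device before taking its address.
CUDA.device!(0)

# Remedy 2: copy the data into an array allocated on device 7 and use that instead.
CUDA.device!(7)
b = CUDA.zeros(Float32, size(a))   # allocated on the now-current device 7
copyto!(b, a)                      # device-to-device copy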
findmyway commented 2 months ago

The error is triggered when converting a CuArray into a CuPtr. The root cause is that, as far as CUDA.jl is concerned, P2P access is not enabled; see the check at

https://github.com/JuliaGPU/CUDA.jl/blob/d7077da2b7df32f9d0a2bced56511cdd778ab4ed/src/memory.jl#L549

and the cached peer-access table below:

julia> CUDA.peer_access[]
8×8 Matrix{Int64}:
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0
 -1  0  0  0  0  0  0  0

However, according to nvidia-smi topo -p2p r, P2P access on my node should be fine:

nvidia-smi topo -p2p r
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
 GPU0   X       OK      OK      OK      OK      OK      OK      OK
 GPU1   OK      X       OK      OK      OK      OK      OK      OK
 GPU2   OK      OK      X       OK      OK      OK      OK      OK
 GPU3   OK      OK      OK      X       OK      OK      OK      OK
 GPU4   OK      OK      OK      OK      X       OK      OK      OK
 GPU5   OK      OK      OK      OK      OK      X       OK      OK
 GPU6   OK      OK      OK      OK      OK      OK      X       OK
 GPU7   OK      OK      OK      OK      OK      OK      OK      X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown

OK, I missed a very important warning in the test output:

NCCL version 2.19.4+cuda12.3
┌ Warning: Enabling peer-to-peer access between CuDevice(7) and CuDevice(0) failed; please file an issue.
│   exception =
│    CUDA error: peer access is already enabled (code 704, ERROR_PEER_ACCESS_ALREADY_ENABLED)
│    Stacktrace:

This warning is emitted from

https://github.com/JuliaGPU/CUDA.jl/blob/d7077da2b7df32f9d0a2bced56511cdd778ab4ed/lib/cudadrv/context.jl#L404

and it explains the -1 entry in CUDA.peer_access above for the device 7 → device 0 pair: the attempt to enable peer access failed with ERROR_PEER_ACCESS_ALREADY_ENABLED (704), the failure was cached, and the later CuArray-to-CuPtr conversion therefore treats P2P between those devices as unavailable.
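
For what it's worth, this failure mode can presumably be mimicked in isolation with CUDA.jl's low-level driver bindings (a sketch using the internal, non-public cuCtxEnablePeerAccess wrapper; the second call is the one that returns error 704, presumably the same situation as here, where peer access had already been enabled, e.g. by NCCL, before CUDA.jl tried to enable it itself):

using CUDA

CUDA.device!(7)                  # make device 7 current
CUDA.context()                   # force creation of device 7's context
peer = CuContext(CuPrimaryContext(CuDevice(0)))  # primary context of device 0
CUDA.cuCtxEnablePeerAccess(peer, 0)   # first call succeeds
CUDA.cuCtxEnablePeerAccess(peer, 0)   # second call: ERROR_PEER_ACCESS_ALREADY_ENABLED (704)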