JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Test multigpu on CI #2348

Closed · maleadt closed this 2 months ago

giordano commented 2 months ago

I'm getting

```
julia> Pkg.test("CUDA"; test_args=`--gpu=0,1`);
     Testing CUDA
[...]
  [052768ef] CUDA v5.4.0 `https://github.com/JuliaGPU/CUDA.jl#tb/multigpu`
[...]
     Testing Running tests...
┌ Info: System information:
│ CUDA runtime 12.4, artifact installation
│ CUDA driver 12.4
│ NVIDIA driver 550.54.14
│ 
│ CUDA libraries: 
│ - CUBLAS: 12.4.5
│ - CURAND: 10.3.5
│ - CUFFT: 11.2.1
│ - CUSOLVER: 11.6.1
│ - CUSPARSE: 12.3.1
│ - CUPTI: 22.0.0
│ - NVML: 12.0.0+550.54.14
│ 
│ Julia packages: 
│ - CUDA: 5.4.0
│ - CUDA_Driver_jll: 0.8.1+0
│ - CUDA_Runtime_jll: 0.12.1+0
│ 
│ Toolchain:
│ - Julia: 1.10.2
│ - LLVM: 15.0.7
│ 
│ 2 devices:
│   0: NVIDIA A100 80GB PCIe (sm_80, 78.998 GiB / 80.000 GiB available)
└   1: NVIDIA A100 80GB PCIe (sm_80, 79.135 GiB / 80.000 GiB available)
[ Info: Testing using device 0 (NVIDIA A100 80GB PCIe) and 1 (NVIDIA A100 80GB PCIe). To change this, specify the `--gpu` argument to the tests, or set the `CUDA_VISIBLE_DEVICES` environment variable.
```

and looking at btop seems to confirm that only the first device is being used.
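
As the info message notes, device visibility can also be restricted through the environment instead of the `--gpu` flag. A minimal sketch of that alternative, assuming `CUDA_VISIBLE_DEVICES` is set before any CUDA call happens in the session:

```julia
using Pkg

# Expose only devices 0 and 1 to the test run; Pkg.test spawns a
# subprocess that inherits this environment variable, and the CUDA
# driver reads it at initialization time.
ENV["CUDA_VISIBLE_DEVICES"] = "0,1"

Pkg.test("CUDA")
```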

giordano commented 2 months ago

Uhm, perhaps the log is misleading, because one test (and only one) is failing with

```
Some tests did not pass: 408 passed, 0 failed, 1 errored, 0 broken.
base/array: Error During Test at /home/cceamgi/.julia/packages/CUDA/54m3h/test/base/array.jl:842
  Got exception outside of a @test
  ArgumentError: cannot take the GPU address of inaccessible device memory.

  You are trying to use memory from GPU 1 while executing on GPU 0.
  P2P access between these devices is not possible; either switch execution to GPU 1
  by calling `CUDA.device!(1)`, or copy the data to an array allocated on device 0.
  Stacktrace:
    [1] convert(::Type{CuPtr{Float64}}, managed::CUDA.Managed{CUDA.DeviceMemory})
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/memory.jl:540
    [2] unsafe_convert
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:429 [inlined]
    [3] #pointer#1109
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:387 [inlined]
    [4] pointer
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:379 [inlined]
    [5] (::CUDA.var"#1115#1116"{Float64, CuArray{Float64, 2, CUDA.DeviceMemory}, Int64, CuArray{Float64, 2, CUDA.DeviceMemory}, Int64, Int64})()
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:569
    [6] #context!#978
      @ ~/.julia/packages/CUDA/54m3h/lib/cudadrv/state.jl:170 [inlined]
    [7] context!
      @ ~/.julia/packages/CUDA/54m3h/lib/cudadrv/state.jl:165 [inlined]
    [8] unsafe_copyto!(dest::CuArray{Float64, 2, CUDA.DeviceMemory}, doffs::Int64, src::CuArray{Float64, 2, CUDA.DeviceMemory}, soffs::Int64, n::Int64)
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:567
    [9] copyto!
      @ ~/.julia/packages/CUDA/54m3h/src/array.jl:512 [inlined]
   [10] copyto!(dest::CuArray{Float64, 2, CUDA.DeviceMemory}, src::CuArray{Float64, 2, CUDA.DeviceMemory})
      @ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:516
   [11] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:850 [inlined]
   [12] macro expansion
      @ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [13] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:843 [inlined]
   [14] macro expansion
      @ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [15] top-level scope
      @ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:776
   [16] include
      @ ./client.jl:489 [inlined]
   [17] #11
      @ ~/.julia/packages/CUDA/54m3h/test/runtests.jl:87 [inlined]
   [18] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/setup.jl:60 [inlined]
   [19] macro expansion
      @ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [20] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/test/setup.jl:60 [inlined]
   [21] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/src/utilities.jl:35 [inlined]
   [22] macro expansion
      @ ~/.julia/packages/CUDA/54m3h/src/memory.jl:813 [inlined]
   [23] top-level scope
      @ ~/.julia/packages/CUDA/54m3h/test/setup.jl:59
   [24] eval
      @ ./boot.jl:385 [inlined]
   [25] runtests(f::Function, name::String, time_source::Symbol)
      @ Main ~/.julia/packages/CUDA/54m3h/test/setup.jl:71
   [26] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})
      @ Base ./essentials.jl:892
   [27] invokelatest(::Any, ::Any, ::Vararg{Any})
      @ Base ./essentials.jl:889
   [28] (::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}})()
      @ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
   [29] run_work_thunk(thunk::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}}, print_error::Bool)
      @ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
   [30] (::Distributed.var"#109#111"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
      @ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
```

which suggests this is trying to run some code on both GPUs.
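
For anyone hitting this outside the test suite: the error fires when a copy or kernel touches memory that lives on a different device without peer-to-peer access, and the message itself suggests either switching devices with `CUDA.device!` or copying the data over. A minimal sketch of the failure mode and the always-safe workaround of staging through host memory, assuming two visible devices without P2P:

```julia
using CUDA

CUDA.device!(0)
a = CUDA.rand(Float64, 4, 4)        # memory owned by device 0

CUDA.device!(1)
b = CuArray{Float64}(undef, 4, 4)   # memory owned by device 1

# copyto!(b, a) can raise the ArgumentError above, because device 1
# cannot reach device 0's memory when P2P access is unavailable.

# Workaround: stage the data through host memory instead.
copyto!(b, Array(a))
```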

maleadt commented 2 months ago

99% of the tests are only going to be using device 0; the fact that multiple devices are available only enables certain tests that require them. We don't do load balancing over multiple devices or anything (typically the CPU is the bottleneck anyway).
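
In other words, multi-device tests opt in explicitly. A rough illustration of how such a test can gate on the device count (not the actual CUDA.jl test-harness code, just a sketch):

```julia
using CUDA

# Only exercise cross-device behaviour when at least two GPUs are
# visible; everything else stays on the default device 0.
if length(CUDA.devices()) >= 2
    dev0, dev1 = collect(CUDA.devices())[1:2]

    CUDA.device!(dev1)
    x = CUDA.zeros(Float32, 16)   # allocated on the second device
    CUDA.device!(dev0)            # switch back afterwards

    @assert CUDA.device(x) == dev1
end
```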

giordano commented 2 months ago

I restarted the Julia session (I think I messed up the value of `CUDA_VISIBLE_DEVICES` before) and now I get

```
│ 2 devices:
│   0: NVIDIA A100 80GB PCIe (sm_80, 78.998 GiB / 80.000 GiB available)
└   1: NVIDIA A100 80GB PCIe (sm_80, 79.135 GiB / 80.000 GiB available)
[ Info: Testing using device 0 (NVIDIA A100 80GB PCIe) and 1 (NVIDIA A100 80GB PCIe). To change this, specify the `--gpu` argument to the tests, or set the `CUDA_VISIBLE_DEVICES` environment variable.
```

which is more promising, but I still get the test failure above.

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.54%. Comparing base (5dd6bb2) to head (efe63d6). Report is 1 commit behind head on master.

❗ Current head efe63d6 differs from the pull request's most recent head 0d661b7. Consider uploading reports for commit 0d661b7 to get more accurate results.

Additional details and impacted files:

```diff
@@             Coverage Diff             @@
##           master    #2348       +/-   ##
===========================================
- Coverage   71.86%   58.54%   -13.33%     
===========================================
  Files         155      155               
  Lines       15072    14964      -108     
===========================================
- Hits        10832     8760     -2072     
- Misses       4240     6204     +1964     
```


maleadt commented 2 months ago

Pushed a fix for that issue; @giordano can you try again?

giordano commented 2 months ago

```
Test Summary: |  Pass  Broken  Total  Time
  Overall     | 24156       9  24165
    SUCCESS
     Testing CUDA tests passed
```

All green now, thanks!