maleadt closed this 2 months ago
Uhm, perhaps the log is misleading, because one test (and only one) is failing with
Some tests did not pass: 408 passed, 0 failed, 1 errored, 0 broken.
base/array: Error During Test at /home/cceamgi/.julia/packages/CUDA/54m3h/test/base/array.jl:842
Got exception outside of a @test
ArgumentError: cannot take the GPU address of inaccessible device memory.
You are trying to use memory from GPU 1 while executing on GPU 0.
P2P access between these devices is not possible; either switch execution to GPU 1
by calling `CUDA.device!(1)`, or copy the data to an array allocated on device 0.
Stacktrace:
[1] convert(::Type{CuPtr{Float64}}, managed::CUDA.Managed{CUDA.DeviceMemory})
@ CUDA ~/.julia/packages/CUDA/54m3h/src/memory.jl:540
[2] unsafe_convert
@ ~/.julia/packages/CUDA/54m3h/src/array.jl:429 [inlined]
[3] #pointer#1109
@ ~/.julia/packages/CUDA/54m3h/src/array.jl:387 [inlined]
[4] pointer
@ ~/.julia/packages/CUDA/54m3h/src/array.jl:379 [inlined]
[5] (::CUDA.var"#1115#1116"{Float64, CuArray{Float64, 2, CUDA.DeviceMemory}, Int64, CuArray{Float64, 2, CUDA.DeviceMemory}, Int64, Int64})()
@ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:569
[6] #context!#978
@ ~/.julia/packages/CUDA/54m3h/lib/cudadrv/state.jl:170 [inlined]
[7] context!
@ ~/.julia/packages/CUDA/54m3h/lib/cudadrv/state.jl:165 [inlined]
[8] unsafe_copyto!(dest::CuArray{Float64, 2, CUDA.DeviceMemory}, doffs::Int64, src::CuArray{Float64, 2, CUDA.DeviceMemory}, soffs::Int64, n::Int64)
@ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:567
[9] copyto!
@ ~/.julia/packages/CUDA/54m3h/src/array.jl:512 [inlined]
[10] copyto!(dest::CuArray{Float64, 2, CUDA.DeviceMemory}, src::CuArray{Float64, 2, CUDA.DeviceMemory})
@ CUDA ~/.julia/packages/CUDA/54m3h/src/array.jl:516
[11] macro expansion
@ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:850 [inlined]
[12] macro expansion
@ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
[13] macro expansion
@ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:843 [inlined]
[14] macro expansion
@ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
[15] top-level scope
@ ~/.julia/packages/CUDA/54m3h/test/base/array.jl:776
[16] include
@ ./client.jl:489 [inlined]
[17] #11
@ ~/.julia/packages/CUDA/54m3h/test/runtests.jl:87 [inlined]
[18] macro expansion
@ ~/.julia/packages/CUDA/54m3h/test/setup.jl:60 [inlined]
[19] macro expansion
@ ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
[20] macro expansion
@ ~/.julia/packages/CUDA/54m3h/test/setup.jl:60 [inlined]
[21] macro expansion
@ ~/.julia/packages/CUDA/54m3h/src/utilities.jl:35 [inlined]
[22] macro expansion
@ ~/.julia/packages/CUDA/54m3h/src/memory.jl:813 [inlined]
[23] top-level scope
@ ~/.julia/packages/CUDA/54m3h/test/setup.jl:59
[24] eval
@ ./boot.jl:385 [inlined]
[25] runtests(f::Function, name::String, time_source::Symbol)
@ Main ~/.julia/packages/CUDA/54m3h/test/setup.jl:71
[26] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})
@ Base ./essentials.jl:892
[27] invokelatest(::Any, ::Any, ::Vararg{Any})
@ Base ./essentials.jl:889
[28] (::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}})()
@ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
[29] run_work_thunk(thunk::Distributed.var"#110#112"{Distributed.CallMsg{:call_fetch}}, print_error::Bool)
@ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
[30] (::Distributed.var"#109#111"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
@ Distributed ~/.julia/juliaup/julia-1.10.2+0.aarch64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:287
which suggests this is trying to run some code on both GPUs.
99% of the tests only use device 0; having multiple devices available merely enables the few tests that require them. We don't do any load balancing over multiple devices (typically the CPU is the bottleneck anyway).
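For reference, the two remedies suggested by that error message look roughly like this in user code (a minimal sketch assuming two visible devices without P2P access; the array names are made up and not from the test suite):

```julia
using CUDA

# Hypothetical setup: one array per device (not code from the test suite).
device!(0)
a = CUDA.zeros(Float64, 2, 2)   # lives on GPU 0

device!(1)
b = CUDA.ones(Float64, 2, 2)    # lives on GPU 1

# With `device!(0)` active and no P2P between the GPUs, `copyto!(a, b)`
# raises the ArgumentError shown in the stacktrace above.

# Remedy 1: switch execution to the device that owns the memory.
device!(1)
sum(b)                          # runs on GPU 1, where `b` lives

# Remedy 2: stage the data through the host onto the executing device.
b_host = Array(b)               # download GPU 1 -> host
device!(0)
copyto!(a, CuArray(b_host))     # upload host -> GPU 0, then a same-device copy
```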
I restarted the Julia session (I think I messed up the value of CUDA_VISIBLE_DEVICES before) and now I get:
│ 2 devices:
│ 0: NVIDIA A100 80GB PCIe (sm_80, 78.998 GiB / 80.000 GiB available)
└ 1: NVIDIA A100 80GB PCIe (sm_80, 79.135 GiB / 80.000 GiB available)
[ Info: Testing using device 0 (NVIDIA A100 80GB PCIe) and 1 (NVIDIA A100 80GB PCIe). To change this, specify the `--gpu` argument to the tests, or set the `CUDA_VISIBLE_DEVICES` environment variable.
which is more promising, but I still get the test failure above.
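As an aside, if you want to restrict which devices the tests see, the Info message above mentions two knobs. A sketch of both (the exact `--gpu` flag syntax here is an assumption on my part; check test/runtests.jl for the accepted form):

```julia
# Option 1: hide all but the first device. The variable must be set before
# CUDA is initialized; the test subprocess inherits it.
ENV["CUDA_VISIBLE_DEVICES"] = "0"

# Option 2: forward the `--gpu` argument mentioned in the Info message
# (flag syntax assumed; see test/runtests.jl).
using Pkg
Pkg.test("CUDA"; test_args=["--gpu=0"])
```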
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 58.54%. Comparing base (5dd6bb2) to head (efe63d6). Report is 1 commit behind head on master.
:exclamation: Current head efe63d6 differs from pull request most recent head 0d661b7. Consider uploading reports for the commit 0d661b7 to get more accurate results.
Pushed a fix for that issue; @giordano can you try again?
Test Summary: | Pass Broken Total Time
Overall | 24156 9 24165
SUCCESS
Testing CUDA tests passed
All green now, thanks!
I'm getting the failure above, and looking at btop seems to confirm that only the first device is being used.