kose-y opened 2 months ago
Shouldn't m and n switch depending on trans, just like with other wrappers? https://github.com/JuliaGPU/CUDA.jl/blob/8b54f853be833e6096fdc8a8a87e4c98acac7a5d/lib/cublas/wrappers.jl#L426-L438
Can you add a test that covers the case that doesn't work right now, and works after the change?
No, according to the official cuBLAS documentation, the definitions of m and n for the gemm and gemv interfaces are different.
For gemm interfaces:
- m: number of rows of matrix op(A) and C.
- n: number of columns of matrix op(B) and C.
For gemv:
- m: number of rows of matrix A.
- n: number of columns of matrix A.
For gemv, they don't depend on op.
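To make the contrast concrete, here is a minimal sketch (illustrative only, not the CUDA.jl wrapper code; gemv_dims and gemm_m_n are hypothetical helpers):
# For gemv, m and n always describe A itself; only the expected
# lengths of x and y depend on trans.
function gemv_dims(trans::Char, A::AbstractMatrix)
    m, n = size(A)               # dimensions of A, never swapped
    lenx = trans == 'N' ? n : m  # length of x so that op(A) * x is defined
    leny = trans == 'N' ? m : n  # length of the result y = op(A) * x
    return m, n, lenx, leny
end

# For gemm, by contrast, m and n describe op(A) and C, so they do
# swap with the transpose flag.
function gemm_m_n(transA::Char, A::AbstractMatrix)
    return transA == 'N' ? size(A) : reverse(size(A))
end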
I will try to add some tests this week.
See also the gemv! function: https://github.com/JuliaGPU/CUDA.jl/blob/bbe625bbd92cf2c3a5fde6aec1c940f0a1e2b039/lib/cublas/wrappers.jl#L378-L384
@maleadt A similar bug was found in gemv_batched!, and it's also fixed. Tests have been added now.
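For reference, the batched case follows the same rule per batch entry. A hedged sketch of the kind of checks involved (a hypothetical helper, not the actual wrapper):
# Every A[i] must share the same (m, n); x and y lengths are checked
# against the trans-dependent sides, but m and n themselves come
# straight from A.
function check_gemv_batched_dims(trans::Char, A::Vector{<:AbstractMatrix},
                                 x::Vector{<:AbstractVector}, y::Vector{<:AbstractVector})
    length(A) == length(x) == length(y) ||
        throw(DimensionMismatch("unequal batch sizes"))
    m, n = size(A[1])
    lenx, leny = trans == 'N' ? (n, m) : (m, n)
    for i in eachindex(A)
        size(A[i]) == (m, n) ||
            throw(DimensionMismatch("batch entry $i differs in size"))
        (length(x[i]) == lenx && length(y[i]) == leny) ||
            throw(DimensionMismatch("vector length mismatch in batch entry $i"))
    end
    return m, n
end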
LGTM, let's just ping the original author of these functions: @lpawela
@maleadt What is the status of this PR?
Sorry, this one is waiting on me. I'll have a look within a couple of days.
I have problems launching tests on this patch.
From worker 2: Stacktrace:
From worker 2: [1] throw_api_error(res::CUDA.cudaError_enum)
From worker 2: @ CUDA ~/lib/CUDA.jl/lib/cudadrv/libcuda.jl:30
From worker 2: [2] check
From worker 2: @ ~/lib/CUDA.jl/lib/cudadrv/libcuda.jl:37 [inlined]
From worker 2: [3] cuMemFreeAsync
From worker 2: @ ~/lib/CUDA.jl/lib/utils/call.jl:34 [inlined]
From worker 2: [4] free(mem::CUDA.DeviceMemory; stream::CuStream)
From worker 2: @ CUDA ~/lib/CUDA.jl/lib/cudadrv/memory.jl:87
From worker 2: [5] free
From worker 2: @ ~/lib/CUDA.jl/lib/cudadrv/memory.jl:82 [inlined]
From worker 2: [6] #1102
From worker 2: @ ~/lib/CUDA.jl/src/memory.jl:708 [inlined]
From worker 2: [7] #context!#990
From worker 2: @ ~/lib/CUDA.jl/lib/cudadrv/state.jl:168 [inlined]
From worker 2: [8] context!
From worker 2: @ ~/lib/CUDA.jl/lib/cudadrv/state.jl:163 [inlined]
From worker 2: [9] _pool_free
From worker 2: @ ~/lib/CUDA.jl/src/memory.jl:707 [inlined]
From worker 2: [10] macro expansion
From worker 2: @ ./timing.jl:395 [inlined]
From worker 2: [11] pool_free(managed::CUDA.Managed{CUDA.DeviceMemory})
From worker 2: @ CUDA ~/lib/CUDA.jl/src/memory.jl:689
From worker 2: [12] release(::GPUArrays.RefCounted{CUDA.Managed{CUDA.DeviceMemory}})
From worker 2: @ GPUArrays ~/.julia/packages/GPUArrays/qt4ax/src/host/abstractarray.jl:42
From worker 2: [13] unsafe_free!
From worker 2: @ ~/.julia/packages/GPUArrays/qt4ax/src/host/abstractarray.jl:91 [inlined]
From worker 2: [14] unsafe_free!(xs::CuArray{Float32, 2, CUDA.DeviceMemory})
From worker 2: @ CUDA ~/lib/CUDA.jl/src/array.jl:94
From worker 2: [15] exit
From worker 2: @ ./initdefs.jl:28 [inlined]
From worker 2: [16] exit()
From worker 2: @ Base ./initdefs.jl:29
From worker 2: [17] #invokelatest#2
From worker 2: @ ./essentials.jl:892 [inlined]
From worker 2: [18] invokelatest(::Any)
From worker 2: @ Base ./essentials.jl:889
From worker 2: [19] (::Distributed.var"#118#120"{Distributed.RemoteDoMsg})()
From worker 2: @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:310
From worker 2: [20] run_work_thunk(thunk::Distributed.var"#118#120"{Distributed.RemoteDoMsg}, print_error::Bool)
From worker 2: @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
From worker 2: [21] (::Distributed.var"#117#119"{Distributed.RemoteDoMsg})()
From worker 2: @ Distributed ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:310
From worker 2: WARNING: Error while freeing DeviceMemory(1.562 KiB at 0x0000000302122a00):
From worker 2: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc))
This happens when launching julia --project test/runtests.jl libraries/cublas.
julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.4
NVIDIA driver 550.90.7
CUDA libraries:
- CUBLAS: 12.6.0
- CURAND: 10.3.7
- CUFFT: 11.2.6
- CUSOLVER: 11.6.4
- CUSPARSE: 12.5.2
- CUPTI: 2024.3.0 (API 24.0.0)
- NVML: 12.0.0+550.90.7
Julia packages:
- CUDA: 5.5.0
- CUDA_Driver_jll: 0.10.0+0
- CUDA_Runtime_jll: 0.15.1+0
Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7
1 device:
0: NVIDIA GeForce RTX 3080 (sm_86, 7.857 GiB / 10.000 GiB available)
julia> CUDA.CuError(CUDA.cudaError_enum(0x000002bc))
CuError(CUDA_ERROR_ILLEGAL_ADDRESS)
The changes in this PR seem to be triggering some illegal memory access.
I'm seeing similar issues locally, but I'm having a hard time isolating the problem. Many times the libraries/cublas test suite hangs on this PR, while running only the gemv tests modified here doesn't reproduce the issue.
Actually, some more testing today reveals that the illegal memory access I was seeing locally comes from a different test.
@lpawela I cannot reproduce the isolated failure of the libraries/cublas test suite you are seeing. Is this still the case on the latest version of this PR? Does it reproduce with just the gemv tests from this PR?
using CUDA.CUBLAS, GPUArrays
using CUDA, Test, LinearAlgebra
using Adapt
struct ArrayAdaptor{AT} end
Adapt.adapt_storage(::ArrayAdaptor{AT}, xs::AbstractArray) where {AT} = AT(xs)
test_result(a::Number, b::Number; kwargs...) = ≈(a, b; kwargs...)
test_result(a::Missing, b::Missing; kwargs...) = true
test_result(a::Number, b::Missing; kwargs...) = false
test_result(a::Missing, b::Number; kwargs...) = false
function test_result(a::AbstractArray{T}, b::AbstractArray{T}; kwargs...) where {T<:Number}
≈(collect(a), collect(b); kwargs...)
end
function test_result(a::AbstractArray{T}, b::AbstractArray{T};
kwargs...) where {T<:NTuple{N,<:Number} where {N}}
ET = eltype(T)
≈(reinterpret(ET, collect(a)), reinterpret(ET, collect(b)); kwargs...)
end
function test_result(as::NTuple{N,Any}, bs::NTuple{N,Any}; kwargs...) where {N}
all(zip(as, bs)) do (a, b)
test_result(a, b; kwargs...)
end
end
function compare(f, AT::Type{<:AbstractGPUArray}, xs...; kwargs...)
# copy on the CPU, adapt on the GPU, but keep Ref's
cpu_in = map(x -> isa(x, Base.RefValue) ? x[] : deepcopy(x), xs)
gpu_in = map(x -> isa(x, Base.RefValue) ? x[] : adapt(ArrayAdaptor{AT}(), x), xs)
cpu_out = f(cpu_in...)
gpu_out = f(gpu_in...)
test_result(cpu_out, gpu_out; kwargs...)
end
function compare(f, AT::Type{<:Array}, xs...; kwargs...)
# no need to actually run these tests: we have nothing to compare against,
# and we'll run it on a CPU array anyhow when comparing to a GPU array.
#
# this method exists so that we can at least run the test suite with Array,
# and make sure we cover other tests (that don't call `compare`) too.
return true
end
testf(f, xs...; kwargs...) = compare(f, CuArray, xs...; kwargs...)
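As a quick usage example of this harness (assuming a functional GPU), the following compares A * x evaluated on the CPU against the same product computed on a CuArray, and passes when the results match:
@test testf(*, rand(Float32, 4, 3), rand(Float32, 3))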
m = 20
n = 35
k = 13
@testset for elty in [Float32, Float64, ComplexF32, ComplexF64]
alpha = rand(elty)
beta = rand(elty)
@testset "gemv" begin
@test testf(*, rand(elty, m, n), rand(elty, n))
@test testf(*, transpose(rand(elty, m, n)), rand(elty, m))
@test testf(*, rand(elty, m, n)', rand(elty, m))
x = rand(elty, m)
A = rand(elty, m, m + 1)
y = rand(elty, n)
dx = CuArray(x)
dA = CuArray(A)
dy = CuArray(y)
@test_throws DimensionMismatch mul!(dy, dA, dx)
A = rand(elty, m + 1, m)
dA = CuArray(A)
@test_throws DimensionMismatch mul!(dy, dA, dx)
x = rand(elty, m)
A = rand(elty, n, m)
dx = CuArray(x)
dA = CuArray(A)
alpha = rand(elty)
dy = CUBLAS.gemv('N', alpha, dA, dx)
hy = collect(dy)
@test hy ≈ alpha * A * x
dy = CUBLAS.gemv('N', dA, dx)
hy = collect(dy)
@test hy ≈ A * x
dy = CuArray(y)
dx = CUBLAS.gemv(elty <: Real ? 'T' : 'C', alpha, dA, dy)
hx = collect(dx)
@test hx ≈ alpha * A' * y
end
if CUBLAS.version() >= v"11.9"
@testset "gemv_batched" begin
x = [rand(elty, m) for i=1:10]
A = [rand(elty, n, m) for i=1:10]
y = [rand(elty, n) for i=1:10]
dx = CuArray{elty, 1}[]
dA = CuArray{elty, 2}[]
dy = CuArray{elty, 1}[]
dbad = CuArray{elty, 1}[]
for i=1:length(A)
push!(dA, CuArray(A[i]))
push!(dx, CuArray(x[i]))
push!(dy, CuArray(y[i]))
if i < length(A) - 2
push!(dbad, CuArray(dx[i]))
end
end
@test_throws DimensionMismatch CUBLAS.gemv_batched!('N', alpha, dA, dx, beta, dbad)
CUBLAS.gemv_batched!('N', alpha, dA, dx, beta, dy)
for i=1:length(A)
hy = collect(dy[i])
y[i] = alpha * A[i] * x[i] + beta * y[i]
@test y[i] ≈ hy
end
dy = CuArray{elty, 1}[]
for i=1:length(A)
push!(dy, CuArray(y[i]))
end
CUBLAS.gemv_batched!(elty <: Real ? 'T' : 'C', alpha, dA, dy, beta, dx)
for i=1:length(A)
hx = collect(dx[i])
x[i] = alpha * A[i]' * y[i] + beta * x[i]
@test x[i] ≈ hx
end
end
end
if CUBLAS.version() >= v"11.9"
@testset "gemv_strided_batched" begin
x = rand(elty, m, 10)
A = rand(elty, n, m, 10)
y = rand(elty, n, 10)
bad = rand(elty, m, 10)
dx = CuArray(x)
dA = CuArray(A)
dy = CuArray(y)
dbad = CuArray(bad)
@test_throws DimensionMismatch CUBLAS.gemv_strided_batched!('N', alpha, dA, dx, beta, dbad)
bad = rand(elty, n, 2)
dbad = CuArray(bad)
@test_throws DimensionMismatch CUBLAS.gemv_strided_batched!('N', alpha, dA, dx, beta, dbad)
CUBLAS.gemv_strided_batched!('N', alpha, dA, dx, beta, dy)
for i=1:size(A, 3)
hy = collect(dy[:, i])
y[:, i] = alpha * A[:, :, i] * x[:, i] + beta * y[:, i]
@test y[:, i] ≈ hy
end
dy = CuArray(y)
CUBLAS.gemv_strided_batched!(elty <: Real ? 'T' : 'C', alpha, dA, dy, beta, dx)
for i=1:size(A, 3)
hx = collect(dx[:, i])
x[:, i] = alpha * A[:, :, i]' * y[:, i] + beta * x[:, i]
@test x[:, i] ≈ hx
end
end
end
end
I still get the same error, even on a different machine. The command: julia --project test/runtests.jl libraries/cublas
julia> using CUDA
julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.2
NVIDIA driver 535.183.1
CUDA libraries:
- CUBLAS: 12.6.1
- CURAND: 10.3.7
- CUFFT: 11.2.6
- CUSOLVER: 11.6.4
- CUSPARSE: 12.5.3
- CUPTI: 2024.3.1 (API 24.0.0)
- NVML: 12.0.0+535.183.1
Julia packages:
- CUDA: 5.5.0
- CUDA_Driver_jll: 0.10.1+0
- CUDA_Runtime_jll: 0.15.2+0
Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7
1 device:
0: NVIDIA GeForce RTX 3090 (sm_86, 22.477 GiB / 24.000 GiB available)
Okay, thanks for confirming! Marked as draft until we figure out the exact issue here.
EDIT: does the isolated reproducer also give the same error?
Fix incorrect definition of m and n in gemv_strided_batched!