JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

add cublas<t>getrsBatched #2385

Closed · bjarthur closed this 12 hours ago

bjarthur commented 1 month ago

Uses `getrf_batched` as a template.

Also corrects the README to indicate support for `spmv` and `spr`.

Currently I'm getting an error that I'm having trouble fixing:

julia> using CUDA

julia> A = CUDA.rand(5,5,3);

julia> B = CUDA.rand(5,2);

julia> pivot, _ = CUDA.CUBLAS.getrf_strided_batched!(A, true);

julia> CUDA.CUBLAS.getrs_strided_batched!('N', A, pivot, B)
ArgumentError: cannot take the CPU address of GPU memory.

You are probably falling back to or otherwise calling CPU functionality
with GPU array inputs. This is not supported by regular device memory;
ensure this operation is supported by CUDA.jl, and if it isn't, try to
avoid it or rephrase it in terms of supported operations. Alternatively,
you can consider using GPU arrays backed by unified memory by
allocating using `cu(...; unified=true)`.
Stacktrace:
  [1] convert(::Type{Ptr{Int32}}, managed::CUDA.Managed{CUDA.DeviceMemory})
    @ CUDA /groups/scicompsoft/home/arthurb/.julia/dev/CUDA/src/memory.jl:573
  [2] unsafe_convert(typ::Type{Ptr{Int32}}, x::CuArray{Int32, 1, CUDA.DeviceMemory})
    @ CUDA /groups/scicompsoft/home/arthurb/.julia/dev/CUDA/src/array.jl:432
  [3] macro expansion
    @ /groups/scicompsoft/home/arthurb/.julia/dev/CUDA/lib/utils/call.jl:215 [inlined]
  [4] macro expansion
    @ /groups/scicompsoft/home/arthurb/.julia/dev/CUDA/lib/cublas/libcublas.jl:5274 [inlined]
  [5] #990
    @ /groups/scicompsoft/home/arthurb/.julia/dev/CUDA/lib/utils/call.jl:35 [inlined]
  [6] retry_reclaim
    @ /groups/scicompsoft/home/arthurb/.julia/dev/CUDA/src/memory.jl:434 [inlined]
  [7] check
    @ /groups/scicompsoft/home/arthurb/.julia/dev/CUDA/lib/cublas/libcublas.jl:24 [inlined]
  [8] cublasSgetrsBatched
    @ /groups/scicompsoft/home/arthurb/.julia/dev/CUDA/lib/utils/call.jl:34 [inlined]
  [9] getrs_batched!(trans::Char, n::Int64, nrhs::Int64, Aptrs::CuArray{CuPtr{Float32}, 1, CUDA.DeviceMemory}, lda::Int64, pivotArray::CuPtr{Int32}, Bptrs::CuArray{CuPtr{Float32}, 1, CUDA.DeviceMemory}, ldb::Int64)
    @ CUDA.CUBLAS /groups/scicompsoft/home/arthurb/.julia/dev/CUDA/lib/cublas/wrappers.jl:1900
 [10] getrs_strided_batched!(trans::Char, A::CuArray{Float32, 3, CUDA.DeviceMemory}, pivotArray::CuArray{Int32, 2, CUDA.DeviceMemory}, B::CuArray{Float32, 2, CUDA.DeviceMemory})
    @ CUDA.CUBLAS /groups/scicompsoft/home/arthurb/.julia/dev/CUDA/lib/cublas/wrappers.jl:1951
 [11] top-level scope
    @ REPL[5]:1
bjarthur commented 1 month ago

Curious that `info` for `cublasSgetrfBatched` is a `CuPtr{Cint}` (code), whereas that for `cublasSgetrsBatched` is a `Ptr{Cint}` (code). That's the source of the problem above, but after fixing it I now get:

WARNING: Error while freeing DeviceMemory(12 bytes at 0x0000000402000600):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc))
maleadt commented 1 month ago

Thanks for the PR!

> curious that info for cublasSgetrfBatched is a CuPtr{Cint} (code), whereas that for cublasSgetrsBatched is a Ptr{Cint} (code).

According to https://docs.nvidia.com/cuda/cublas/index.html?highlight=cublasSgetrsBatched#cublas-t-getrsbatched, info is host memory, so should be a CPU pointer.
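
To make the distinction concrete, here is a minimal sketch of what the two `info` arguments look like from the Julia side, assuming the documented cuBLAS signatures (variable names are hypothetical, not the actual CUDA.jl wrapper code):

```julia
using CUDA

batchsize = 3

# cublasSgetrfBatched: infoArray lives in *device* memory,
# one status entry per matrix in the batch -> CuPtr{Cint} in the ccall
infoarray = CuArray{Cint}(undef, batchsize)

# cublasSgetrsBatched: info is a single status value in *host* memory
# -> Ptr{Cint} in the ccall; a Ref on the Julia side
info = Ref{Cint}()
# after the call, info[] == 0 on success, and info[] == -i if the
# i-th argument to the cuBLAS call was invalid
```
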

bjarthur commented 1 month ago

OK, so `info` is a scalar, not a vector, and B needs to be 3-D, not 2-D. With those changes my MWE above works. Will write some tests soon...
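
For the record, the MWE runs once B is given a third (batch) dimension; a sketch of the working call, using this PR's `getrs_strided_batched!`:

```julia
using CUDA

A = CUDA.rand(5, 5, 3)   # batch of 3 square systems
B = CUDA.rand(5, 2, 3)   # right-hand sides must be 3-D too, one slice per system

# LU-factorize each slice of A in place, with partial pivoting
pivot, _ = CUDA.CUBLAS.getrf_strided_batched!(A, true)

# solve each system in place; B now holds the solutions X
CUDA.CUBLAS.getrs_strided_batched!('N', A, pivot, B)
```
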

amontoison commented 1 month ago

@bjarthur Ping me when the tests are ready.

bjarthur commented 1 month ago

@amontoison ready for review

bjarthur commented 1 month ago

I should note that I tried to add tests for no pivoting, but LU factorization without pivoting is numerically unstable, so it was hard to make sure the output was correct. The existing `getrf` tests do not cover `pivot=false` either.
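
For context, that instability is easy to reproduce on the CPU with a matrix whose leading pivot is tiny; a small illustrative sketch using plain `LinearAlgebra`, unrelated to the GPU code path:

```julia
using LinearAlgebra

A = [1e-20 1.0;
     1.0   1.0]

# With partial pivoting, L*U reconstructs (the row-permuted) A accurately.
F = lu(A)
norm(F.L * F.U - A[F.p, :])   # tiny, on the order of machine epsilon

# Without pivoting, the tiny leading pivot causes catastrophic element
# growth in U, and L*U no longer resembles A.
F0 = lu(A, NoPivot())
norm(F0.L * F0.U - A)         # large, O(1) reconstruction error
```

This is why comparing `pivot=false` output against a reference is fragile: even a correct wrapper can return a factorization whose product is far from the input.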