JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

CuSparse factorizations #1396

Open ChrisRackauckas opened 2 years ago

ChrisRackauckas commented 2 years ago

MWE:

using CUDA, SparseArrays, LinearAlgebra
CUDA.allowscalar(false)
W = CUDA.CUSPARSE.CuSparseMatrixCSR(sprand(100,100,0.1))

Now, trying to factorize either errors or silently falls back to the CPU:

lu(W)

#=
MethodError: no method matching lu!(::CUDA.CUSPARSE.CuSparseMatrixCSR{Float64, Int32}, ::RowMaximum; check=true)
Closest candidates are:
  lu!(!Matched::StridedMatrix{T}, ::RowMaximum; check) where T<:Union{Float32, Float64, ComplexF32, ComplexF64} at C:\Users\accou\AppData\Local\Programs\Julia-1.7.0\share\julia\stdlib\v1.7\LinearAlgebra\src\lu.jl:80
  lu!(!Matched::Union{Hermitian{T, S}, Symmetric{T, S}} where {T, S}, ::Union{NoPivot, RowMaximum}; check) at C:\Users\accou\AppData\Local\Programs\Julia-1.7.0\share\julia\stdlib\v1.7\LinearAlgebra\src\lu.jl:88
  lu!(!Matched::StridedMatrix{T} where T, ::Union{NoPivot, RowMaximum}; check) at C:\Users\accou\AppData\Local\Programs\Julia-1.7.0\share\julia\stdlib\v1.7\LinearAlgebra\src\lu.jl:134
  ...
lu(A::CUDA.CUSPARSE.CuSparseMatrixCSR{Float64, Int32}, pivot::RowMaximum; check::Bool) at lu.jl:279
lu(A::CUDA.CUSPARSE.CuSparseMatrixCSR{Float64, Int32}, pivot::RowMaximum) at lu.jl:278
lu(A::CUDA.CUSPARSE.CuSparseMatrixCSR{Float64, Int32}) at lu.jl:278
top-level scope at test.jl:122
eval at boot.jl:373 [inlined]
=#

qr(W)

#=
LinearAlgebra.QRCompactWY{Float64, Matrix{Float64}}
Q factor:
100×100 LinearAlgebra.QRCompactWYQ{Float64, Matrix{Float64}}:
 -0.493254   0.0       0.0759867  …  -0.115005   0.0  0.0  0.0  0.0
  0.0        0.0       0.0           -0.040375   0.0  0.0  0.0  0.0
  0.0        0.0       0.0            0.186842   0.0  0.0  0.0  0.0
  0.0        0.0       0.0            0.0244859  0.0  0.0  0.0  0.0
  0.0        0.0       0.0           -0.0807226  0.0  0.0  0.0  0.0
 -0.156698   0.0       0.0241396  …   0.0316981  0.0  0.0  0.0  0.0
  0.0        0.0       0.0            0.0897126  0.0  0.0  0.0  0.0
  0.0        0.0       0.0            0.0381239  0.0  0.0  0.0  0.0
  0.0        0.0       0.0           -0.0132309  0.0  0.0  0.0  0.0
  0.0        0.0       0.0            0.0738399  0.0  0.0  0.0  0.0
  ⋮                               ⋱   ⋮                         
  0.0       -0.499697  0.0           -0.0693182  0.0  0.0  0.0  0.0
  0.0        0.0       0.0            0.051997   0.0  0.0  0.0…
=#
maleadt commented 2 years ago

This isn't just 'an overload missing'; CUSPARSE doesn't support these operations. Only incomplete factorizations for use as a preconditioner. I'm not sure if/how that generalizes (thoughts, @kshyatt?).

ChrisRackauckas commented 2 years ago

Oh, interesting. If there's a CUDA ILU, that would be great as well, and it's fine if it's non-generic. Are those operations wrapped?

maleadt commented 2 years ago

Yeah, LU and Cholesky, both here: https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusparse/preconditioners.jl
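For reference, a minimal sketch of that incomplete-LU path (assuming the `ilu02`/`ilu02!` wrappers from preconditioners.jl take a `SparseChar` index argument; the `10I` shift is only there to make the random matrix diagonally dominant so the factorization doesn't hit a zero pivot):

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra
CUDA.allowscalar(false)

# Shifted so ilu02 does not encounter a zero pivot on a random matrix.
A  = sprand(100, 100, 0.1) + 10I
dA = CuSparseMatrixCSR(A)

# Zero fill-in incomplete LU; both triangular factors are stored in-place
# in the returned matrix ('O' = one-based indexing).
P = CUSPARSE.ilu02(dA, 'O')
```

The result is only usable as a preconditioner, not as an exact `lu(A)` replacement.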

frapac commented 2 years ago

cusolverRF is able to refactorize the matrix entirely on the GPU, but it requires computing the initial symbolic factorization on the CPU, and a fixed sparsity pattern.

This matches our observation with most of the factorizations implemented in CUSPARSE: they silently offload the factorization to the host, inducing a lot of data transfer when the matrix is large. That holds even for QR, as can be observed with a profiler. To avoid this, NVIDIA provides coarser-grained control over the factorization routines in the (undocumented) header cusolverSp_LOWLEVEL_PREVIEW.h. See for example the code for the Cholesky factorization: https://github.com/tpn/cuda-samples/blob/master/v8.0/7_CUDALibraries/cuSolverSp_LowlevelCholesky/cuSolverSp_LowlevelCholesky.cpp

ChrisRackauckas commented 2 years ago

> cusolverRF is able to refactorize the matrix entirely on the GPU, but it requires computing the initial symbolic factorization on the CPU, and a fixed sparsity pattern.

That would be fine for us. We can just specialize on that in LinearSolve.jl. It would be relatively straightforward to do, if we have the right functionality to call cusolverRF.

ChrisRackauckas commented 2 years ago

@maleadt do you have an example of how to use the SparseChar?

maleadt commented 2 years ago

Set it to 'O' for one-based arrays (i.e. sparse arrays created in Julia).

https://github.com/JuliaGPU/CUDA.jl/blob/c616fe853bebb6de038241694cb42f635c76165f/lib/cusparse/types.jl#L101-L109
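Concretely, a minimal sketch (assuming `ilu02!` takes the index as its second argument, as in the wrappers linked above): anything built from a Julia `SparseMatrixCSC` wants `'O'`.

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

A = CuSparseMatrixCSR(sprand(Float64, 50, 50, 0.2) + 10I)

# 'O' = one-based indexing (Julia's convention); 'Z' would mean zero-based,
# as used by matrices coming from C code.
CUSPARSE.ilu02!(A, 'O')
```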

frapac commented 2 years ago

@ChrisRackauckas we tried to wrap cusolverRF in CUDA.jl in #828, but the PR is a bit stalled now (that's on me). We are currently using a custom wrapper for cusolverRF, with our own functions for the triangular solves using SpSV/SpSM (the sv!/sm! functions wrapped in CUSPARSE.jl just allocate too much memory for our use case, leading to unexpected crashes).

ChrisRackauckas commented 2 years ago

I almost got this all working in https://github.com/SciML/OrdinaryDiffEq.jl/issues/1566#issuecomment-1059890002, but the last step is the ldiv! action, which is missing on CuSparseMatrixCSR. It's hard to make an MWE because ilu02! throws a zero-pivot error on random matrices 😓, so that comment has the matrix needed to reproduce it. Essentially, I'm trying to figure out how to make it act just like lu(A) \ ....
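For what it's worth, the sequence I'd expect to stand in for `lu(A) \ b` with the ilu02! output is two triangular solves, along these lines (a sketch, assuming the `sv2!` wrapper with the `(transa, uplo, diag, alpha, A, X, index)` form; both factors live in-place in the factorized matrix, with L unit lower triangular):

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

A  = sprand(100, 100, 0.1) + 10I     # shifted to avoid zero pivots in ilu02!
dA = CuSparseMatrixCSR(A)
b  = CUDA.rand(Float64, 100)

P = copy(dA)
CUSPARSE.ilu02!(P, 'O')              # L and U stored in-place in P

# Emulate lu(A) \ b: solve L y = b (unit diagonal), then U x = y.
x = copy(b)
CUSPARSE.sv2!('N', 'L', 'U', 1.0, P, x, 'O')
CUSPARSE.sv2!('N', 'U', 'N', 1.0, P, x, 'O')
```

Since this is only an incomplete factorization, `x` approximates `A \ b` rather than matching it, which is fine for preconditioning but not for a drop-in `ldiv!`.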