JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Support for creating a CuSparseMatrixCSC from a CuSparseVector #2484

Open marcdelabarrera opened 1 week ago

marcdelabarrera commented 1 week ago

Multiplying a CuSparseMatrixCSC by a CuSparseVector returns a CuArray instead of a CuSparseVector.

using CUDA
using CUDA.CUSPARSE
using Random
using SparseArrays
Random.seed!(123)

N = 100
q = CuSparseVector(sprand(N, 0.2))
M = sprand(Float64, N, N, 0.2)
M = CUSPARSE.CuSparseMatrixCSC(CuVector(M.colptr), CuVector(M.rowval), CuVector(M.nzval), (N, N))
M*q+q

This fails with the error: broadcast with sparse arrays is currently only implemented for CSR and CSC matrices (M*q returns a dense CuArray, and adding the sparse vector q to it goes through sparse broadcasting).

Is there a way to handle these operations? An obvious one is to convert the CuArray back to a CuSparseVector and proceed with the code, but I am not sure if this is what we are expected to do.
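
For reference, a minimal sketch of that roundtrip through the CPU (slow for large problems, but it restores the sparse type):

y = M * q                                        # dense CuVector
y_sparse = CuSparseVector(sparsevec(Vector(y)))  # copy to host, re-sparsify, copy back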

maleadt commented 1 week ago

M = CUSPARSE.CuSparseMatrixCSC(CuVector(M.colptr), CuVector(M.rowval), CuVector(M.nzval), (N, N))

You can simply use cu:

julia> cu(M)
100×100 CuSparseMatrixCSC{Float32, Int32} with 1965 stored entries:
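
Note that cu also converts the values to Float32, as the output above shows; if you need to keep Float64, the direct constructor from a CPU SparseMatrixCSC should work as well:

M = CuSparseMatrixCSC(sprand(Float64, N, N, 0.2))   # stays Float64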

About the 'issue' that CuSparseMatrixCSC * CuSparseVector results in a CuArray: which CUSPARSE API would you want this to use instead? AFAIK, CUSPARSE doesn't have much sparse * sparse functionality. Right now, the sparsity of the vector gets dropped by CUDA.jl in order to be able to use cusparseSpMV: https://github.com/JuliaGPU/CUDA.jl/blob/d72cdaa324cfadc1e67a7aa7fb9c4b035d2ec07c/lib/cusparse/interfaces.jl#L76-L79
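
A quick way to confirm that densification at the REPL:

julia> M * q isa CuArray   # the result is dense, not a CuSparseVector
true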

marcdelabarrera commented 1 week ago

Thank you for the tip on getting M to a CuSparseMatrixCSC the way it has to be (I'm new to CUDA).

For context, I'm trying to replicate code that solves a Linear Complementarity Problem with sparse matrices on the GPU (there are existing codes for MATLAB and non-GPU Julia).

CUDA has support for sparse matrix * sparse matrix multiplication via cusparseSpGEMM, which is why I was surprised that there is no support for sparse matrix * sparse vector. I can convert the vector q into an N×1 sparse matrix, or I can do the multiplication, get a CuArray, and then transform it back into a CuSparseVector. Both options seem inefficient.

maleadt commented 1 week ago

CUDA has support for sparse matrix * sparse matrix multiplication via cusparseSpGEMM, which is why I was surprised that there is no support for sparse matrix * sparse vector.

Ah, right. So what you're suggesting is that the generic_matvecmul! method above could instead reinterpret the input sparse vector as a 1-column CSC sparse matrix and use cusparseSpGEMM? Although that would preserve sparsity, the performance with such inputs may be bad.

Quickly testing this out using:

# wrap a sparse vector as a single-column CSC matrix:
# the one-entry column pointer spans all stored values
function CuSparseMatrixCSC(vec::CuSparseVector{Tv,Ti}) where {Tv,Ti}
    colPtr = CuVector{Ti}([1, vec.nnz + 1])
    CuSparseMatrixCSC{Tv,Ti}(colPtr, vec.iPtr, vec.nzVal, (vec.len, 1))
end

q2 = CuSparseMatrixCSC(q)   # q as an N×1 sparse matrix
julia> @benchmark CUDA.@sync M*q
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  33.730 μs …  21.865 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.319 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.661 μs ± 218.323 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▁▂▅▅█▇█▆▆▃▄▃▂▃
  ▂▂▂▂▂▂▃▃▄▅▆████████████████▇▇▆▄▄▄▃▃▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂ ▄
  33.7 μs         Histogram: frequency by time         38.8 μs <

 Memory estimate: 3.39 KiB, allocs estimate: 139.

julia> @benchmark CUDA.@sync M*q2
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  141.289 μs … 705.674 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     145.888 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   146.093 μs ±   6.037 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                        ▁▂▄█▂▁
  ▂▁▁▂▂▂▂▂▂▂▂▃▃▄▄▄▅▅▆▇▇████████▆▆▆▄▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  141 μs           Histogram: frequency by time          152 μs <

 Memory estimate: 3.41 KiB, allocs estimate: 137.

So doing the sparse multiplication is way slower.

Also, isn't a dense output kind of reasonable here, as a sparse matrix * sparse vector product is likely to yield a vector with few zeros? At least it does in your example. I'm not too familiar with the domain, but if your sparsity pattern is special, maybe you can create a PR "productionizing" the conversion above so that it can be used to dispatch to cusparseSpGEMM, without changing the current behavior.
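
A back-of-envelope check of that intuition, assuming independent uniformly random sparsity patterns: entry i of M*q is structurally zero only if no index j has both M[i,j] and q[j] stored, which happens with probability about (1 - p_M*p_q)^N.

(1 - 0.2*0.2)^100   # ≈ 0.017: only ~1.7% of the entries of M*q are expected to be zero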

marcdelabarrera commented 1 week ago

Thank you, and I guess you are right: my matrix is very sparse (way more than 80% zeros), but my vector is not that sparse, so the resulting vector has almost all entries nonzero anyway.

maleadt commented 1 week ago

Alright, I think we can narrow the scope to making this conversion possible then.

amontoison commented 4 days ago

The output of sparse matrix * sparse vector is a dense vector in many cases. We can add a dispatch for it, but it will be slow unless the user knows that it can be applied efficiently to their problem.

amontoison commented 4 days ago

We can reuse this trick: https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusparse/extra.jl#L84

I implemented it so that we can add two sparse vectors; NVIDIA only provides a routine for adding two sparse matrices.
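
For reference, with that trick in place, sparse vector addition works directly (assuming a CUDA.jl version that includes this dispatch):

r = q + q   # returns a CuSparseVector instead of erroring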

maleadt commented 4 days ago

We can reuse this trick: https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusparse/extra.jl#L84

That's the same as https://github.com/JuliaGPU/CUDA.jl/issues/2484#issuecomment-2331822558, right? It should use a single column, Julia being column major.

amontoison commented 4 days ago

We can reuse this trick: https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusparse/extra.jl#L84

That's the same as #2484 (comment), right? It should use a single column, Julia being column major.

Yes, but you add an additional layer of transformation if you use CSC matrices.

Julia is column major, but many CUSPARSE routines are only implemented for sparse CSR matrices. Sparse matrix * sparse matrix is one example of a CUSPARSE routine implemented only for CSR * CSR. When something does work with CSC matrices (like sparse triangular solves or sparse CSC * CSC), it's because we exploit the relation between A and A' for the CSC / CSR formats: https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusparse/generic.jl#L567

maleadt commented 4 days ago

Oh right, I missed that it's CSR instead of CSC. I guess we could just provide both constructors, as this is going to be a manually-invoked conversion anyway.
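
A possible sketch of the CSR counterpart (a hypothetical helper, not an existing CUDA.jl constructor; unlike the single-column CSC case, an N×1 CSR matrix needs a full row pointer, built here with a GPU scatter and a cumsum):

# hypothetical: wrap a sparse vector as an N×1 CSR matrix
function vec_as_csr(vec::CuSparseVector{Tv,Ti}) where {Tv,Ti}
    n = vec.len
    counts = CUDA.zeros(Ti, n)
    counts[vec.iPtr] .= one(Ti)                      # mark rows that hold a stored entry
    rowPtr = vcat(CuVector{Ti}([one(Ti)]), one(Ti) .+ cumsum(counts))
    colVal = CUDA.ones(Ti, vec.nnz)                  # every stored entry lives in column 1
    CuSparseMatrixCSR{Tv,Ti}(rowPtr, colVal, vec.nzVal, (n, 1))
end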

amontoison commented 4 days ago

Yes, I can open a PR later this week for that.