JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Addition and multiplication over CuArray and CuSparse arrays generate inconsistent output types or errors #1113

Open yuehhua opened 3 years ago

yuehhua commented 3 years ago
julia> using CUDA, SparseArrays

julia> dense = rand([0,1], 3, 4)  # Matrix{Int64}

julia> sp = sparse(dense)  # SparseMatrixCSC{Int64, Int64}

julia> cuarray = cu(dense)  # CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}

julia> cusparse = cu(sp);  # CUDA.CUSPARSE.CuSparseMatrixCSC{Int64}

It is reasonable that adding a dense array and a sparse array gives a dense array.

julia> typeof(dense + sp)
Matrix{Int64} (alias for Array{Int64, 2})

julia> typeof(sp + dense)
Matrix{Int64} (alias for Array{Int64, 2})

But in CUDA, adding a CuArray and a CuSparse array gives an error related to broadcasting.

julia> typeof(cuarray + cusparse)
ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKern...

julia> typeof(cusparse + cuarray)
ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKern...

Adding two CUDA sparse arrays together gives an UndefVarError.

julia> typeof(cusparse + cusparse)
ERROR: UndefVarError: geam not defined
Stacktrace:
 [1] +(A::CUDA.CUSPARSE.CuSparseMatrixCSC{Int64}, B::CUDA.CUSPARSE.CuSparseMatrixCSC{Int64})
   @ CUDA.CUSPARSE ~/.julia/packages/CUDA/VGl9W/lib/cusparse/interfaces.jl:83

If we check matrix multiplication, it gives different array types.

julia> cusparse = cu(sp')
4×3 adjoint(::CUDA.CUSPARSE.CuSparseMatrixCSC{Int64}) with eltype Int64:
┌ Warning: Performing scalar indexing on task Task (runnable) @0x00007f0e555cc160.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArrays ~/.julia/packages/GPUArrays/UBzTm/src/host/indexing.jl:56
 1  1  1
 0  1  0
 0  0  1
 1  1  0

julia> typeof(cuarray * cusparse)
┌ Warning: Performing scalar indexing on task Task (runnable) @0x00007f0e54d8c010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArrays ~/.julia/packages/GPUArrays/UBzTm/src/host/indexing.jl:56
Matrix{Int64} (alias for Array{Int64, 2})

julia> typeof(cusparse * cuarray)
CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}

julia> typeof(cusparse * cu(sp))
Matrix{Int64} (alias for Array{Int64, 2})

This issue is partially reported in JuliaGPU/CUDA.jl#829.

System information:

julia> versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 12

CUDA.jl version information:

(@v1.6) pkg> st CUDA
      Status `~/.julia/environments/v1.6/Project.toml`
  [052768ef] CUDA v3.4.1
maleadt commented 3 years ago

Yeah, the CUSPARSE wrappers are underdeveloped. It'd take somebody with some knowledge of SparseArrays to clean it up and make the API consistent. Most of the low-level wrappers are there, so it shouldn't be too difficult.

Note that you should always be running with allowscalar(false); that 'warning' you see there is just missing functionality.
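For example, a minimal sketch mirroring the session above (this needs a CUDA-capable GPU; the exact error text may differ between GPUArrays versions): with scalar indexing disallowed, the slow element-by-element CPU fallback throws an error instead of printing a warning.

```julia
using CUDA, SparseArrays

CUDA.allowscalar(false)  # make scalar indexing a hard error instead of a warning

cuarray  = cu(rand([0, 1], 3, 4))           # CuArray{Int64, 2}
cusparse = cu(sparse(rand([0, 1], 4, 3)))   # CuSparseMatrixCSC{Int64}

# Any fallback that iterates the GPU array element by element now throws
# "Scalar indexing is disallowed." rather than silently crawling on the CPU.
cuarray * cusparse'
```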

maleadt commented 3 years ago

I've reverted https://github.com/JuliaGPU/CUDA.jl/pull/1152 for now; we're close to a release, and I don't want to support the functionality noted in https://github.com/JuliaGPU/CUDA.jl/issues/1188. I should have paid better attention while reviewing, sorry.

yuehhua commented 3 years ago

I checked #1188. I agree with your opinion, but maybe we can have sparse*dense operations and keep the sparseness property as well?

The reason to drop sparseness is that the result of a sparse*dense operation is nearly dense.

julia> A = sprand(10, 10, 0.3);

julia> B = rand(10, 10);

julia> A*B
10×10 Matrix{Float64}:
 0.465076  0.527751  0.35038    0.586986  0.593877  0.585854  0.545579  0.925054  0.390776  0.997304
 0.509297  1.23225   0.621969   0.810766  1.41307   0.785755  1.14377   1.72318   0.515378  1.55164
 0.462596  1.28949   0.387597   0.684605  1.2138    0.811949  1.62566   0.940938  0.558219  1.77915
 1.34891   1.83833   1.34011    1.90552   1.73739   2.11524   2.00659   2.88412   1.73676   2.8475
 0.928638  1.28882   0.0640811  1.11238   1.05766   0.80175   1.40701   1.09999   1.10325   1.23723
 0.510428  1.51749   1.16855    0.842929  1.29129   1.09404   1.33339   1.5851    1.14017   1.15963
 0.622343  1.03237   0.899818   0.723428  1.62032   0.880723  1.43334   1.47368   0.532667  1.33439
 0.452926  0.608755  0.278723   0.688802  0.44316   0.404797  0.398763  0.934311  0.451508  0.770635
 0.305235  0.592125  0.344045   0.286572  0.894063  0.324949  0.840907  0.826808  0.333906  0.35335
 0.785867  0.688865  0.343108   0.725694  0.907075  0.95241   1.15003   1.11144   0.812543  1.10072

The design in the Julia standard library also drops the sparseness property. In my opinion, that is reasonable, but keeping sparseness would also be acceptable to me.
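The fill-in effect can be checked quickly on the CPU (a sketch; the exact density depends on the random draw, but with ~30% nonzeros per row of A and a fully dense B, nearly every entry of the product is nonzero):

```julia
using SparseArrays

A = sprand(10, 10, 0.3)   # ~30% nonzeros
B = rand(10, 10)
C = A * B

# Fraction of nonzero entries in the product: a row of C is entirely zero
# only if the corresponding row of A has no stored entries, so the density
# is close to 1.
density = count(!iszero, C) / length(C)
```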

maleadt commented 3 years ago

It's not that dropping sparseness for the output is bad, it's that the PR eagerly promoted the sparse input to dense, and performed the multiplication using CUBLAS. If CUSPARSE doesn't have a way to do sparse*dense, I think it's misleading to pretend it does.

Base does still perform the multiplication sparsely: https://github.com/JuliaLang/julia/blob/2f00fe1d10eb54ee697abf09169b396b9264cb53/stdlib/SparseArrays/src/linalg.jl#L30-L48

yuehhua commented 3 years ago

Besides sparse*dense operations, the main purpose of this issue and #1152 is to provide the addition and multiplication operations. But the addition operations were reverted in #1188 as well. I am wondering why the addition operations were reverted, since they are independent of the sparse*dense operations.

If CUSPARSE doesn't have a way to do sparse*dense, I think it's misleading to pretend it does.

If we don't provide operations that CUSPARSE doesn't have, another question is where users should get operations like sparse*dense that are not supported natively in CUSPARSE.

maleadt commented 3 years ago

I am wondering why the addition operations were reverted, since they are independent of the sparse*dense operations.

I just reverted the entire PR, that was easier. Happy to see the noncontroversial bits restored though :slightly_smiling_face:

If we don't provide operations that CUSPARSE doesn't have, another question is where users should get operations like sparse*dense that are not supported natively in CUSPARSE.

Right, that's exactly the question. I think it's better to report an error, maybe even a helpful one ("CUSPARSE doesn't support this operation, please use dense multiplication"), than to just promote to dense and perform the operation using CUBLAS. The sparse input might not even fit in memory when converted to dense.
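One way such an explicit failure could look (a hypothetical sketch, not CUDA.jl's actual code; the method signature and message are illustrative only):

```julia
using CUDA, CUDA.CUSPARSE, LinearAlgebra

# Hypothetical: instead of promoting the sparse operand to dense behind the
# user's back, refuse with an actionable message.
function LinearAlgebra.mul!(C::CuMatrix, A::CuMatrix, B::CuSparseMatrixCSC)
    error("CUSPARSE has no routine for this dense*sparse combination; " *
          "densify explicitly with `CuArray(B)` if the memory cost is acceptable.")
end
```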

Longer term, the approach we've taken with other libraries is to provide native implementations that extend the applicability (e.g. a native gemm that works with all types). https://github.com/JuliaGPU/CUDA.jl/pull/1106 is a step in that direction.

yuehhua commented 3 years ago

OK, so let me check the whole picture.

We hope to have operations defined as kernels rather than only wrapped from the CUDA libraries, right? We only need simple wrappers around the CUDA libraries and can build more features on top of them with kernels.

maleadt commented 3 years ago

We hope to have operations defined as kernels rather than only wrapped from the CUDA libraries, right? We only need simple wrappers around the CUDA libraries and can build more features on top of them with kernels.

We try to use libraries as much as possible, typically because they perform well, and rely on our own kernels when functionality is missing (or performance is lacking).

yuehhua commented 3 years ago

OK, I will implement these functionalities in pieces, and you can check whether the direction fits the whole picture.

CarloLucibello commented 3 years ago

It looks like CUSPARSE supports dense-matrix times sparse-matrix multiplication, and we wrap it: https://github.com/JuliaGPU/CUDA.jl/blob/dae1e183891577f6e477ecb5167b971812b05c31/lib/cusparse/generic.jl#L153. Supposedly we also hook it up to LinearAlgebra.mul! in https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusparse/interfaces.jl, so I don't understand why we fall back to the generic matmul in this example:

julia> using CUDA, CUDA.CUSPARSE, SparseArrays

julia> CUDA.allowscalar(false)

julia> Acsr = CuSparseMatrixCSR(sprand(Float32, 5, 5, 0.3))
5×5 CuSparseMatrixCSR{Float32, Int32} with 6 stored entries:
  ⋅          0.24761927   ⋅           ⋅         ⋅ 
 0.29497266  0.6143769    ⋅          0.996209   ⋅ 
  ⋅          0.99555695   ⋅           ⋅         ⋅ 
  ⋅           ⋅           ⋅           ⋅         ⋅ 
  ⋅           ⋅          0.17921269   ⋅         ⋅ 

julia> x = CUDA.ones(2,5)
2×5 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0

julia> x * Acsr
ERROR: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:33
  [2] assertscalar(op::String)
    @ GPUArrays ~/.julia/packages/GPUArrays/3sW6s/src/host/indexing.jl:53
  [3] getindex(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64)
    @ GPUArrays ~/.julia/packages/GPUArrays/3sW6s/src/host/indexing.jl:86
  [4] _generic_matmatmul!(C::Matrix{Float32}, tA::Char, tB::Char, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::CuSparseMatrixCSR{Float32, Int32}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:835
  [5] generic_matmatmul!(C::Matrix{Float32}, tA::Char, tB::Char, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::CuSparseMatrixCSR{Float32, Int32}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:802
  [6] mul!
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:302 [inlined]
  [7] mul!
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:275 [inlined]
  [8] *(A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::CuSparseMatrixCSR{Float32, Int32})
    @ LinearAlgebra /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:153
  [9] top-level scope
    @ REPL[42]:1
 [10] top-level scope
    @ ~/.julia/packages/CUDA/NQtsu/src/initialization.jl:52
CarloLucibello commented 3 years ago

Actually sparse * dense works.

julia> Acsr * x'
5×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.408549  0.408549
 0.188773  0.188773
 0.449189  0.449189
 0.974001  0.974001
 0.765507  0.765507

We just have to add methods for dense * sparse.
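Until those methods exist, one possible workaround is to rewrite dense * sparse in terms of the sparse * dense path via adjoints, since x * A == (A' * x')' (a sketch; it requires a GPU and assumes the adjoint wrappers are hooked up in your CUDA.jl version):

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

Acsr = CuSparseMatrixCSR(sprand(Float32, 5, 5, 0.3))
x = CUDA.ones(2, 5)

# x * Acsr is not hooked up, but Acsr' * x' can dispatch to the CUSPARSE
# sparse*dense kernel; taking the adjoint again recovers x * Acsr.
y = (Acsr' * x')'
```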

mzy2240 commented 1 year ago

Just wondering, is there any performance advantage to using the GPU for this kind of sparse * dense operation? Say, for a 10_000×10_000 sparse matrix with very low density (0.1% nonzeros) and a similarly sized dense matrix, will it be faster than a multithreaded mul!?
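A rough way to answer that empirically (a sketch; results depend heavily on your hardware, and note that `CUDA.@sync` is needed so the timing includes kernel completion, not just kernel launch):

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra, BenchmarkTools

n = 10_000
A = sprand(Float32, n, n, 0.001)   # 0.1% nonzeros
B = rand(Float32, n, n)
C = zeros(Float32, n, n)

dA = CuSparseMatrixCSR(A)
dB = CuArray(B)
dC = CUDA.zeros(Float32, n, n)

@btime mul!($C, $A, $B)                # CPU sparse*dense via SparseArrays
@btime CUDA.@sync mul!($dC, $dA, $dB)  # GPU sparse*dense via CUSPARSE
```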