Yeah, the CUSPARSE wrappers are underdeveloped. It'd take somebody with some knowledge of SparseArrays to clean it up and make the API consistent. Most of the low-level wrappers are there, so it shouldn't be too difficult.
Note that you should always be running with allowscalar(false); that 'warning' you see there is just missing functionality.
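For reference, turning scalar indexing off up front is a one-liner (it also appears in the REPL session further down):

using CUDA
CUDA.allowscalar(false)  # make accidental scalar indexing of GPU arrays an error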
I've reverted https://github.com/JuliaGPU/CUDA.jl/pull/1152 for now; we're close to a release, and I don't want to support the functionality noted in https://github.com/JuliaGPU/CUDA.jl/issues/1188. I should have paid better attention while reviewing, sorry.
I checked #1188. I agree with your opinion, but maybe we can have sparse*dense operations and keep the sparseness property as well? The reason to drop sparseness is that the result of a sparse*dense operation is nearly a dense matrix.
julia> A = sprand(10, 10, 0.3);
julia> B = rand(10, 10);
julia> A*B
10×10 Matrix{Float64}:
0.465076 0.527751 0.35038 0.586986 0.593877 0.585854 0.545579 0.925054 0.390776 0.997304
0.509297 1.23225 0.621969 0.810766 1.41307 0.785755 1.14377 1.72318 0.515378 1.55164
0.462596 1.28949 0.387597 0.684605 1.2138 0.811949 1.62566 0.940938 0.558219 1.77915
1.34891 1.83833 1.34011 1.90552 1.73739 2.11524 2.00659 2.88412 1.73676 2.8475
0.928638 1.28882 0.0640811 1.11238 1.05766 0.80175 1.40701 1.09999 1.10325 1.23723
0.510428 1.51749 1.16855 0.842929 1.29129 1.09404 1.33339 1.5851 1.14017 1.15963
0.622343 1.03237 0.899818 0.723428 1.62032 0.880723 1.43334 1.47368 0.532667 1.33439
0.452926 0.608755 0.278723 0.688802 0.44316 0.404797 0.398763 0.934311 0.451508 0.770635
0.305235 0.592125 0.344045 0.286572 0.894063 0.324949 0.840907 0.826808 0.333906 0.35335
0.785867 0.688865 0.343108 0.725694 0.907075 0.95241 1.15003 1.11144 0.812543 1.10072
The design in Julia core also drops the sparseness property. In my opinion that is reasonable, but keeping sparseness is also acceptable to me.
It's not that dropping sparseness for the output is bad, it's that the PR eagerly promoted the sparse input to dense, and performed the multiplication using CUBLAS. If CUSPARSE doesn't have a way to do sparse*dense, I think it's misleading to pretend it does.
Base does still perform the multiplication sparsely: https://github.com/JuliaLang/julia/blob/2f00fe1d10eb54ee697abf09169b396b9264cb53/stdlib/SparseArrays/src/linalg.jl#L30-L48
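For context on what "performing the multiplication sparsely" means: the stdlib code iterates only over the stored entries of the sparse operand, roughly like this simplified sketch (my own names, not the linked implementation):

using SparseArrays

# simplified column-by-column sparse * dense without densifying A
function spmatmul_dense(A::SparseMatrixCSC, B::AbstractMatrix)
    m, n = size(A)
    C = zeros(promote_type(eltype(A), eltype(B)), m, size(B, 2))
    rows = rowvals(A)
    vals = nonzeros(A)
    for col in 1:size(B, 2), j in 1:n
        b = B[j, col]
        for idx in nzrange(A, j)            # only the stored entries of column j of A
            C[rows[idx], col] += vals[idx] * b
        end
    end
    return C
end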
Besides sparse*dense operations, the main purpose of this issue and #1152 is to provide addition and multiplication operations. But the addition operations were reverted in #1188 as well. I am wondering why the addition operations were reverted, since they are independent of the sparse*dense operations.
If CUSPARSE doesn't have a way to do sparse*dense, I think it's misleading to pretend it does.
If we don't provide operations that CUSPARSE doesn't have, another question is where users should get operations like sparse*dense that are not supported natively in CUSPARSE.
I am wondering why the addition operations were reverted, since they are independent of the sparse*dense operations.
I just reverted the entire PR, that was easier. Happy to see the noncontroversial bits restored though :slightly_smiling_face:
If we don't provide operations that CUSPARSE doesn't have, another question is where users should get operations like sparse*dense that are not supported natively in CUSPARSE.
Right, that's exactly the question. I think it's better to report an error, maybe even a helpful one ("CUSPARSE doesn't support this operation, please use dense multiplication") than to just promote to dense and perform the operation using CUBLAS. The sparse input might not even fit in memory when converted to dense.
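A minimal sketch of what that could look like (illustrative names, not the actual CUDA.jl code):

function unsupported_mul!(C, A, B)
    # fail loudly with guidance instead of silently densifying A and calling CUBLAS
    throw(ArgumentError(
        "CUSPARSE does not support this operation; if densifying is acceptable, " *
        "convert explicitly (e.g. `CuMatrix(A)`) and use dense multiplication."))
end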
Longer term, the approach we've taken with other libraries is to provide native implementations that extend the applicability (e.g. a native gemm that works with all types). https://github.com/JuliaGPU/CUDA.jl/pull/1106 is a step in that direction.
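To make the idea concrete, a naive native dense matmul kernel could look roughly like this (a sketch with my own names, one thread per output element; #1106 and the real generic matmul are more sophisticated):

using CUDA

function matmul_kernel!(C, A, B)
    # compute one output element C[i, j] per thread
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(C, 1) && j <= size(C, 2)
        acc = zero(eltype(C))
        for k in 1:size(A, 2)
            @inbounds acc += A[i, k] * B[k, j]
        end
        @inbounds C[i, j] = acc
    end
    return nothing
end

function native_matmul!(C::CuMatrix, A::CuMatrix, B::CuMatrix)
    threads = (16, 16)
    blocks = (cld(size(C, 1), threads[1]), cld(size(C, 2), threads[2]))
    @cuda threads=threads blocks=blocks matmul_kernel!(C, A, B)
    return C
end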
OK, so let me check the whole picture.
We hope to have operations defined as kernels rather than as wrappers around the CUDA libraries, right? We would only need thin wrappers over the CUDA libraries and build more features on top of them with kernels.
We hope to have operations defined as kernels rather than as wrappers around the CUDA libraries, right? We would only need thin wrappers over the CUDA libraries and build more features on top of them with kernels.
We try to use libraries as much as possible, typically because they perform well, and rely on our own kernels when functionality is missing (or performance is lacking).
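As an illustration of that split (the names and the type check are my own, not CUDA.jl internals), dispatch to the vendor library when it covers the case and fall back to a native kernel otherwise, reusing native_matmul! from the sketch above:

using CUDA

# assumption: only these element types take the library path in this sketch
library_supported(::Type{T}) where {T} = T <: Union{Float32, Float64}

function my_mul!(C::CuMatrix{T}, A::CuMatrix{T}, B::CuMatrix{T}) where {T}
    if library_supported(T)
        CUDA.CUBLAS.gemm!('N', 'N', one(T), A, B, zero(T), C)  # library path
    else
        native_matmul!(C, A, B)  # native kernel fallback (see the sketch above)
    end
    return C
end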
OK, I will implement these functionalities in pieces, and you can check whether the direction fits the whole picture.
It looks like CUSPARSE supports dense-mat times sparse-mat multiplication and we wrap it:
https://github.com/JuliaGPU/CUDA.jl/blob/dae1e183891577f6e477ecb5167b971812b05c31/lib/cusparse/generic.jl#L153
and supposedly we hook it up to LinearAlgebra.mul!:
https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusparse/interfaces.jl
so I don't understand why we fall back to the generic matmul in this example:
julia> using CUDA, CUDA.CUSPARSE, SparseArrays
julia> CUDA.allowscalar(false)
julia> Acsr = CuSparseMatrixCSR(sprand(Float32, 5, 5, 0.3))
5×5 CuSparseMatrixCSR{Float32, Int32} with 6 stored entries:
⋅ 0.24761927 ⋅ ⋅ ⋅
0.29497266 0.6143769 ⋅ 0.996209 ⋅
⋅ 0.99555695 ⋅ ⋅ ⋅
⋅ ⋅ ⋅ ⋅ ⋅
⋅ ⋅ 0.17921269 ⋅ ⋅
julia> x = CUDA.ones(2,5)
2×5 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0
julia> x * Acsr
ERROR: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] assertscalar(op::String)
@ GPUArrays ~/.julia/packages/GPUArrays/3sW6s/src/host/indexing.jl:53
[3] getindex(::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64)
@ GPUArrays ~/.julia/packages/GPUArrays/3sW6s/src/host/indexing.jl:86
[4] _generic_matmatmul!(C::Matrix{Float32}, tA::Char, tB::Char, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::CuSparseMatrixCSR{Float32, Int32}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
@ LinearAlgebra /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:835
[5] generic_matmatmul!(C::Matrix{Float32}, tA::Char, tB::Char, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::CuSparseMatrixCSR{Float32, Int32}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
@ LinearAlgebra /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:802
[6] mul!
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:302 [inlined]
[7] mul!
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:275 [inlined]
[8] *(A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::CuSparseMatrixCSR{Float32, Int32})
@ LinearAlgebra /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:153
[9] top-level scope
@ REPL[42]:1
[10] top-level scope
@ ~/.julia/packages/CUDA/NQtsu/src/initialization.jl:52
Actually sparse * dense works.
julia> Acsr * x'
5×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.408549 0.408549
0.188773 0.188773
0.449189 0.449189
0.974001 0.974001
0.765507 0.765507
We just have to add methods for dense * sparse.
CuSparseMatrixCSC
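Until such methods land, here is a rough sketch of what a native dense-times-CSR product could look like (my own code, not CUDA.jl's; it assumes the one-based rowPtr/colVal/nzVal fields of CuSparseMatrixCSR, and a real method would rather forward to CUSPARSE's SpMM with swapped transposes):

using CUDA, CUDA.CUSPARSE, SparseArrays

# C[i, j] += X[i, k] * A[k, j]; one thread owns one row of C, so no atomics are needed
function dense_times_csr_kernel!(C, X, rowPtr, colVal, nzVal)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(C, 1)
        for k in 1:size(X, 2)
            xik = X[i, k]
            for idx in rowPtr[k]:(rowPtr[k+1] - 1)   # stored entries of row k of A
                @inbounds C[i, colVal[idx]] += xik * nzVal[idx]
            end
        end
    end
    return nothing
end

function dense_times_csr(X::CuMatrix{T}, A::CuSparseMatrixCSR{T}) where {T}
    C = CUDA.zeros(T, size(X, 1), size(A, 2))
    threads = 256
    blocks = cld(size(C, 1), threads)
    @cuda threads=threads blocks=blocks dense_times_csr_kernel!(
        C, X, A.rowPtr, A.colVal, A.nzVal)
    return C
end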
Just wondering, is there any performance advantage to using the GPU for this kind of sparse * dense operation? Say, for a 10_000x10_000 sparse matrix with very few nonzeros (around 0.1% density) and a similar-size dense matrix, will it be faster than a multithreaded mul! on the CPU?
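One way to answer that empirically (a sketch using the sizes from the question; no claim about the outcome, which depends on the GPU, the CPU thread count, and the CUSPARSE algorithm):

using BenchmarkTools, CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

A = sprand(Float32, 10_000, 10_000, 0.001)    # ~0.1% stored entries
B = rand(Float32, 10_000, 10_000)
C = zeros(Float32, 10_000, 10_000)

dA = CuSparseMatrixCSR(A)
dB = CuArray(B)
dC = CUDA.zeros(Float32, 10_000, 10_000)

@btime mul!($C, $A, $B)                       # CPU sparse * dense
@btime CUDA.@sync mul!($dC, $dA, $dB)         # GPU path via CUSPARSE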
It is reasonable that adding a dense array and a sparse array gives a dense array. But in CUDA, adding a CuArray and a CuSparse array gives an error related to broadcasting, and adding two CUDA sparse arrays together gives an 'undefined' error. If we check matrix multiplication, it gives different array types. This issue is partially reported in JuliaGPU/CUDA.jl#829.
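A minimal reproduction of the behaviour described above might look like this (each line run separately in the REPL; the exact error messages depend on the CUDA.jl version):

using CUDA, CUDA.CUSPARSE, SparseArrays
CUDA.allowscalar(false)

A = CuArray(rand(Float32, 5, 5))                   # dense GPU array
S = CuSparseMatrixCSR(sprand(Float32, 5, 5, 0.3))  # sparse GPU array

A + S   # expected: a dense CuArray; observed: a broadcast-related error
S + S   # observed: an 'undefined' error
S * A   # works, but returns a dense CuArray rather than a sparse result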