JuliaSparse / SparseArrays.jl

SparseArrays.jl is a Julia stdlib
https://sparsearrays.juliasparse.org/

sparse matrix * sparse vector is slow when the vector is relatively dense #58

Open augustinyiptong opened 7 years ago

augustinyiptong commented 7 years ago

Hi,

The current implementation seems to produce a dense result vector which is then converted to sparse before returning. I ran into memory scaling issues. Perhaps something along the lines of my_mul below would be worth having?

A = sprand(100000000,1000,.00001)
x = sprand(1000,  0.3)

@time y1=A*x
@time y2=my_mul(A,x)

@assert y1.nzval==y2.nzval
@assert y1.nzind==y2.nzind

 0.511000 seconds (11 allocations: 778.183 MB, 16.05% gc time)
 0.073359 seconds (59 allocations: 21.794 MB)
function my_mul{TvA,TiA,TvX,TiX}(A::SparseMatrixCSC{TvA,TiA}, x::AbstractSparseVector{TvX,TiX})
    m, n = size(A)
    length(x) == n || throw(DimensionMismatch())
    Tv = promote_type(TvA, TvX)
    Ti = promote_type(TiA, TiX)

    xnzind = x.nzind
    xnzval = x.nzval
    Acolptr = A.colptr
    Arowval = A.rowval
    Anzval = A.nzval
    ynzind=Ti[]
    ynzval=Tv[]

    m == 0 && return sparsevec(ynzind, ynzval, m)

    @inbounds for i = 1:length(xnzind)
        v = xnzval[i]
        if v != zero(v)
            j = xnzind[i]
            for r = Acolptr[j]:(Acolptr[j+1]-1)
                push!(ynzind, Arowval[r])
                push!(ynzval, Anzval[r] * v)
            end
        end
    end

    return sparsevec(ynzind, ynzval, m)
end
Sacha0 commented 7 years ago

Hello Augustin! Out of curiosity, what is your application? The extreme aspect ratio matrix with which you benchmark (A = sprand(10^8, 10^3, 10.0^(-5))) is a pathological case for the existing approach; note that the existing approach performs substantially better than my_mul in e.g. the following more typical case:

julia> using BenchmarkTools

julia> A = sprand(10^3, 10^3, 10.0^(-2)); # square, about ten entries per row/column

julia> x = sprand(10^3, 10.0^(-2)); # about ten entries

julia> @benchmark $A*$x
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  23.84 kb
  allocs estimate:  4
  minimum time:     2.37 μs (0.00% GC)
  median time:      3.29 μs (0.00% GC)
  mean time:        6.84 μs (31.31% GC)
  maximum time:     369.23 μs (96.68% GC)

julia> @benchmark my_mul($A, $x)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  15.38 kb
  allocs estimate:  25
  minimum time:     21.09 μs (0.00% GC)
  median time:      21.94 μs (0.00% GC)
  mean time:        23.96 μs (2.99% GC)
  maximum time:     1.84 ms (96.69% GC)

To check whether the existing code has room for (simple) improvement, I sketched a set of naive gather-scatter sparse-sparse matvec methods from scratch (code below, not performance tuned, and could probably avoid the O(size(A, 1)) scan over the workspace in the gather). These methods (spmatspvec) perform about as well as (or slightly better than) the existing code on the square case above (allocation down by about a factor of two, and likely as a result less runtime variance):

julia> @benchmark spmatspvec($A, $x, spmatspvec_alloc($A, $x))
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     9
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  10.41 kb
  allocs estimate:  4
  minimum time:     2.18 μs (0.00% GC)
  median time:      2.72 μs (0.00% GC)
  mean time:        4.11 μs (19.71% GC)
  maximum time:     302.21 μs (97.62% GC)

The story is similar for the case you benchmarked, though with a far more modest allocation reduction (the scatter-gather workspace allocation dominates):

julia> @benchmark $A*$x
BenchmarkTools.Trial:
  samples:          11
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  778.20 mb
  allocs estimate:  7
  minimum time:     421.87 ms (2.97% GC)
  median time:      466.40 ms (11.56% GC)
  mean time:        470.39 ms (12.19% GC)
  maximum time:     551.85 ms (24.54% GC)

julia> @benchmark my_mul($A, $x)
BenchmarkTools.Trial:
  samples:          89
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  21.23 mb
  allocs estimate:  52
  minimum time:     52.13 ms (0.00% GC)
  median time:      56.15 ms (3.34% GC)
  mean time:        56.29 ms (3.29% GC)
  maximum time:     62.77 ms (3.10% GC)

julia> @benchmark spmatspvec($A, $x, spmatspvec_alloc($A, $x))
BenchmarkTools.Trial:
  samples:          11
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  767.42 mb
  allocs estimate:  7
  minimum time:     428.40 ms (3.12% GC)
  median time:      460.44 ms (11.58% GC)
  mean time:        466.54 ms (12.11% GC)
  maximum time:     554.29 ms (23.98% GC)

But spmatspvec allows you to allocate the gather-scatter workspace ahead of time (and reuse it with reinitialization), so when performing a series of such sparse-sparse matvecs you can achieve substantially better performance on your benchmark:

julia> @benchmark spmatspvec($A, $x, $w)
BenchmarkTools.Trial:
  samples:          47
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  4.48 mb
  allocs estimate:  5
  minimum time:     101.96 ms (0.00% GC)
  median time:      106.70 ms (0.00% GC)
  mean time:        107.08 ms (0.35% GC)
  maximum time:     113.97 ms (0.00% GC)

Much closer to the specialized my_mul for runtime, and with allocation down a few orders of magnitude. With more information about your application, I might be able to offer more. Best!

(spmatspvec implementation and tests:

_checkshapecompat(A::SparseMatrixCSC, x::SparseVector) =
    size(A, 2) == length(x) || throw(DimensionMismatch(string("the number of ",
        "columns of the sparse matrix must match the sparse vector's length")))
_checkshapecompat(A::SparseMatrixCSC, sgw::Vector) =
    size(A, 1) <= length(sgw) || throw(DimensionMismatch(string("the scatter-gather ",
        "workspace vector's length must be at least the number of rows of the sparse matrix")))
"""
Given sparse matrix `A` and sparse vector `x` that you would like to multiply (or
surrogates with the shape of `A` and `x`), checks shape compatibility of `A` and `x`
and allocates an appropriately sized scatter-gather workspace.
""" 
spmatspvec_alloc(A::SparseMatrixCSC, x::SparseVector) = 
    (_checkshapecompat(A, x); zeros(promote_type(eltype(A), eltype(x)), size(A, 1)))
"""
Given sparse matrix `A` and sparse vector `x` that you would like to multiply (or
surrogates with the shape of `A` and `x`), and also a scatter-gather workspace `sgw`,
checks shape compatibility of `A`, `x`, and `sgw`, and then zeros `sgw`.
"""
function spmatspvec_init!(A::SparseMatrixCSC, x::SparseVector, sgw::Vector)
    _checkshapecompat(A, x)
    _checkshapecompat(A, sgw)
    fill!(sgw, 0)
end
"""
Returns `A*x` using scatter-gather workspace `sgw`. Assumes shape compatibility of
`A`, `x`, and `sgw`. Also assumes `sgw` begins zero'd, and returns with `sgw` zero'd.
"""
function spmatspvec(A::SparseMatrixCSC, x::SparseVector, sgw::Vector)
    # Calculate the result's entries, scattering into sgw
    resnnz::Int = 0
    @inbounds for xk in 1:nnz(x)
        xv = x.nzval[xk]
        iszero(xv) && continue
        xi = x.nzind[xk]
        for Ak in A.colptr[xi]:(A.colptr[xi + 1] - 1)
        # for Ak in nzrange(A, xi)
            Ai = A.rowval[Ak]
            Av = A.nzval[Ak]
            sgwv = sgw[Ai]
            iszero(sgwv) && (resnnz += 1)
            sgw[Ai] = sgwv + Av * xv
        end
    end
    # Gather the result's entries, zero'ing sgw
    resnzind = Vector{eltype(x.nzind)}(resnnz)
    resnzval = Vector{eltype(sgw)}(resnnz)
    resk::Int = 1
    # @inbounds for (sgwi, sgwv) in enumerate(sgw)
    @inbounds for sgwi in 1:length(sgw)
        sgwv = sgw[sgwi]
        if !iszero(sgwv)
            resnzind[resk] = sgwi
            resnzval[resk] = sgwv
            resk += 1
            sgw[sgwi] = 0
        end
    end
    return SparseVector(size(A, 1), resnzind, resnzval)
end

using Base.Test
let N = 100, M = 101, p = 0.1
    A = sprand(N, M, p)
    x = sprand(M, p)
    # test allocation
    @test_throws DimensionMismatch spmatspvec_alloc(spzeros(N, M), spzeros(N))
    @test isa(spmatspvec_alloc(spzeros(Float64, N, M), spzeros(Float32, M)), Vector{Float64})
    @test length(spmatspvec_alloc(spzeros(N, M), spzeros(M))) == N
    @test iszero(spmatspvec_alloc(A, x))
    # test initialization
    w = spmatspvec_alloc(A, x)
    @test_throws DimensionMismatch spmatspvec_init!(A, spzeros(N), w)
    @test_throws DimensionMismatch spmatspvec_init!(A, x, zeros(1))
    @test (rand!(w); spmatspvec_init!(A, x, w); iszero(w))
    # test kernel
    @test spmatspvec(A, x, w) == A*x
    @test iszero(w)
end
augustinyiptong commented 7 years ago

Hi, thanks for your response.

My application is pricing/managing/reporting reinsurance contracts. We essentially have a simulation with a very large number of events which can potentially take losses. My matrix has a row for each event and a column for each contract. In practice a relatively small number of events actually take losses. Happy to talk more but perhaps this is not the right forum.

The actual length of the sparse vector has not been an issue up to now because all the operations I have used scale with the nonzeros. I don't actually store the full vector anywhere.

Unfortunately A*x allocates memory to store the full vector, which cannot fit in memory. I can see the merit of the scatter-gather method, but it doesn't help in my 'pathological' case.

As a matter of principle, sparse operations should scale with nnz and never have to allocate the full vector/matrix, which something like my_mul achieves.

Pragmatically this might not always be possible or necessary especially if there are no use cases which would benefit.

I am happy to have this issue closed. Thanks again, Gus

StefanKarpinski commented 7 years ago

Couldn't this operation use a polyalgorithm that looks at the actual size/shape to decide the best approach? As long as the result type is predictable, it should be fine.
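
Such a polyalgorithm might look roughly like the sketch below, assuming names of my choosing; `pushmul` is a cleaned-up variant of `my_mul` from this thread, and the 10% dispatch threshold is a placeholder, not a tuned constant:

```julia
using SparseArrays

# Push-based kernel in the spirit of my_mul above: cost and allocation
# scale with the touched nonzeros, never with size(A, 1).
function pushmul(A::SparseMatrixCSC, x::SparseVector)
    m, n = size(A)
    length(x) == n || throw(DimensionMismatch())
    Tv = promote_type(eltype(A), eltype(x))
    ynzind, ynzval = Int[], Tv[]
    xi, xv = findnz(x)
    for (j, v) in zip(xi, xv)
        iszero(v) && continue
        for r in nzrange(A, j)
            push!(ynzind, rowvals(A)[r])
            push!(ynzval, nonzeros(A)[r] * v)
        end
    end
    return sparsevec(ynzind, ynzval, m)  # sparsevec sums duplicate indices
end

# Polyalgorithm: estimate the number of touched rows and pick a kernel.
# Falls back to the stdlib dense-workspace method for denser results.
function polymul(A::SparseMatrixCSC, x::SparseVector)
    expected_touches = nnz(x) / max(size(A, 2), 1) * nnz(A)
    return expected_touches < 0.1 * size(A, 1) ? pushmul(A, x) : A * x
end
```

Both branches return a `SparseVector`, so the result type stays predictable regardless of which kernel runs.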

augustinyiptong commented 7 years ago

Chewed on this a bit more, and here is a better solution for my specific use case: my A matrix stays the same, and I have to perform many multiplications with different x.

I pre calculate a condensed matrix Ac which has all the empty rows removed. The multiplication is done much faster with Ac and then the final sparse result is put together.

A = sprand(100000000,1000,.00001)
x = sprand(1000,  0.3)

@time y1=A*x
@time Ac, nzval, A_m = pre_mul(A)
@time y2=c_mul(Ac, nzval, A_m, x)

@assert y1.nzval==y2.nzval
@assert y1.nzind==y2.nzind

  0.493833 seconds (11 allocations: 778.221 MB, 17.29% gc time)
  0.692463 seconds (287 allocations: 123.926 MB, 19.02% gc time)
  0.019538 seconds (19 allocations: 15.207 MB)
function pre_mul(A::SparseMatrixCSC)
    nzval = sort(unique(A.rowval))
    row_dict = Dict{Int64,Int64}(zip(nzval,1:length(nzval)))
    Ac_r = zeros(Int64, length(A.rowval))
    for i in 1:length(A.rowval)
        Ac_r[i] = row_dict[A.rowval[i]]
    end

    Ac_m = length(nzval)
    Ac_n = A.n
    Ac_c = A.colptr
    Ac_nz = A.nzval
    Ac = SparseMatrixCSC(Ac_m, Ac_n, Ac_c, Ac_r, Ac_nz)    

    return Ac, nzval, A.m
end

function c_mul(Ac::SparseMatrixCSC, nzval::AbstractArray, A_m::Int64, x::AbstractSparseVector)
    T = promote_type(eltype(Ac), eltype(x))
    yc = Array{T}(Ac.m)
    A_mul_B!(yc, Ac, x)
    y = dropzeros!(SparseVector(A_m, deepcopy(nzval), yc))
end
Sacha0 commented 7 years ago

As a matter of principle, sparse operations should scale with nnz and never have to allocate the full vector / matrix.

<digression> A brief technical/historical digression :). Many operations on compressed column sparse matrices use at least O(M), O(N), or O(M + N) additional storage (for matrix dimensions M-by-N): The compressed column data structure itself contains a dense vector of length at least N + 1, so additional storage of O(N) does not impact asymptotic storage requirements (and likewise for O(M) and O(M + N) additional storage where M and N are of the same order). (The implicit assumption is that the stored matrix is not hypersparse, i.e. nnz exceeds N by an at least O(1) [and typically greater] factor, such that O(N) (with small constant factor) workspace is negligible. And M being the same order as N is often also assumed, though that assumption breaks down in some applications such as yours.) As such scatter-gather via a dense vector is a widespread technique and suitable for most compressed column sparse matrices; slightly more complex implementations avoid O(M)(/O(N)) scans, though in many cases an O(M)(/O(N)) scan is negligible due to necessary O(nnz) scans where nnz exceeds N as above. </digression>

A common approach to efficient sparse-sparse matvecs in cases like yours: For each nonzero value in x (x[j]), from the corresponding column in A generate a list of (rowind, resval) pairs (for nonzero resvals A[i,j]*x[j]). Then multi-way merge those lists into a sparse vector. (Lots of opportunity for parallelism.)
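
A rough sketch of that approach, with names of my choosing: each nonzero `x[j]` contributes the row-sorted pair list from column `j` of `A`, and a single sort over all pairs stands in for the multi-way merge of the per-column lists:

```julia
using SparseArrays

# List-then-merge sparse-sparse matvec sketch: no length-m workspace,
# cost scales with the number of generated (rowind, value) pairs.
function listmerge_mul(A::SparseMatrixCSC, x::SparseVector)
    m, n = size(A)
    length(x) == n || throw(DimensionMismatch())
    Tv = promote_type(eltype(A), eltype(x))
    pairs = Tuple{Int,Tv}[]
    xi, xv = findnz(x)
    for (j, v) in zip(xi, xv)
        iszero(v) && continue
        for r in nzrange(A, j)               # column j's row-sorted run
            push!(pairs, (rowvals(A)[r], nonzeros(A)[r] * v))
        end
    end
    sort!(pairs; by = first)                 # stands in for the k-way merge
    ynzind, ynzval = Int[], Tv[]
    for (i, v) in pairs                      # accumulate duplicate row indices
        if !isempty(ynzind) && ynzind[end] == i
            ynzval[end] += v
        else
            push!(ynzind, i)
            push!(ynzval, v)
        end
    end
    return SparseVector(m, ynzind, ynzval)
end
```

A real implementation would merge the per-column runs directly (e.g. with a heap), which is where the parallelism opportunity lies.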

The approach you take above to exploiting the additional information that A remains the same over a series of sparse-sparse matvecs is great! Below is a version of that approach that makes the implicit scatter-gather explicit and recycles the corresponding workspace. On your test case it achieves better than a factor of three reduction in allocation and a bit shy of a factor of two runtime reduction:

julia> using BenchmarkTools

julia> m, n, p = 10^8, 10^3, 10.0^(-5);

julia> A = sprand(m, n, p);

julia> x = sprand(n, 0.3);

julia> Ac, nzrows, Am = pre_mul(A);

julia> @benchmark y = c_mul($Ac, $nzrows, $Am, $x)
BenchmarkTools.Trial:
  samples:          381
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  15.19 mb
  allocs estimate:  7
  minimum time:     10.77 ms (0.00% GC)
  median time:      13.07 ms (13.23% GC)
  mean time:        13.14 ms (12.22% GC)
  maximum time:     17.13 ms (17.60% GC)

julia> aux = spmatspvec_init(A, promote_type(eltype(A), eltype(x)));

julia> @benchmark y = spmatspvec($A, $x, $aux)
BenchmarkTools.Trial:
  samples:          581
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  4.30 mb
  allocs estimate:  5
  minimum time:     7.05 ms (0.00% GC)
  median time:      8.10 ms (0.00% GC)
  mean time:        8.61 ms (5.09% GC)
  maximum time:     11.94 ms (23.38% GC)

You could avoid the remaining scatter-gather workspace scan by using e.g. a singly linked list to track which entries you populate in the scatter-gather workspace. Chances are that approach would only reduce runtime for significantly less populated x.
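
One way to sketch that idea (helper names are mine): log each workspace slot the first time it is written, then gather only the logged slots. The singly linked list mentioned above is one way to keep that record; a plain log vector, as here, is the simplest variant with the same asymptotics:

```julia
using SparseArrays

# Scatter-gather matvec that skips the O(size(A, 1)) gather scan by
# recording touched rows. sgw and mark are reusable, zero/false on entry.
function tracked_mul(A::SparseMatrixCSC, x::SparseVector,
                     sgw::Vector, mark::Vector{Bool})
    touched = Int[]
    xi, xv = findnz(x)
    for (j, v) in zip(xi, xv)
        iszero(v) && continue
        for r in nzrange(A, j)
            i = rowvals(A)[r]
            if !mark[i]                  # first touch: log it, reset the slot
                mark[i] = true
                push!(touched, i)
                sgw[i] = 0
            end
            sgw[i] += nonzeros(A)[r] * v
        end
    end
    sort!(touched)                       # SparseVector wants sorted indices
    ynzind, ynzval = Int[], eltype(sgw)[]
    for i in touched
        mark[i] = false                  # leave mark/sgw clean for reuse
        iszero(sgw[i]) || (push!(ynzind, i); push!(ynzval, sgw[i]))
    end
    return SparseVector(size(A, 1), ynzind, ynzval)
end
```

The explicit `mark` array also sidesteps counting pitfalls with values that cancel to zero, since first-touch detection no longer relies on the accumulated value being nonzero.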

using Base.SparseArrays: indtype

struct SpSpMatvecAux{Tv,Ti<:Integer}
    condrowval::Vector{Ti}
    nonzerorows::Vector{Ti}
    sgw::Vector{Tv}
end
function spmatspvec_init{reseltype}(A::SparseMatrixCSC, ::Type{reseltype})
    TiA = indtype(A)
    nzrows = sort!(unique(A.rowval[1:nnz(A)]))
    indexmap = Dict{TiA,TiA}(zip(nzrows, 1:length(nzrows)))
    condrowval = TiA[ indexmap[A.rowval[Ak]] for Ak in 1:nnz(A) ]
    sgworkspace = zeros(reseltype, length(nzrows))
    return SpSpMatvecAux(condrowval, nzrows, sgworkspace)
end
"""
Returns `A*x` using auxiliary `aux`. Assumes shape compatibility of
`A` and `x`, and assumes `aux` was generated for `A` and `x` and not
modified outside this routine.
"""
function spmatspvec(A::SparseMatrixCSC, x::SparseVector, aux::SpSpMatvecAux)
    # Calculate the result's entries, scattering into aux.sgw
    resnnz::Int = 0
    @inbounds for xk in 1:nnz(x)
        xv = x.nzval[xk]
        iszero(xv) && continue
        xi = x.nzind[xk]
        # for Ak in nzrange(A, xi)
        for Ak in A.colptr[xi]:(A.colptr[xi + 1] - 1)
            Ai = aux.condrowval[Ak]
            Av = A.nzval[Ak]
            sgwv = aux.sgw[Ai]
            iszero(sgwv) && (resnnz += 1)
            aux.sgw[Ai] = sgwv + Av * xv
        end
    end
    # Gather the result's entries, zero'ing aux.sgw, and mapping back to actual row indices
    resnzind = Vector{indtype(x)}(resnnz)
    resnzval = Vector{eltype(aux.sgw)}(resnnz)
    resk::Int = 1
    sgwi, sgwimax = 1, length(aux.sgw)
    @inbounds while sgwi <= sgwimax && resk <= resnnz
        sgwi += 1
        sgwv = aux.sgw[sgwi]
        if !iszero(sgwv)
            resnzind[resk] = aux.nonzerorows[sgwi]
            resnzval[resk] = sgwv
            resk += 1
            aux.sgw[sgwi] = 0
        end
    end
    return SparseVector(size(A, 1), resnzind, resnzval)
end

Best!

augustinyiptong commented 7 years ago

Hi @Sacha0

I think it should be: sgwi, sgwimax = 0, length(aux.sgw) ?

Sacha0 commented 7 years ago

Good catch! Looks like either: (1) sgwi should be initialized to zero and the condition in the while changed to sgwi < sgwimax; or (2) sgwi += 1 should move to the end of the while. Best!

augustinyiptong commented 7 years ago

Hi @Sacha0

iszero(sgwv) && (resnnz += 1) is not foolproof. Consider:

A = sparse(ones(1,3))
x = sparsevec([1.0, -1.0, 1.0])
aux = spmatspvec_init(A, Float64)
spmatspvec(A,x,aux)

gives

Sparse vector of length 1 with 2 Float64 nonzero entries:
  [1]  =  1.0
  [2147420144]  =  0.0

resnnz = countnz(aux.sgw) avoids this but is obviously slower.

Sacha0 commented 7 years ago

Also a good catch! There are a number of ways you can fix that issue without adding another sweep (countnz). For one, you could count the number of nonzero values in the process of the last (gathering) loop, and correct the lengths of resnzind and resnzval afterward. For two, you could use a more sophisticated check in the first (scattering) loop, e.g. replacing the iszero(sgwv) check and the following line with

aux.sgw[Ai] = newval = sgwv + Av * xv
if iszero(sgwv) && !iszero(newval)
    resnnz += 1
elseif !iszero(sgwv) && iszero(newval)
    resnnz -= 1
end

should do the trick. The second approach should be more efficient than the first in some cases. Best!

simonbyrne commented 7 years ago

Was this actually resolved? I got bitten by this today (on 0.6), and ended up hacking my own together (https://gist.github.com/simonbyrne/43a4041c76e18f79837b06dec459b820).

augustinyiptong commented 7 years ago

No official fix.

simonbyrne commented 7 years ago

Okay, in that case I think we should reopen it. Sounds like some sort of polyalgorithm is the way forward as @StefanKarpinski suggested.

oscardssmith commented 7 years ago

Would it make sense to have a "HyperSparse" matrix type (probably in a package) for situations like this?

simonbyrne commented 7 years ago

I’m not sure what you mean: care to expand?

StefanKarpinski commented 7 years ago

Hypersparse typically means sparse in both rows and columns, i.e. efficient even in the case when nnz << min(nrows, ncols). There are a few data structures that are typically used for this kind of thing. @ViralBShah would know more about it.
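
For illustration (not from this thread, and all names hypothetical), a minimal dictionary-of-keys container shows the storage profile a hypersparse format aims for: cost scales with the stored entries only, with no dense array over rows or columns:

```julia
# Dictionary-of-keys "hypersparse" matrix sketch: O(nnz) storage even
# when nnz << min(nrows, ncols).
struct DOKMatrix{Tv}
    m::Int
    n::Int
    data::Dict{Tuple{Int,Int},Tv}   # (row, col) => value
end

# y = A*x against a Dict-backed sparse vector: one pass over A's entries,
# never iterating over empty rows or columns.
function dokmul(A::DOKMatrix{Tv}, x::Dict{Int,Tv}) where {Tv}
    y = Dict{Int,Tv}()
    for ((i, j), v) in A.data
        xv = get(x, j, zero(Tv))
        iszero(xv) && continue
        y[i] = get(y, i, zero(Tv)) + v * xv
    end
    return y
end
```

Production hypersparse formats (e.g. doubly compressed CSC) are more elaborate, but the scaling argument is the same.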

simonbyrne commented 7 years ago

Ah thanks. Well that didn't apply in my case: I had nrows = ncols = O(1e6), and nnz = O(1e8)

ViralBShah commented 7 years ago

The sparse data structure we use (CSC) is a dense vector of sparse columns. It is good for linear algebra type stuff and the traditional finite element style problems, but not good for hypersparse problems that have less than 1 nonzero per row/column on average (if you use CSC for these, you are just spinning your wheels iterating through empty columns 90% of the time). You can do significantly better as discussed here.
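
The "dense vector of sparse columns" layout can be seen on a tiny example; this is just an illustration of the stdlib's `SparseMatrixCSC` fields, not new functionality:

```julia
using SparseArrays

# colptr has one entry per column (plus one), so column-wise traversal
# costs O(n) even when most columns are empty.
A = sparse([1, 3], [2, 2], [1.0, 2.0], 4, 5)  # 4x5, both entries in column 2
@assert A.colptr == [1, 1, 3, 3, 3, 3]        # dense over all 5 columns
for j in 1:size(A, 2)          # this loop always visits all n columns...
    for r in nzrange(A, j)     # ...but only nonempty columns do work here
        i, v = rowvals(A)[r], nonzeros(A)[r]
    end
end
```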

I think auto-detecting all this is a bit too magical. Perhaps a HyperSparse array type is the right way to go. Or perhaps implement some of this on the NDSparse data structure that @shashi and @JeffBezanson have in IndexedTables.

ViralBShah commented 7 years ago

Ah, seeing @simonbyrne's comment above, maybe we have general inefficiencies here. His case is certainly not hypersparse.

ViralBShah commented 5 years ago

Since we have a sparse vector now, we should stop converting this one to dense in the sparse-sparse matvec.

goggle commented 5 years ago

I have tried many things to address this issue and must conclude that the current solution (writing the results of the multiplication into a dense vector, then pruning it) is probably the best general-purpose solution.

The problem is that the number of nonzero values in the result of a multiplication x = A*v cannot be easily determined in advance. We only know that the resulting vector has at most m nonzero entries when the matrix A is of dimension (m, n).

The suggested solutions work well for certain cases where v has only a few nonzero entries compared to its length n, but they fail in more general cases like

A = sprandn(10_000, 10_000, 0.01)
v = sprand(10_000, 0.05)

Another idea was to adopt the algorithm for sparse matrix times sparse matrix multiplication. A quick benchmark (using the A and v from above and V = SparseMatrixCSC(v)) shows the following:

julia> @benchmark A*v
BenchmarkTools.Trial: 
  memory estimate:  234.64 KiB
  allocs estimate:  7
  --------------
  minimum time:     111.421 μs (0.00% GC)
  median time:      122.007 μs (0.00% GC)
  mean time:        134.550 μs (8.28% GC)
  maximum time:     35.534 ms (99.58% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark A*V
BenchmarkTools.Trial: 
  memory estimate:  180.98 KiB
  allocs estimate:  7
  --------------
  minimum time:     174.162 μs (0.00% GC)
  median time:      179.654 μs (0.00% GC)
  mean time:        190.914 μs (5.34% GC)
  maximum time:     35.640 ms (99.15% GC)
  --------------
  samples:          10000
  evals/sample:     1

So it consumes slightly less memory than the current solution, but it is also slower... I think speed should be preferred here. People who run into those edge cases can always come up with their individual solutions.

If someone knows an algorithm that is both fast and does not blow up the memory usage for general cases, I'm happy to try it!