fast inplace broadcasting multiplication of SparseMatrixCSC and a Vector

oxinabox commented 6 years ago

I would like to know if this would be a good PR to the SparseArrays stdlib I can't workout what this operation is called. It does A.*b inplace for A::SparseMatrixCSC and a b::AbstractVector.

It seems loosely relevant to JuliaLang/julia#26561 (in that i was thinking about at that problem when I wrote it)

Inplace operations can avoid allocating memory so is faster.

using SparseArrays

function sparse_column_vecmul!(A::SparseMatrixCSC, x::AbstractVector{T}) where T
    size(A,2)==length(x) || DimensionMismatch()
    cols, rows, vals = findnz(A);

    x_ii=1
    x_val = @inbounds x[x_ii]
    rows_to_nan = Int64[]
    for A_ii in 1:length(rows)
        col= @inbounds cols[A_ii]
        row= @inbounds rows[A_ii]
        if row > x_ii #Note that our result is row sorted
            x_ii+=1
            x_val = @inbounds x[x_ii]
            if !isfinite(x_val) 
                # Got to deal with this later, row will become dense.
                push!(rows_to_nan, row)
            end
        end
        @inbounds vals[A_ii]*=x_val
    end

    # Go back and NaN any rows we have to
    for row in rows_to_nan
        for col in SparseArrays.nonzeroinds(@view(A[:,row]))
            # don't do the ones we already hit as they may be Inf (or NaN)
            @inbounds A[row,col] = T(NaN)
        end
    end

    A
end

Benchmarks

using BenchmarkTools
A = sprand(100,10,0.1)
x = rand(100)

@btime A.*x; 7.920 μs (17 allocations: 22.58 KiB)
@btime sparse_column_vecmul!(A, x) 1.044 μs (4 allocations: 2.47 KiB)

Not a perfectly fair comparison as A was being mutated but i doubt that changed the timing.

over 7x speedup is not to be sneered at given how big sparse matrixes become.

mbauman commented 6 years ago

I'm not sure about what to call this function, but could we possibly do a pigeon-hole sort of optimization within the broadcast! definition itself when we detect this case?

bcsj commented 2 years ago

I guess I found a similar bottleneck today. But I think the improved computation can potentially be done in much fewer lines of code. https://discourse.julialang.org/t/speeding-up-elementwise-vector-sparsematrixcsc-multiplication-broadcasting/79437

stevengj commented 2 years ago

I can't workout what this operation is called.

mul!(A, Diagonal(b), A)

?

JuliaSparse / SparseArrays.jl

fast inplace broadcasting multiplication of SparseMatrixCSC and a Vector #47

Benchmarks