
Regression in sparse-dense multiplication #52137

Open dkarrasch opened 10 months ago

dkarrasch commented 10 months ago

Running this piece of code, I get

using LinearAlgebra, SparseArrays, BenchmarkTools

function _spmatmul!(C, A, B, α, β)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch())
    size(A, 1) == size(C, 1) || throw(DimensionMismatch())
    size(B, 2) == size(C, 2) || throw(DimensionMismatch())
    nzv = nonzeros(A)  # structural nonzero values of A
    rv = rowvals(A)    # row indices of the structural nonzeros
    β != one(β) && LinearAlgebra._rmul_or_fill!(C, β)  # scale C by β (or zero-fill when β == 0)
    for k in 1:size(C, 2)
        @inbounds for col in 1:size(A, 2)
            αxj = B[col,k] * α
            for j in nzrange(A, col)
                C[rv[j], k] += nzv[j]*αxj
            end
        end
    end
    C
end

n = 1000;
A = sparse(Float64, I, n, n);
b = Vector{Float64}(undef, n);
x = ones(n);
@btime _spmatmul!($b, $A, $x, true, false);
# 1.703 μs (0 allocations: 0 bytes) on v1.9.3
# 2.161 μs (0 allocations: 0 bytes) on v1.10.0-rc1
# 2.945 μs (0 allocations: 0 bytes) on an 11-day-old master

Multiplication dispatch has changed over the 1.9-to-1.10 transition, but this is nothing but the bare-bones multiplication code that we have had "ever since", without the character processing and all that. And since this is not calling high-level functions, the issue must be outside of SparseArrays.jl, AFAIU.
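As a quick cross-check (not part of the original report), one could time the stdlib 5-arg mul! path next to the hand-written kernel on the same inputs; a minimal sketch, reusing the setup above (the sparse identity is built here via 1.0I):

```julia
using LinearAlgebra, SparseArrays, BenchmarkTools

# Same setup as in the report above (identity constructed via 1.0I here).
n = 1000
A = sparse(1.0I, n, n)
b = Vector{Float64}(undef, n)
x = ones(n)

# Library path (SparseArrays' own sparse-dense kernel) vs. the hand-written
# kernel from this issue; if only the latter regresses, the problem is more
# likely in codegen than in SparseArrays.jl dispatch.
@btime mul!($b, $A, $x, true, false);
@btime _spmatmul!($b, $A, $x, true, false);  # _spmatmul! as defined above
```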

x-ref https://github.com/JuliaSparse/SparseArrays.jl/issues/469

aravindh-krishnamoorthy commented 10 months ago

An initial analysis is below. First, I used PProf to profile the function _spmatmul! on v1.9.3 and on https://github.com/JuliaLang/julia/commit/85d7ccad2cc2154d9c3371283512eec33252cc40. Then I compared the Intel assembly obtained via @code_native (still underway).
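For reference, a profile like the ones below can be collected with something along these lines; a sketch, assuming _spmatmul! from the first post, with the repetition count and output file name chosen arbitrarily:

```julia
using LinearAlgebra, SparseArrays, Profile, PProf

# Setup as in the first post (_spmatmul! as defined there).
n = 1000
A = sparse(1.0I, n, n)
b = Vector{Float64}(undef, n)
x = ones(n)

Profile.clear()
# Repeat the kernel so the sampling profiler collects enough samples.
@profile for _ in 1:10^7
    _spmatmul!(b, A, x, true, false)
end

# Export the samples and open the pprof web UI; its "Source" view produces
# per-line listings like the ones shown below.
pprof(; out = "spmatmul.pb.gz")
```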

PProf

Details

v1.9.3:

```julia
_spmatmul!(::Vector{Float64}, ::SparseMatrixCSC{Float64, Int64}, ::Vector{Float64}, ::Bool, ::Bool)
/home/ark/exp/52137.jl

  Total:    52298    53841 (flat, cum) 99.86%
      0      1053     1053
      1         .        .   using LinearAlgebra, SparseArrays, BenchmarkTools
      2         .        .
      3        46       46   function _spmatmul!(C, A, B, α, β)
      4       115      115       size(A, 2) == size(B, 1) || throw(DimensionMismatch())
      5         7        7       size(A, 1) == size(C, 1) || throw(DimensionMismatch())
      6         .        .       size(B, 2) == size(C, 2) || throw(DimensionMismatch())
      7         .        .       nzv = nonzeros(A)
      8         .        .       rv = rowvals(A)
      9         .     1543       β != one(β) && LinearAlgebra._rmul_or_fill!(C, β)
     10         .        .       for k in 1:size(C, 2)
     11         .        .           @inbounds for col in 1:size(A, 2)
     12         .        .               αxj = B[col,k] * α
     13     10181    10181               for j in nzrange(A, col)
     14     32632    32632                   C[rv[j], k] += nzv[j]*αxj
     15      7388     7388               end
     16       853      853           end
     17         .        .       end
     18        23       23       C
     19         .        .   end
     20         .        .
```

85d7cca:

```julia
_spmatmul!(::Vector{Float64}, ::SparseMatrixCSC{Float64, Int64}, ::Vector{Float64}, ::Bool, ::Bool)
/home/ark/exp/52137.jl

  Total:    76441    77940 (flat, cum) 100%
      0      8239     8239
      1         .        .   using LinearAlgebra, SparseArrays, BenchmarkTools
      2         .        .
      3       135      135   function _spmatmul!(C, A, B, α, β)
      4        77       77       size(A, 2) == size(B, 1) || throw(DimensionMismatch())
      5         .        .       size(A, 1) == size(C, 1) || throw(DimensionMismatch())
      6         .        .       size(B, 2) == size(C, 2) || throw(DimensionMismatch())
      7         .        .       nzv = nonzeros(A)
      8         .        .       rv = rowvals(A)
      9        15     1514       β != one(β) && LinearAlgebra._rmul_or_fill!(C, β)
     10         .        .       for k in 1:size(C, 2)
     11         2        2           @inbounds for col in 1:size(A, 2)
     12         .        .               αxj = B[col,k] * α
     13     10453    10453               for j in nzrange(A, col)
     14     34117    34117                   C[rv[j], k] += nzv[j]*αxj
     15     22692    22692               end
     16       690      690           end
     17         .        .       end
     18        21       21       C
     19         .        .   end
     20         .        .
```

Observations

  1. Lines 13 (nzrange) and 14 (the multiply-accumulate) account for roughly the same number of samples in both versions.
  2. Line 15 (end) accounts for far more samples in 85d7cca.

Assembly

The @code_native assembly (Intel syntax, on my Intel PC) for both cases is attached: 1.9.3.asm.txt and 85d7cca.asm.txt
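Dumps of this kind can be produced with something like the sketch below; the output file name is just an example, and the signature is the one used in the benchmark above:

```julia
using InteractiveUtils, LinearAlgebra, SparseArrays

# Write the Intel-syntax native code for the benchmarked method instance to a
# text file so the output of different Julia versions can be diffed.
open("spmatmul_native.asm.txt", "w") do io
    code_native(io, _spmatmul!,
                Tuple{Vector{Float64}, SparseMatrixCSC{Float64, Int64},
                      Vector{Float64}, Bool, Bool};
                syntax = :intel, debuginfo = :none)
end
```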

Observations

  1. 85d7cca produces much longer assembly, apparently due to loop unrolling (?); the file is considerably longer than the one for v1.9.3.
  2. From the assembly files, the inner-loop body for C[rv[j], k] += nzv[j]*αxj looks the same in both cases (4 instructions, using SSE xmmX registers).
  3. My best guess as of today (after looking at the asm files) is that the new code in 85d7cca attempts a 4x unroll of the inner loop on line 14 above but ends up causing register spills. The cycles counted on line 15 seem to come from those spills and reloads. (I also see that some iterator code has changed, but I don't think that alone should cause such a large difference in performance.)

More later!

KristofferC commented 10 months ago

Could maybe be an LLVM upgrade where the vectorizer now does a worse job?

aravindh-krishnamoorthy commented 10 months ago

LLVM Code

Comparing the LLVM code generated via @code_llvm _spmatmul!(b, A, x, true, false), we have the following observations. The files are here: 1.9.3.llvm.txt and 85d7cca.llvm.txt.

  1. The LLVM code generated for https://github.com/JuliaLang/julia/commit/85d7ccad2cc2154d9c3371283512eec33252cc40 is longer than that for 1.9.3.
  2. The 4x loop unrolling observed in the native code above seems to already be present at the LLVM level (a quick way to quantify this is sketched below).
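One rough way to compare the vectorizer's/unroller's output across versions is to count the SIMD vector widths that appear in the optimized IR; a minimal sketch, assuming _spmatmul! and its argument types from the first post:

```julia
using InteractiveUtils, LinearAlgebra, SparseArrays

# Grab the optimized IR for the benchmarked method instance as a string and
# count SIMD vector types; differing counts between Julia versions hint at
# different vectorization/unrolling decisions.
ir = sprint() do io
    code_llvm(io, _spmatmul!,
              Tuple{Vector{Float64}, SparseMatrixCSC{Float64, Int64},
                    Vector{Float64}, Bool, Bool};
              debuginfo = :none)
end

for w in ("<2 x double>", "<4 x double>", "<8 x double>")
    println(w, " occurrences: ", count(w, ir))
end
```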

Hence, at this point, my best guess is that the 4x unrolling in the LLVM code causes register spills and reloads (2x of 64 bits each) when mapped onto my Intel CPU (i7-1165G7), leading to the increased cycles observed on line 15 in the post above.

-> Any suggestions to confirm this guess are welcome!
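One rough way to test the spill hypothesis would be to count stack-relative memory operands in the two attached dumps; a sketch using a simple textual heuristic for Intel-syntax x86-64, with the attachment file names from the previous post:

```julia
# Heuristic: register spills/reloads show up as moves to and from the stack
# frame, i.e. memory operands relative to rsp/rbp in Intel syntax.
stack_refs(path) = count(line -> occursin("[rsp", line) || occursin("[rbp", line),
                         eachline(path))

for f in ("1.9.3.asm.txt", "85d7cca.asm.txt")  # attachments from the previous post
    println(f, ": ", stack_refs(f), " stack-relative memory operands")
end
```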

NB: It could be that the generated native code itself is suboptimal. But, based on my initial reading, the code structure, apart from minor differences and the 4x unrolling, looks pretty similar between the two versions.

aravindh-krishnamoorthy commented 10 months ago

See also #52429

ViralBShah commented 4 months ago

On M-series Macs, I see 1.1 μs on 1.9 and 1.10, and 1.25 μs on 1.11-rc1 and 1.12-dev.

On x64, the gap is larger: 1.2 μs on 1.9 and 1.5 μs on 1.11-rc1 and 1.12-dev.

KristofferC commented 4 months ago

I bisected this to dbd82a4dbab0582a345679eb83b2d99d40c0356a (https://github.com/JuliaLang/julia/pull/49747).

It's a bit funny, because the benchmarks in that PR show sparse matmul getting the biggest improvement from the change. Also, the PR seems to only add optimization passes, so perhaps LLVM is being dumb and one of those passes makes things worse.

cc @gbaraldi, @pchintalapudi

Edit: I also noticed that the benchmark uses a diagonal sparse matrix, which isn't very representative: each nzrange(A, col) then has only a single element, so any optimization assuming there is more than one is wasted.
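A quick way to redo the measurement with a less degenerate matrix, so that nzrange(A, col) typically has several entries; a sketch, with the sprand density chosen arbitrarily and _spmatmul! as defined in the first post:

```julia
using LinearAlgebra, SparseArrays, BenchmarkTools

# Random sparse matrix with ~10 structural nonzeros per column on average,
# instead of the single diagonal entry used in the original benchmark.
n = 1000
A = sprand(n, n, 10 / n)
b = Vector{Float64}(undef, n)
x = ones(n)

@btime _spmatmul!($b, $A, $x, true, false);
```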