dkarrasch opened this issue 10 months ago
An initial analysis below. First, I used PProf to profile the function `_spmatmul!` on v1.9.3 and on https://github.com/JuliaLang/julia/commit/85d7ccad2cc2154d9c3371283512eec33252cc40. Then I compared the Intel assembly obtained via `@code_native` (still underway).
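For reference, a minimal sketch of such a profiling setup (the matrix size and density here are assumptions for illustration, not the exact benchmark from this issue):

```julia
using SparseArrays, LinearAlgebra, Profile, PProf

# Assumed test problem; the issue's actual benchmark inputs may differ.
A = sprand(10_000, 10_000, 0.001)
x = rand(10_000)
b = zeros(10_000)

mul!(b, A, x)          # warm up, force compilation
Profile.clear()
@profile for _ in 1:10_000
    mul!(b, A, x)      # profile many iterations of the sparse mat-vec
end
pprof()                # serve the collected profile in the PProf web UI
```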
`nzrange` and Line 14 (the multiply) entail around the same complexity in both versions. `end` seems to take a lot more cycles in 85d7cca. The assembly from `@code_native` on my Intel PC, in Intel syntax, is attached for both cases: 1.9.3.asm.txt and 85d7cca.asm.txt.
`C[rv[j], k] += nzv[j]*αxj` seems to be the same in both cases (4 instructions, using SSE `xmmX` registers). 85d7cca attempts a 4x unroll of the inner loop in Line 14 above, but ends up causing a register spill. The cycles counted in Line 15 seem to come from the register spills and reloads. (I also see that some iterator code has changed, but I don't think that alone should cause such a large performance difference.) More later!
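The loop under discussion is the classic CSC mat-vec kernel. A simplified sketch in the spirit of `_spmatmul!` (not the exact SparseArrays.jl source; names mirror the discussion above):

```julia
using SparseArrays, LinearAlgebra

# Simplified CSC kernel: C = A*B*α + C*β, columns of B processed one at a time.
function spmatmul_sketch!(C, A::SparseMatrixCSC, B, α, β)
    rv  = rowvals(A)      # stored row indices
    nzv = nonzeros(A)     # stored values
    β != 1 && (β == 0 ? fill!(C, zero(eltype(C))) : rmul!(C, β))
    for k in 1:size(B, 2), col in 1:size(A, 2)
        αxj = B[col, k] * α
        for j in nzrange(A, col)         # the loop whose `end` check got costly
            C[rv[j], k] += nzv[j] * αxj  # the 4-instruction inner statement
        end
    end
    return C
end
```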
Could maybe be an LLVM upgrade where the vectorizer now does a worse job?
Comparing the LLVM IR, generated via `@code_llvm _spmatmul!(b, A, x, true, false)`, we have the following observations. The files are here: 1.9.3.llvm.txt and 85d7cca.llvm.txt.
Hence, at this point, my best guess is that the 4x unrolling in LLVM code causes register spills and reloads (2x of 64 bits each) when mapping to my Intel CPU (i7-1165G7), leading to the increased cycles observed in Line 15 of the above post.
-> Any suggestions to confirm this guess are welcome!
NB: It could be that the generated native code itself is suboptimal. But, based on my initial reading, the code structure, except for minor differences in code and the 4x unrolling, seems pretty similar between the two versions.
See also #52429
On M-series Macs, I see 1.1 μs on 1.9 and 1.10, and 1.25 μs on 1.11-rc1 and 1.12-dev. On x64, the gap is larger: 1.2 μs on 1.9 and 1.5 μs on 1.11-rc1 and 1.12-dev.
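A hedged sketch of how such timings can be reproduced with BenchmarkTools (the exact benchmark inputs from this thread are not shown, so the size here is an assumption):

```julia
using SparseArrays, LinearAlgebra, BenchmarkTools

# Assumed problem size; a diagonal matrix matches the benchmark shape
# discussed below, though it is noted to be unrepresentative.
n = 1_000
A = spdiagm(0 => rand(n))
x = rand(n)
b = zeros(n)

@btime mul!($b, $A, $x)   # interpolate with $ to avoid global-variable overhead
```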
I bisected this to dbd82a4dbab0582a345679eb83b2d99d40c0356a (https://github.com/JuliaLang/julia/pull/49747).
It's a bit funny, because the PR itself shows sparse matmul as having gotten the biggest improvement from this change. Also, the PR seems to only add optimization passes, so perhaps LLVM is being dumb and one of the passes makes things worse.
cc @gbaraldi, @pchintalapudi
Edit: I also noticed that the benchmark case uses a diagonal sparse matrix, which isn't very representative... Each `nzrange(A, col)` will thus have only a single element, so any optimization assuming there is more than one is wasted.
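To illustrate the point: a diagonal matrix stores exactly one nonzero per column, so a 4x-unrolled inner loop can never execute a full unrolled iteration. A random sparse matrix with several nonzeros per column would exercise that path (a sketch, not the issue's actual benchmark):

```julia
using SparseArrays

n = 1_000
Adiag = spdiagm(0 => rand(n))   # one stored entry per column
Arand = sprand(n, n, 10 / n)    # ~10 stored entries per column on average

# Every column of the diagonal matrix has a length-1 nzrange,
# so unrolling assumptions about longer ranges are wasted on it.
length(nzrange(Adiag, 1))
```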
Running this piece of code, I get
Multiplication dispatch has changed over the 1.9-to-1.10 transition, but this is nothing but the barebones multiplication code that we have had "ever since", so without character processing and all that. And since this is not calling high-level functions, the issue must be outside of SparseArrays.jl, AFAIU.
x-ref https://github.com/JuliaSparse/SparseArrays.jl/issues/469