Notice that three-argument triangular matrix multiplication is not a BLAS operation. In BLAS this is an in-place operation, see https://www.netlib.org/lapack/explore-html/d1/d54/group__double__blas__level3_gaf07edfbb2d2077687522652c9e283e1e.html#gaf07edfbb2d2077687522652c9e283e1e. I'm not in favor of adding extra dispatch versions of the `!` methods to make the three-argument version call out to BLAS. IMO, if you want to optimize by using the `!` methods, then you should consider the structure of the matrices involved. Otherwise, you'll end up with unnecessary copies. In contrast, the non-mutating methods like `*` should be as generic as possible.

I'm not sure why the generic fallback allocates, though. That shouldn't be necessary, so it would be good to figure out why that happens.
So, to do `mul!(dest, A, UpperTriangular(B))` efficiently, I have to do something like

```julia
copyto!(dest, A)                 # copy A into the preallocated output
rmul!(dest, UpperTriangular(B))  # right-multiply in place by the triangular factor
```

am I getting that right? And, why not create a `mul!` method to do this?
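For concreteness, the method being asked about could look roughly like the following sketch (the name `mul_tri!` is hypothetical, not a `LinearAlgebra` method; note that it hides a copy behind a `mul!`-style signature, which is exactly the objection raised below):

```julia
using LinearAlgebra

# Hypothetical helper: fill dest with A, then multiply in place, so the
# triangular product ends up in BLAS trmm for strided BlasFloat matrices.
function mul_tri!(dest::StridedMatrix, A::StridedMatrix, B::UpperTriangular)
    copyto!(dest, A)
    rmul!(dest, B)
    return dest
end
```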
> I'm not in favor of adding extra dispatch versions of the `!` methods to make the three-argument version call out to BLAS. IMO, if you want to optimize by using the `!` methods then you should consider the structure of the matrices involved. Otherwise, you'll end up with unnecessary copies. In contrast, the non-mutating methods like `*` should be as generic as possible.
I don’t really understand this argument. In the case `mul!(full1, triangular, full2)` it already calls out to BLAS, so why not do it for the case `mul!(full1, full2, triangular)` as well? They would both call the same BLAS method (the one you linked), since it can handle either the first or the second matrix as triangular (it updates the other one). If you’re not in favor of adding dispatch for the latter case, are you also in favor of removing it for the former?

I think it is really unintuitive that `A * B` uses a fast method, but when one wants to preallocate the output, `mul!(dest, A, B)` is 30 times slower.
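For reference, both dispatches would bottom out in the same `trmm!` wrapper, differing only in the `side` argument. A small sketch with `Float64` matrices (here `T` holds the triangular factor, and the general matrix is overwritten, as in the BLAS routine linked above):

```julia
using LinearAlgebra

A = rand(4, 4)
T = triu(rand(4, 4))

# mul!(full1, triangular, full2): left-sided trmm
C1 = copy(A)
BLAS.trmm!('L', 'U', 'N', 'N', 1.0, T, C1)  # C1 = T * A

# mul!(full1, full2, triangular): right-sided trmm, same routine
C2 = copy(A)
BLAS.trmm!('R', 'U', 'N', 'N', 1.0, T, C2)  # C2 = A * T
```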
> And, why not create a `mul!` method to do this?
Because you are using the `!` methods to optimize your code by avoiding unnecessary allocations and copies, so it's not a good idea to introduce a hidden copy. You'd think that the operation is out-of-place when you see the signature.
> If you’re not in favor of adding dispatch for the latter case, are you also in favor of removing it for the former?
Yes, but it's probably not worth the effort. My main point is that you shouldn't use the three-argument triangular matrix multiplication if you are trying to write optimized code.
I still think `mul!` should have an optimised method here, because it should at least not be slower than using `mul!` on two full matrices. And yes, in some cases users may accidentally copy a result when they could have used an in-place `rmul!` or `lmul!`, but that extra copy is certainly better than the 100x slowdown users currently get, and it's still faster than using `mul!` on two full matrices. If the use case permits it, users can choose to switch to `lmul!` or `rmul!`. My case doesn't permit it, and I don't see why we should force users to essentially write their own version of the `mul!` method when their use case doesn't allow for using the in-place `rmul!` or `lmul!`.

@andreasnoack I think your argument is mainly an argument for adding a line to the docs that says users may want to consider using `rmul!` or `lmul!` instead of `mul!` if they want to avoid an extra copy in some cases.
> I'm not sure why the generic fallback allocates, though. That shouldn't be necessary so it would be good to figure out why that happens.
It uses a tiled algorithm. The allocations reported above are on Julia 1.6, where it uses global arrays and `unsafe_wrap`, but still allocates. On master it allocates differently, due to #42309, which explicitly creates new `Array`s for the tiles.
```julia
julia> @btime mul!($dest, $A, UpperTriangular($B));
  894.954 ms (3 allocations: 30.75 KiB)
```
I may be missing something, but given the methods we already have, and given that we even fully materialize transposes/adjoints here, I feel like the missing methods are more of an oversight than deliberately missing? Of course, fixing the issue requires adding what feels like a dozen new `mul!` methods, but OTOH the performance penalty is rather hefty.
I just tried to refresh my memory on this topic, and it looks like I agree with 2015 Andreas, see https://github.com/JuliaLang/julia/issues/10019. That issue was closed without introducing new methods. Then later that year https://github.com/JuliaLang/julia/pull/12812 was opened, but it wasn't merged. Two years later I merged it without leaving a comment, and I don't remember why I did so. Afterwards, we then had to add all the other methods to avoid ambiguities.

So it looks like it might have been a bit random that these methods were introduced in the first place. I don't like them and believe that people would be better off without them, but given the current state, if somebody makes a PR with these methods and associated tests, then I won't try to filibuster it (even though it adds extra compilation to an already very compilation-heavy and therefore very slow test file).
Noticed a similar (?) issue with `mul!(::Matrix, ::HermOrSym, ::HermOrSym)`, so decided to test this more generally:

```julia
using LinearAlgebra, BenchmarkTools
using Base.Iterators: product

A = rand(200, 200)
B = rand(200, 200)
dest = similar(A)

types = [identity, Symmetric, Hermitian, UpperTriangular, UnitUpperTriangular,
         LowerTriangular, UnitLowerTriangular, UpperHessenberg,
         Tridiagonal, SymTridiagonal ∘ Symmetric,
         a -> Bidiagonal(diag(a), diag(a, 1), :U), Diagonal]

for (Ta, Tb) in product(types, types)
    _A = Ta(A)
    _B = Tb(B)
    t1 = time(@benchmark $_A * $_B seconds=0.1)
    tmul = time(@benchmark mul!($dest, $_A, $_B) seconds=0.1)
    if !isapprox(t1, tmul, rtol=0.5)
        arg_types = nameof.(typeof.([_A, _B]))
        @show arg_types, ratio(t1, tmul)
    end
end
```
On my M1 Mac with Julia 1.9rc2 this yields a bunch of type combinations where `mul!` is significantly slower than `*`:
```
([:Symmetric, :Symmetric], 0.15212131077987395)
([:Hermitian, :Symmetric], 0.14867050011769367)
([:Symmetric, :Hermitian], 0.1501201330590954)
([:Hermitian, :Hermitian], 0.15710876467798882)
([:Array, :UpperTriangular], 0.0783870960007864)
([:Symmetric, :UpperTriangular], 0.08869831751722534)
([:Hermitian, :UpperTriangular], 0.08843682327204656)
([:UpperHessenberg, :UpperTriangular], 0.08703282707024972)
([:Diagonal, :UpperTriangular], 0.3072659809309098)
([:Array, :UnitUpperTriangular], 0.0801960959640439)
([:Symmetric, :UnitUpperTriangular], 0.08932123686354122)
([:Hermitian, :UnitUpperTriangular], 0.08990729018466966)
([:UpperHessenberg, :UnitUpperTriangular], 0.08850920305059527)
([:Diagonal, :UnitUpperTriangular], 0.35110059171597635)
([:Array, :LowerTriangular], 0.07920584733372955)
([:Symmetric, :LowerTriangular], 0.08806820049841861)
([:Hermitian, :LowerTriangular], 0.08689892949913336)
([:UpperHessenberg, :LowerTriangular], 0.08785087978687132)
([:Diagonal, :LowerTriangular], 0.2959498808788494)
([:Array, :UnitLowerTriangular], 0.07715129397475415)
([:Symmetric, :UnitLowerTriangular], 0.09019522967933709)
([:Hermitian, :UnitLowerTriangular], 0.08938477232681072)
([:UpperHessenberg, :UnitLowerTriangular], 0.08934690586575288)
([:Diagonal, :UnitLowerTriangular], 0.34585527043905834)
([:Diagonal, :UpperHessenberg], 0.3220858895705521)
([:Diagonal, :Tridiagonal], 0.1578947764617085)
([:Diagonal, :SymTridiagonal], 0.20002717522004712)
([:UpperTriangular, :Bidiagonal], 0.30357989568515886)
([:UnitUpperTriangular, :Bidiagonal], 0.2987819866247383)
([:LowerTriangular, :Bidiagonal], 0.2744375206094478)
([:UnitLowerTriangular, :Bidiagonal], 0.27931542328982784)
([:Diagonal, :Bidiagonal], 0.1732365783162167)
([:UpperTriangular, :Diagonal], 0.024214568979777973)
([:UnitUpperTriangular, :Diagonal], 0.03201487204047439)
([:LowerTriangular, :Diagonal], 0.026398215281650865)
([:UnitLowerTriangular, :Diagonal], 0.029608938547486034)
([:UpperHessenberg, :Diagonal], 0.32266666666666666)
([:Tridiagonal, :Diagonal], 0.19409417260880255)
([:SymTridiagonal, :Diagonal], 0.19127713498622592)
([:Bidiagonal, :Diagonal], 0.1897607790860655)
([:Diagonal, :Diagonal], 0.028365610288337475)
```
But there are also some cases where `mul!` is faster; this could just be because `mul!` avoids allocations that `*` incurs, of course…
```
([:Tridiagonal, :UpperTriangular], 6.484450754024589)
([:SymTridiagonal, :UpperTriangular], 6.755345806882726)
([:Tridiagonal, :UnitUpperTriangular], 5.848837490624924)
([:SymTridiagonal, :UnitUpperTriangular], 5.85030633412128)
([:Tridiagonal, :UnitLowerTriangular], 5.854310386415167)
([:SymTridiagonal, :UnitLowerTriangular], 5.8304607385296805)
([:Tridiagonal, :LowerTriangular], 5.472315492957747)
([:SymTridiagonal, :LowerTriangular], 6.163230320159385)
([:SymTridiagonal, :Tridiagonal], 2.412737178282923)
([:Tridiagonal, :Tridiagonal], 2.2284411276948592)
([:Tridiagonal, :Tridiagonal], 2.482101850601108)
([:SymTridiagonal, :Bidiagonal], 2.6301575735495035)
([:Bidiagonal, :Bidiagonal], 2.3222368960335174)
([:Tridiagonal, :Tridiagonal], 2.4564705882352937)
([:Tridiagonal, :SymTridiagonal], 2.210109392682007)
([:SymTridiagonal, :SymTridiagonal], 2.437875751503006)
([:Bidiagonal, :SymTridiagonal], 2.491171597633136)
```
> Noticed a similar (?) issue with `mul!(::Matrix, ::HermOrSym, ::HermOrSym)`
I think the main reason for this is that `*` is allowed to simply allocate new plain arrays with easy memory access, as in

```julia
*(A::HermOrSym, B::HermOrSym) = A * copyto!(similar(parent(B)), B)
```

and subsequently use specialized BLAS routines. In contrast, `mul!(C, ::Hermitian, ::Hermitian)` falls back to `generic_matmatmul!`. The point is that the promise of `mul!` is that it operates in a memory-aware fashion and doesn't create copies of the factors. So what we do is add `mul!` methods whenever we realize a more efficient way of doing the multiplication without creating copies.
Another aspect is that you measure computation times with output `dest::Matrix`. In some structured cases, this means that `mul!` needs to overwrite many entries with zeros, whereas `*` may be able to create a specific output type that keeps all the zeros "implicit", like `Diagonal` etc. Think of `*(::Diagonal, ::Diagonal)` compared to `mul!(::Matrix, ::Diagonal, ::Diagonal)` (and compared to `mul!(::Diagonal, ::Diagonal, ::Diagonal)`!).
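A quick way to see the difference in work: when the destination keeps the zeros implicit, only the diagonal is touched; with a dense destination, all n² entries must be written. A small self-contained example:

```julia
using LinearAlgebra

D1 = Diagonal(rand(1000))
D2 = Diagonal(rand(1000))

D1 * D2                     # returns a Diagonal: only n entries computed
dest = Matrix{Float64}(undef, 1000, 1000)
mul!(dest, D1, D2)          # must fill all n^2 entries, most of them zeros
mul!(similar(D1), D1, D2)   # Diagonal destination: back to O(n) work
```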
Thanks for your reply! I agree with your arguments, and many cases already seem to be addressed by your PR.

It seems logical that a user should, e.g., allocate a `Triangular` matrix if they multiply `mul!(::Triangular, ::Diagonal, ::[Unit]Triangular)`. Though, it's surprising that `dest::StridedMatrix` is 40 times slower…

I'm just not happy with `mul!(::StridedMatrix, ::HermOrSym, ::HermOrSym)` (and its 5-arg counterpart).
Wanted to propose something like:

```julia
mul!(C::StridedMatrix{T}, A::HermOrSym{T,<:StridedMatrix},
     B::HermOrSym{T,<:StridedMatrix}) where {T<:BlasFloat} = mul!(C, copyto!(A.data, A), B)
```

I guess it's not ideal to mutate `A`… But as a user, I'd be less surprised to get a symmetrized `A.data` compared to a 7-fold increase in runtime? Would do a corresponding PR, if you think that makes sense.
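An alternative that avoids mutating `A` would be to materialize `B` into a dense temporary and call the symmetric BLAS kernel directly. A sketch for the real symmetric case (the name `symm_mul!` is hypothetical; `BLAS.symm!` only reads the triangle of `A.data` selected by `A.uplo`, so the unsymmetrized parent can be passed as-is):

```julia
using LinearAlgebra
using LinearAlgebra: BlasFloat

# Hypothetical copy-based 3-arg method: C = A * B with one dense temporary.
function symm_mul!(C::StridedMatrix{T}, A::Symmetric{T,<:StridedMatrix},
                   B::Symmetric{T,<:StridedMatrix}) where {T<:BlasFloat}
    Bfull = copyto!(similar(C), B)  # densify B once
    BLAS.symm!('L', A.uplo, one(T), A.data, Bfull, zero(T), C)
    return C
end
```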
See https://discourse.julialang.org/t/mul-matrix-matrix-uppertriangular-is-really-slow/69368. CC @Luapulu

`mul!` doesn’t use BLAS when the second matrix is an `AbstractTriangular`. However, the `*` methods all use BLAS.
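A minimal reproduction along the lines of the linked Discourse thread (exact timings are machine-specific; this only contrasts the two call paths as they behaved at the time of this issue):

```julia
using LinearAlgebra, BenchmarkTools

A = rand(500, 500)
B = rand(500, 500)
dest = similar(A)

@btime $A * UpperTriangular($B)              # copies, then hits BLAS trmm
@btime mul!($dest, $A, UpperTriangular($B))  # generic fallback: much slower here
```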