JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

`mul!(::Matrix{T}, ::Matrix{T}, ::AbstractTriangular{T, <:Matrix{T}}) where T<:BlasFloat` does not use BLAS #42550

Closed sostock closed 1 year ago

sostock commented 3 years ago

See https://discourse.julialang.org/t/mul-matrix-matrix-uppertriangular-is-really-slow/69368. CC @Luapulu

mul! doesn’t use BLAS when the second matrix is an AbstractTriangular:

julia> A = rand(1000, 1000); B = rand(1000, 1000); dest = similar(A);

julia> @btime mul!($dest, $A, $B); # BLAS (fast)
  50.609 ms (0 allocations: 0 bytes)

julia> @btime mul!($dest, UpperTriangular($A), $B); # BLAS (fast)
  32.929 ms (0 allocations: 0 bytes)

julia> @btime mul!($dest, $A, UpperTriangular($B)); # generic (slow)
  914.817 ms (6 allocations: 336 bytes)

However, the * methods all use BLAS:

julia> @btime $A * $B; # BLAS (fast)
  50.560 ms (2 allocations: 7.63 MiB)

julia> @btime UpperTriangular($A) * $B; # BLAS (fast)
  33.183 ms (2 allocations: 7.63 MiB)

julia> @btime $A * UpperTriangular($B); # BLAS (fast)
  27.190 ms (2 allocations: 7.63 MiB)
andreasnoack commented 3 years ago

Notice that three-argument triangular matrix multiplication is not a BLAS operation; in BLAS, it is an in-place operation, see https://www.netlib.org/lapack/explore-html/d1/d54/group__double__blas__level3_gaf07edfbb2d2077687522652c9e283e1e.html#gaf07edfbb2d2077687522652c9e283e1e. I'm not in favor of adding extra dispatch versions of the ! methods to make the three-argument version call out to BLAS. IMO, if you want to optimize by using the ! methods, then you should consider the structure of the matrices involved. Otherwise, you'll end up with unnecessary copies. In contrast, the non-mutating methods like * should be as generic as possible.

I'm not sure why the generic fallback allocates, though. That shouldn't be necessary so it would be good to figure out why that happens.

Luapulu commented 3 years ago

So, to do mul!(dest, A, UpperTriangular(B)) efficiently, I have to do something like

copyto!(dest, A)
rmul!(dest, UpperTriangular(B))

am I getting that right?

And, why not create a mul! method to do this?
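For reference, the two-step pattern above can be checked against the out-of-place result (using small fresh matrices rather than the 1000×1000 ones from the benchmark); a minimal sketch:

```julia
using LinearAlgebra

A = rand(100, 100); B = rand(100, 100)
dest = similar(A)

copyto!(dest, A)                  # dest .= A
rmul!(dest, UpperTriangular(B))   # dest = dest * UpperTriangular(B), in place

@assert dest ≈ A * UpperTriangular(B)
```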

sostock commented 3 years ago

I'm not in favor of adding extra dispatch versions of the ! methods to make the three-argument version call out to BLAS. IMO, if you want to optimize by using the ! methods, then you should consider the structure of the matrices involved. Otherwise, you'll end up with unnecessary copies. In contrast, the non-mutating methods like * should be as generic as possible.

I don’t really understand this argument. The case mul!(full1, triangular, full2) already calls out to BLAS, so why not do it for the case mul!(full1, full2, triangular) as well? They would both call the same BLAS method (the one you linked), since it can handle either the first or the second matrix as triangular (it updates the other one). If you’re not in favor of adding dispatch for the latter case, are you also in favor of removing it for the former?

I think it is really unintuitive that A * B uses a fast method, but when one wants to preallocate the output, mul!(dest, A, B) is 30 times slower.
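Indeed, both orientations map onto the same BLAS routine (?trmm, linked above); a sketch using the LinearAlgebra.BLAS.trmm wrapper, where the triangular argument always comes first and the side character selects which operand it multiplies:

```julia
using LinearAlgebra
using LinearAlgebra.BLAS: trmm

A = rand(4, 4); B = rand(4, 4)

# side = 'L': the triangular factor is on the left, i.e. UpperTriangular(A) * B
left = trmm('L', 'U', 'N', 'N', 1.0, A, B)

# side = 'R': the triangular factor is on the right, i.e. A * UpperTriangular(B)
right = trmm('R', 'U', 'N', 'N', 1.0, B, A)

@assert left ≈ UpperTriangular(A) * B
@assert right ≈ A * UpperTriangular(B)
```

The non-mutating trmm wrapper copies its second matrix argument internally; the in-place trmm! overwrites it, which is exactly the copy-vs-in-place trade-off being discussed.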

andreasnoack commented 3 years ago

And, why not create a mul! method to do this?

Because you are using the ! methods to optimize your code by avoiding unnecessary allocations and copies, so it's not a good idea to introduce a hidden copy. You'd think that the operation is out-of-place when you see the signature.

If you’re not in favor of adding dispatch for the latter case, are you also in favor of removing it for the former?

Yes but it's probably not worth the effort.

My main point is that you shouldn't use the three argument triangular matrix multiplication if you are trying to write optimized code.

Luapulu commented 3 years ago

I still think mul! should have an optimised method here, because it should at least not be slower than using mul! on two full matrices. And yes, in some cases users may accidentally copy a result when they could have used an in-place rmul! or lmul!, but that extra copy is certainly better than the 100x slowdown users currently get, and it's still faster than using mul! on two full matrices. If the use case permits it, users can choose to switch to lmul! or rmul!. My case doesn't permit it, and I don't see why we should force users to essentially write their own version of the mul! method when their use case doesn't allow for the in-place rmul! or lmul!.

@andreasnoack I think your argument is mainly an argument for adding a line to the docs that says users may want to consider using rmul! or lmul! instead of mul! if they want to avoid an extra copy in some cases.

sostock commented 3 years ago

I'm not sure why the generic fallback allocates, though. That shouldn't be necessary so it would be good to figure out why that happens.

It uses a tiled algorithm. The allocations reported above are on Julia 1.6, where it uses global arrays and unsafe_wrap, but still allocates. On master it allocates differently, due to #42309, which explicitly creates new Arrays for the tiles.

julia> @btime mul!($dest, $A, UpperTriangular($B));
  894.954 ms (3 allocations: 30.75 KiB)
dkarrasch commented 3 years ago

I may be missing something, but given that we have

https://github.com/JuliaLang/julia/blob/146de38b4004e3fbd4cacc1e502fd650473cb67f/stdlib/LinearAlgebra/src/triangular.jl#L677-L680

and

https://github.com/JuliaLang/julia/blob/146de38b4004e3fbd4cacc1e502fd650473cb67f/stdlib/LinearAlgebra/src/triangular.jl#L668-L675

and that we even fully materialize transposes/adjoints here

https://github.com/JuliaLang/julia/blob/146de38b4004e3fbd4cacc1e502fd650473cb67f/stdlib/LinearAlgebra/src/triangular.jl#L682-L685

I feel like the missing methods are more of an oversight rather than deliberately missing? Of course, fixing the issue requires adding what feels like a dozen new mul! methods, but OTOH the performance penalty is rather hefty.

andreasnoack commented 3 years ago

I just tried to refresh my memory on this topic and it looks like I agree with 2015 Andreas, see https://github.com/JuliaLang/julia/issues/10019. That issue was closed without introducing new methods. Then later that year https://github.com/JuliaLang/julia/pull/12812 was opened, but it wasn't merged. Two years later I merged it without leaving a comment, and I don't remember why I did so. Afterwards, we had to add all the other methods to avoid ambiguities.

So it looks like it might have been a bit random that these methods were introduced in the first place. I don't like them and believe that people would be better off without them, but given the current state, if somebody makes a PR with these methods and associated tests then I won't try to filibuster it (even though it adds extra compilation to an already compilation-heavy and therefore very slow test file).

leostenzel commented 1 year ago

Noticed a similar (?) issue with mul!(::Matrix, ::HermOrSym, ::HermOrSym), so I decided to test this more generally:

using LinearAlgebra, BenchmarkTools
using Base.Iterators: product
A = rand(200, 200)
B = rand(200, 200) 
dest = similar(A)

types = [identity, Symmetric, Hermitian, UpperTriangular, UnitUpperTriangular,
  LowerTriangular, UnitLowerTriangular, UpperHessenberg, 
  Tridiagonal, SymTridiagonal ∘ Symmetric, 
  a->Bidiagonal(diag(a), diag(a, 1), :U), Diagonal]

for (Ta, Tb) in product(types, types)
  _A = Ta(A)
  _B = Tb(B)
  t1 = time(@benchmark $_A * $_B seconds=0.1)
  tmul = time(@benchmark mul!($dest, $_A, $_B) seconds=0.1)
  if !isapprox(t1, tmul, rtol=0.5)
    arg_types = nameof.(typeof.([_A, _B]))
    @show arg_types, ratio(t1, tmul)
  end
end

On my M1 Mac with Julia 1.9rc2 this yields a bunch of type combinations where mul! is significantly slower than *:

([:Symmetric, :Symmetric], 0.15212131077987395)
([:Hermitian, :Symmetric], 0.14867050011769367)
([:Symmetric, :Hermitian], 0.1501201330590954)
([:Hermitian, :Hermitian], 0.15710876467798882)
([:Array, :UpperTriangular], 0.0783870960007864)
([:Symmetric, :UpperTriangular], 0.08869831751722534)
([:Hermitian, :UpperTriangular], 0.08843682327204656)
([:UpperHessenberg, :UpperTriangular], 0.08703282707024972)
([:Diagonal, :UpperTriangular], 0.3072659809309098)
([:Array, :UnitUpperTriangular], 0.0801960959640439)
([:Symmetric, :UnitUpperTriangular], 0.08932123686354122)
([:Hermitian, :UnitUpperTriangular], 0.08990729018466966)
([:UpperHessenberg, :UnitUpperTriangular], 0.08850920305059527)
([:Diagonal, :UnitUpperTriangular], 0.35110059171597635)
([:Array, :LowerTriangular], 0.07920584733372955)
([:Symmetric, :LowerTriangular], 0.08806820049841861)
([:Hermitian, :LowerTriangular], 0.08689892949913336)
([:UpperHessenberg, :LowerTriangular], 0.08785087978687132)
([:Diagonal, :LowerTriangular], 0.2959498808788494)
([:Array, :UnitLowerTriangular], 0.07715129397475415)
([:Symmetric, :UnitLowerTriangular], 0.09019522967933709)
([:Hermitian, :UnitLowerTriangular], 0.08938477232681072)
([:UpperHessenberg, :UnitLowerTriangular], 0.08934690586575288)
([:Diagonal, :UnitLowerTriangular], 0.34585527043905834)
([:Diagonal, :UpperHessenberg], 0.3220858895705521)
([:Diagonal, :Tridiagonal], 0.1578947764617085)
([:Diagonal, :SymTridiagonal], 0.20002717522004712)
([:UpperTriangular, :Bidiagonal], 0.30357989568515886)
([:UnitUpperTriangular, :Bidiagonal], 0.2987819866247383)
([:LowerTriangular, :Bidiagonal], 0.2744375206094478)
([:UnitLowerTriangular, :Bidiagonal], 0.27931542328982784)
([:Diagonal, :Bidiagonal], 0.1732365783162167)
([:UpperTriangular, :Diagonal], 0.024214568979777973)
([:UnitUpperTriangular, :Diagonal], 0.03201487204047439)
([:LowerTriangular, :Diagonal], 0.026398215281650865)
([:UnitLowerTriangular, :Diagonal], 0.029608938547486034)
([:UpperHessenberg, :Diagonal], 0.32266666666666666)
([:Tridiagonal, :Diagonal], 0.19409417260880255)
([:SymTridiagonal, :Diagonal], 0.19127713498622592)
([:Bidiagonal, :Diagonal], 0.1897607790860655)
([:Diagonal, :Diagonal], 0.028365610288337475)

But there are also some cases where mul! is faster; this could simply be because mul! avoids the allocations that * incurs, of course…

([:Tridiagonal, :UpperTriangular], 6.484450754024589)
([:SymTridiagonal, :UpperTriangular], 6.755345806882726)
([:Tridiagonal, :UnitUpperTriangular], 5.848837490624924)
([:SymTridiagonal, :UnitUpperTriangular], 5.85030633412128)
([:Tridiagonal, :UnitLowerTriangular], 5.854310386415167)
([:SymTridiagonal, :UnitLowerTriangular], 5.8304607385296805)
([:Tridiagonal, :LowerTriangular], 5.472315492957747)
([:SymTridiagonal, :LowerTriangular], 6.163230320159385)
([:SymTridiagonal, :Tridiagonal], 2.412737178282923)
([:Tridiagonal, :Tridiagonal], 2.2284411276948592)
([:Tridiagonal, :Tridiagonal], 2.482101850601108)
([:SymTridiagonal, :Bidiagonal], 2.6301575735495035)
([:Bidiagonal, :Bidiagonal], 2.3222368960335174)
([:Tridiagonal, :Tridiagonal], 2.4564705882352937)
([:Tridiagonal, :SymTridiagonal], 2.210109392682007)
([:SymTridiagonal, :SymTridiagonal], 2.437875751503006)
([:Bidiagonal, :SymTridiagonal], 2.491171597633136)
dkarrasch commented 1 year ago

Noticed a similar (?) issue with mul!(::Matrix, ::HermOrSym, ::HermOrSym)

I think the main reason for this is that * is allowed to simply allocate new plain arrays with easy memory access, as in

*(A::HermOrSym, B::HermOrSym) = A * copyto!(similar(parent(B)), B)

and subsequently use specialized BLAS routines. In contrast, mul!(C, ::Hermitian, ::Hermitian) falls back to generic_matmatmul!. The point is that the promise of mul! is that it operates in a memory-aware fashion and doesn't create copies of the factors. So what we do is add mul! methods whenever we realize a more efficient way of doing the multiplication without creating copies.

Another aspect is that you measure computation times with output dest::Matrix. In some structured cases, this means that mul! needs to overwrite many entries with zeros, whereas * may be able to create a specific output type that keeps all the zeros implicit, like Diagonal etc. Think of *(::Diagonal, ::Diagonal), compared to mul!(::Matrix, ::Diagonal, ::Diagonal) (and compared to mul!(::Diagonal, ::Diagonal, ::Diagonal))!
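To make the implicit-zeros point concrete, a small sketch of the structural difference between * and mul! into a dense destination:

```julia
using LinearAlgebra

D1 = Diagonal(rand(3))
D2 = Diagonal(rand(3))

# * is free to choose a structured output type that keeps the zeros implicit…
@assert D1 * D2 isa Diagonal

# …whereas mul! into a Matrix destination must write all 9 entries,
# including the off-diagonal zeros (β = 0 means C is not read, so the
# NaN fill is overwritten).
C = fill(NaN, 3, 3)
mul!(C, D1, D2)
@assert C ≈ Matrix(D1 * D2)
```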

leostenzel commented 1 year ago

Thanks for your reply! I agree with your arguments, and many cases already seem to be addressed by your PR. It seems logical that a user should, e.g., allocate a Triangular matrix if they multiply mul!(::Triangular, ::Diagonal, ::[Unit]Triangular). Though, it's surprising that dest::StridedMatrix is 40 times slower…

I'm just not happy with mul!(::StridedMatrix, ::HermOrSym, ::HermOrSym) (and its 5-arg counterpart). Wanted to propose something like:

mul!(C::StridedMatrix{T}, A::HermOrSym{T,<:StridedMatrix},
    B::HermOrSym{T,<:StridedMatrix}) where {T<:BlasFloat} = mul!(C, copyto!(A.data, A), B)

I guess it's not ideal to mutate A… But as a user, I'd be less surprised by a symmetrized A.data than by a 7-fold increase in runtime? I would do a corresponding PR, if you think that makes sense.
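The mutation in question can be made concrete; a sketch of what copyto!(A.data, A) does to the parent array (assuming Symmetric's default upper-triangle storage):

```julia
using LinearAlgebra

M = [1.0 2.0;
     3.0 4.0]
S = Symmetric(M)   # reads only the upper triangle; S[2,1] == S[1,2] == 2.0

# Copying the symmetric view back into its own parent symmetrizes it in place:
copyto!(S.data, S)

@assert M == [1.0 2.0; 2.0 4.0]   # the lower triangle has been overwritten
```

So after the proposed mul! method runs, the parent array of A would silently hold the symmetrized data, which is the surprise being weighed against the current slowdown.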