I can reproduce those results on macOS. I found some interesting scenarios based on the proposed MWE:
If I use this version, I see all the allocations:
function mwe2(a, X, n)
    local K
    for i in 1:n
        k = a * i
        K = k * X * X'
    end
    return K
end
julia> @btime mwe2(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000)
163.540 ms (7000000 allocations: 289.92 MiB)
3×3 SMatrix{3, 3, Float64, 9} with indices SOneTo(3)×SOneTo(3):
10.0 10.0 10.0
10.0 10.0 10.0
10.0 10.0 10.0
However, if I remove the local variable k, everything works:
function mwe3(a, X, n)
    local K
    for i in 1:n
        K = a * i * X * X'
    end
    return K
end
julia> @btime mwe3(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000)
1.916 ns (0 allocations: 0 bytes)
3×3 SMatrix{3, 3, Float64, 9} with indices SOneTo(3)×SOneTo(3):
10.0 10.0 10.0
10.0 10.0 10.0
10.0 10.0 10.0
This is an inlining change. The call

│ %31 = invoke LinearAlgebra.broadcast(LinearAlgebra.:*::typeof(*), %29::Float64, X::SVector{3, Float64}, %30::Vararg{Any})::SMatrix{3, 3, Float64, 9}

no longer gets inlined, and we allocate because of it.
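One way to check this locally is to inspect the typed IR; a minimal sketch, assuming the mwe2 definition above (which hits the same k * X * X' pattern), would be:

using StaticArrays, InteractiveUtils

# Look for an `invoke ... broadcast` statement in the printed IR; if the
# broadcast call is not inlined, the loop allocates on every iteration.
@code_typed mwe2(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000)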
Changing the code to this:

function mwe1(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        K1 = k * X
        K += @inline K1 * X'
    end
    return K
end
fixes it, and it's actually better.
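To verify on a given machine, the split version can be benchmarked like the other variants; a minimal sketch, assuming the definitions above (timings will depend on the Julia version and hardware):

using StaticArrays, BenchmarkTools

# Compare the split-product version against the explicitly grouped one.
@btime mwe1(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000)
@btime mwe3(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000)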
Hi @gbaraldi!

"no longer gets inlined and we allocate because of it."

Will those allocations happen on every call to that function (scalar-vector product)? Or does it depend on the scenario? If so, I think I might experience a huge amount of performance degradation in our simulations. Is this something that might be reverted or fixed in 1.11?
@gbaraldi what's the point of your example? As far as I can see, it uses different code paths, so it seems irrelevant here.
To be specific, the MWE is using a three-argument * method, while your example only calls two-argument * methods.

I don't remember in which release cycle (could well be in v1.11), but we introduced three-arg * specializations, and that "generic" choice seems to be a bad one for static arrays, as changing the order of multiplication makes a gigantic difference (mwe1 vs mwe3 in the OP). I'm not sure if we want to change the order of multiplication generically, or whether this should be specified by the user.
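For reference, the order can already be specified explicitly by the user with parentheses; a minimal sketch of the two groupings, assuming the same setup as above (variable names are illustrative):

using StaticArrays

a, X = 1e-5, SVector{3}(1.0, 1.0, 1.0)
k = a * 2

K_outer_first = k * (X * X')   # build the 3×3 outer product, then scale it
K_scale_first = (k * X) * X'   # scale the vector, then take the outer product
K_outer_first ≈ K_scale_first  # same result, but different methods get called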
3-arg * was new in Julia 1.7. That did have a small negative effect on square SMatrices (a few ns), and at the time I made a PR to address this by omitting the size checks, https://github.com/JuliaArrays/StaticArrays.jl/pull/919 (not merged).

Edit: sorry, this is scalar-vector-adjointvec, which now calls broadcast(*, k, X, X'). Not sure whether there was any regression on 1.7 for SVectors. The earlier behaviour would have been K += (k * X) * X', which seems as fast as mwe3's K += k * (X * X') now.
I am a bit surprised that the version with a * i * X * X' is fast, as this does not call a 4-arg * method, but instead ends up as (a * i) * X * X' and then the same 3-arg * via broadcast. But perhaps the inlining is different.
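One way to see which methods actually get selected, rather than guessing from the lowering, is @which; a sketch, assuming the same values as in the MWE:

using StaticArrays, InteractiveUtils

a, i, X = 1e-5, 2, SVector{3}(1.0, 1.0, 1.0)

@which a * i * X * X'      # method chosen for the chained four-operand product
@which (a * i) * X * X'    # method chosen once the scalar product is grouped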
Time flies. 🤦
@nsajko: @gbaraldi's quote stems from an inspection of @code_typed mwe1(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000), and the corresponding broadcast call is absent on v1.10. Instead, one finds the inlined code for all the products and additions there. So, indeed, it seems that inlining thresholds have changed, and splitting the multiplications (as he suggested) improves performance beyond what one can achieve with setting brackets explicitly (EDIT: the @inline annotation doesn't even seem to make a difference for me).
Should we close this, or is there anything actionable here?
Closing - but please reopen if necessary.
As also described in https://github.com/JuliaArrays/StaticArrays.jl/issues/1282, there is a large performance regression with v1.11 when using StaticArrays. Interestingly, the problem can be solved by changing k * X * X' to k * (X * X').
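A minimal sketch of both versions, assuming the same setup as the mwe functions quoted above (the function names below are illustrative, not the original ones):

using StaticArrays, BenchmarkTools

# Pattern reported to allocate on v1.11: k * X * X' dispatches to the
# three-argument * method, which is implemented via broadcast.
function mwe_slow(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        K += k * X * X'
    end
    return K
end

# Workaround: group the outer product explicitly so that only
# two-argument * methods are called.
function mwe_fast(a, X, n)
    K = zeros(SMatrix{3,3})
    for i in 1:n
        k = a * i
        K += k * (X * X')
    end
    return K
end

@btime mwe_slow(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000)
@btime mwe_fast(1e-5, SVector{3}(1.0, 1.0, 1.0), 1_000_000)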