ryanelandt opened 6 years ago
Very interesting, thanks for looking into this.
The obvious issue with this is that it would require ForwardDiff to be a dependency of StaticArrays
Yes, especially since this dependency would be circular. Should StaticArrays maybe support some kind of type trait interface that ForwardDiff can then implement?
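One way such a type-trait interface could avoid the circular dependency: StaticArrays owns a small trait function with a generic fallback, and ForwardDiff extends it for Dual. A minimal sketch under that assumption; the trait name `scalar_width` and the toy modules are hypothetical, not existing API in either package:

```julia
# Hypothetical trait owned by "StaticArrays": how many bytes of plain
# scalar data does one element carry? Default: the element's own size.
module FakeStaticArrays
scalar_width(::Type{T}) where {T} = sizeof(T)
end

# "ForwardDiff" extends the trait without StaticArrays knowing about Dual.
module FakeForwardDiff
import ..FakeStaticArrays
# Toy stand-in for ForwardDiff.Dual{Tag,V,N}: one value plus N partials.
struct Dual{Tag,V,N}
    value::V
    partials::NTuple{N,V}
end
FakeStaticArrays.scalar_width(::Type{Dual{Tag,V,N}}) where {Tag,V,N} =
    (N + 1) * sizeof(V)
end

FakeStaticArrays.scalar_width(Float64)                                   # 8
FakeStaticArrays.scalar_width(FakeForwardDiff.Dual{Nothing,Float64,26})  # 216
```

StaticArrays' multiplication heuristic could then consult `scalar_width` without ever importing ForwardDiff.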
By the way, in Type_Dual = ForwardDiff.Dual{Float64,Float64,26} the first type parameter of a Dual is the tag type. While it's possible to use Float64 as the tag type and it shouldn't matter for this benchmark, I think it's unusual and could potentially result in perturbation confusion; just FYI.
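For reference, the parameter order is Dual{Tag,V,N}: tag first, then value type, then number of partials. A quick sketch (requires ForwardDiff to be installed; Nothing is the conventional "no tag" placeholder):

```julia
using ForwardDiff: Dual, value, partials

# Dual{Tag,V,N}: the FIRST parameter is the tag, used to avoid
# perturbation confusion in nested differentiation; V is the value
# type and N the number of partials.
d = Dual{Nothing}(1.0, 2.0, 3.0)  # value 1.0 with partials (2.0, 3.0)
value(d)     # 1.0
partials(d)  # the two partial components
typeof(d)    # tag = Nothing, V = Float64, N = 2
```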
Interesting.
I wonder if there is something dirt-simple we can do with sizeof(eltype(staticarray))? E.g. replace https://github.com/JuliaArrays/StaticArrays.jl/blob/master/src/matrix_multiply.jl#L75 with

if sa[1]*sa[2]*sb[2]*sizeof(Ta)*sizeof(Tb) <= 8

given that Float64 is 8 bytes and the cutoffs were (once upon a time, at least) reasonable for such matrices.
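To make the idea concrete, here's a rough sketch of such a size-aware cutoff as a standalone predicate. The function name is hypothetical, and I've rescaled the cutoff by sizeof(Float64)^2 so small Float64 products still pass; that rescaling is my assumption, not part of the quoted snippet:

```julia
# Hypothetical size-aware unrolling heuristic: weight the operation
# count by element sizes so wide eltypes fall back to the loop path.
# The `8 * sizeof(Float64)^2` cutoff is illustrative, not tuned.
function use_unrolled(sa::NTuple{2,Int}, sb::NTuple{2,Int},
                      ::Type{Ta}, ::Type{Tb}) where {Ta,Tb}
    return sa[1] * sa[2] * sb[2] * sizeof(Ta) * sizeof(Tb) <= 8 * sizeof(Float64)^2
end

use_unrolled((2, 2), (2, 2), Float64, Float64)  # true:  8 * 64 == 512
use_unrolled((4, 4), (4, 4), Float64, Float64)  # false: 64 * 64 > 512
```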
In reality, yes, we need a bunch of traits... Regarding the BlasFloat code, for instance, we need a trait to indicate mutable arrays with dense layout; right now we only use a predefined list of array types. (Honestly, I had been punting on that, waiting for Base to introduce some basic array traits.) In this case we could potentially introduce a broadcast-like "Multiplication Style" trait to determine the algorithm?
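A broadcast-like trait could look roughly like this. Every name below is hypothetical, patterned after Base.Broadcast.BroadcastStyle; it is a sketch, not proposed StaticArrays API:

```julia
# Hypothetical "multiplication style" trait: dispatch on the element
# type to pick an algorithm, instead of a predefined list of array types.
abstract type MulStyle end
struct UnrolledMul <: MulStyle end  # SIMD-friendly unrolled kernels
struct LoopMul     <: MulStyle end  # plain nested loops

MulStyle(::Type{<:Union{Float32,Float64}}) = UnrolledMul()  # floats: unroll
MulStyle(::Type) = LoopMul()                                # generic fallback

# A package like ForwardDiff could then opt its type into the loop path;
# `Dual` here is a stand-in struct, not the real ForwardDiff.Dual.
struct Dual{Tag,V,N} end
MulStyle(::Type{<:Dual}) = LoopMul()

MulStyle(Float64)                   # UnrolledMul()
MulStyle(Dual{Nothing,Float64,26})  # LoopMul()
```

The multiply entry point would then dispatch on `MulStyle(eltype(A))` the same way broadcasting dispatches on BroadcastStyle.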
Thanks for the correction, I always assumed the first was the real type and the second was the partial type. Should I put Nothing in it by default?
Anyway, I ran some more benchmarks, and I think the best solution is to change the StaticArrays heuristic to always call mul_loop when the element type is Any, for the reasons I will explain in this paragraph. The functions mul_unrolled_chunks and mul_unrolled speed up calculations for floats because of SIMD. Neither mul_unrolled_chunks nor mul_unrolled actually speeds up calculations when the data are not floats (see the new benchmark below). Also, mul_unrolled_chunks has an obvious bug, which suggests it doesn't get used much.
```julia
using BenchmarkTools
using StaticArrays
using SymPy
@syms x

int64_ones = @SMatrix ones(Int64, 4, 4)
for data_example in (1, x)
    A = int64_ones * data_example
    B = deepcopy(A)
    println("testing for: ", typeof(data_example))
    @btime StaticArrays.mul_loop($(Size(A)), $(Size(B)), $A, $B)
    @btime StaticArrays.mul_unrolled_chunks($(Size(A)), $(Size(B)), $A, $B)
    @btime StaticArrays.mul_unrolled($(Size(A)), $(Size(B)), $A, $B)
end

# testing for: Int64
#   19.738 ns (0 allocations: 0 bytes)
#   27.225 ns (0 allocations: 0 bytes)
#   19.596 ns (0 allocations: 0 bytes)
# testing for: Sym
#   84.293 μs (450 allocations: 8.91 KiB)
#   83.791 μs (466 allocations: 9.41 KiB)
#   84.167 μs (450 allocations: 8.91 KiB)
```
```julia
A = SMatrix{4,4,Any,16}(x, ones(15)...)
# This line will error:
StaticArrays.mul_unrolled_chunks(Size(A), Size(A), A, A)
```
I just checked and this is still a problem on julia-1.1.
Let's get https://github.com/JuliaArrays/StaticArrays.jl/pull/514 merged?
StaticArrays has heuristics that determine what code to generate to multiply matrices. There seems to be one heuristic for BlasFloat and one for everything else (i.e. Any). The present Any heuristic makes bad choices for Dual. I think that it would be straightforward to create a better heuristic for ForwardDiff.Dual by including the number of partials in the heuristic. The obvious issue with this is that it would require ForwardDiff to be a dependency of StaticArrays. Is there a good way to get this performance issue fixed?
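For illustration, folding the number of partials into the cutoff could look like the sketch below. All names are hypothetical, and a stand-in Dual keeps the snippet self-contained:

```julia
# Stand-in for ForwardDiff.Dual{Tag,V,N}: one value plus N partials.
struct Dual{Tag,V,N}
    value::V
    partials::NTuple{N,V}
end

# Effective scalars per element: 1 for plain numbers, N + 1 for a Dual.
scalars_per_element(::Type) = 1
scalars_per_element(::Type{Dual{Tag,V,N}}) where {Tag,V,N} = N + 1

# Hypothetical heuristic: unroll only while total scalar work is small.
# The cutoff of 64 is illustrative, not tuned.
should_unroll(m, k, n, ::Type{T}) where {T} =
    m * k * n * scalars_per_element(T) <= 64

should_unroll(4, 4, 4, Float64)                   # true:  64 scalars
should_unroll(4, 4, 4, Dual{Nothing,Float64,26})  # false: 1728 scalars
```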