ryanelandt opened 6 years ago
Very interesting, thanks for looking into this.
The obvious issue with this is that it would require ForwardDiff to be a dependency of StaticArrays
Yes, especially since this dependency would be circular. Should StaticArrays maybe support some kind of type trait interface that ForwardDiff can then implement?
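One way such a type-trait interface could avoid the circular dependency: StaticArrays owns a small trait function with a generic fallback, and ForwardDiff extends it for Dual. A minimal sketch under that assumption; the trait name `scalar_width` and the toy modules are hypothetical, not existing API in either package:

```julia
# Hypothetical trait owned by "StaticArrays": how many bytes of plain
# scalar data does one element carry? Default: the element's own size.
module FakeStaticArrays
scalar_width(::Type{T}) where {T} = sizeof(T)
end

# "ForwardDiff" extends the trait without StaticArrays knowing about Dual.
module FakeForwardDiff
import ..FakeStaticArrays
# Toy stand-in for ForwardDiff.Dual{Tag,V,N}: one value plus N partials.
struct Dual{Tag,V,N}
    value::V
    partials::NTuple{N,V}
end
FakeStaticArrays.scalar_width(::Type{Dual{Tag,V,N}}) where {Tag,V,N} =
    (N + 1) * sizeof(V)
end

FakeStaticArrays.scalar_width(Float64)                                   # 8
FakeStaticArrays.scalar_width(FakeForwardDiff.Dual{Nothing,Float64,26})  # 216
```

StaticArrays' multiplication heuristic could then consult `scalar_width` without ever importing ForwardDiff.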
By the way, in Type_Dual = ForwardDiff.Dual{Float64,Float64,26} the first type parameter of a Dual is the tag type. While it's possible to use Float64 as the tag type and it shouldn't matter for this benchmark, I think it's unusual and could potentially result in perturbation confusion; just FYI.
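For reference, the parameter order is Dual{Tag,V,N}: tag first, then value type, then number of partials. A quick sketch (requires ForwardDiff to be installed; Nothing is the conventional "no tag" placeholder):

```julia
using ForwardDiff: Dual, value, partials

# Dual{Tag,V,N}: the FIRST parameter is the tag, used to avoid
# perturbation confusion in nested differentiation; V is the value
# type and N the number of partials.
d = Dual{Nothing}(1.0, 2.0, 3.0)  # value 1.0 with partials (2.0, 3.0)
value(d)     # 1.0
partials(d)  # the two partial components
typeof(d)    # tag = Nothing, V = Float64, N = 2
```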
Interesting.
I wonder if there is something dirt-simple we can do with sizeof(eltype(staticarray))? E.g. replace https://github.com/JuliaArrays/StaticArrays.jl/blob/master/src/matrix_multiply.jl#L75 with

if sa[1]*sa[2]*sb[2]*sizeof(Ta)*sizeof(Tb) <= 8

given that Float64 is 8 bytes and the cutoffs were (once upon a time, at least) reasonable for such matrices.
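To make the idea concrete, here's a rough sketch of such a size-aware cutoff as a standalone predicate. The function name is hypothetical, and I've rescaled the cutoff by sizeof(Float64)^2 so small Float64 products still pass; that rescaling is my assumption, not part of the quoted snippet:

```julia
# Hypothetical size-aware unrolling heuristic: weight the operation
# count by element sizes so wide eltypes fall back to the loop path.
# The `8 * sizeof(Float64)^2` cutoff is illustrative, not tuned.
function use_unrolled(sa::NTuple{2,Int}, sb::NTuple{2,Int},
                      ::Type{Ta}, ::Type{Tb}) where {Ta,Tb}
    return sa[1] * sa[2] * sb[2] * sizeof(Ta) * sizeof(Tb) <= 8 * sizeof(Float64)^2
end

use_unrolled((2, 2), (2, 2), Float64, Float64)  # true:  8 * 64 == 512
use_unrolled((4, 4), (4, 4), Float64, Float64)  # false: 64 * 64 > 512
```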
In reality, yes, we need a bunch of traits... Regarding the BlasFloat code, for instance, we need a trait to indicate mutable arrays with dense layout; right now we only use a predefined list of array types. (Honestly, I had been punting on that, waiting for Base to introduce some basic array traits.) In this case we could potentially introduce a broadcast-like "Multiplication Style" trait to determine the algorithm?
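A broadcast-like trait could look roughly like this. Every name below is hypothetical, patterned after Base.Broadcast.BroadcastStyle; it is a sketch, not proposed StaticArrays API:

```julia
# Hypothetical "multiplication style" trait: dispatch on the element
# type to pick an algorithm, instead of a predefined list of array types.
abstract type MulStyle end
struct UnrolledMul <: MulStyle end  # SIMD-friendly unrolled kernels
struct LoopMul     <: MulStyle end  # plain nested loops

MulStyle(::Type{<:Union{Float32,Float64}}) = UnrolledMul()  # floats: unroll
MulStyle(::Type) = LoopMul()                                # generic fallback

# A package like ForwardDiff could then opt its type into the loop path;
# `Dual` here is a stand-in struct, not the real ForwardDiff.Dual.
struct Dual{Tag,V,N} end
MulStyle(::Type{<:Dual}) = LoopMul()

MulStyle(Float64)                   # UnrolledMul()
MulStyle(Dual{Nothing,Float64,26})  # LoopMul()
```

The multiply entry point would then dispatch on `MulStyle(eltype(A))` the same way broadcasting dispatches on BroadcastStyle.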
Thanks for the correction, I always assumed the first was the real type and the second was the partial type. Should I put Nothing in it by default?
Anyway, I ran some more benchmarks, and I think the best solution is to change the StaticArrays heuristic to always call mul_loop when the element type is Any, for the reasons I will explain in this paragraph. The functions mul_unrolled_chunks and mul_unrolled speed up calculations for floats because of SIMD. Neither mul_unrolled_chunks nor mul_unrolled actually speeds up calculations when the data are not floats (see the new benchmark below). Also, mul_unrolled_chunks has an obvious bug, which suggests it doesn't get used much.
```julia
using BenchmarkTools
using StaticArrays
using SymPy
@syms x

int64_ones = @SMatrix ones(Int64, 4, 4)
for data_example in (1, x)
    A = int64_ones * data_example
    B = deepcopy(A)
    println("testing for: ", typeof(data_example))
    @btime StaticArrays.mul_loop($(Size(A)), $(Size(B)), $A, $B)
    @btime StaticArrays.mul_unrolled_chunks($(Size(A)), $(Size(B)), $A, $B)
    @btime StaticArrays.mul_unrolled($(Size(A)), $(Size(B)), $A, $B)
end

# testing for: Int64
#   19.738 ns (0 allocations: 0 bytes)
#   27.225 ns (0 allocations: 0 bytes)
#   19.596 ns (0 allocations: 0 bytes)
# testing for: Sym
#   84.293 μs (450 allocations: 8.91 KiB)
#   83.791 μs (466 allocations: 9.41 KiB)
#   84.167 μs (450 allocations: 8.91 KiB)
```
```julia
A = SMatrix{4,4,Any,16}(x, ones(15)...)
# This line will error:
StaticArrays.mul_unrolled_chunks(Size(A), Size(A), A, A)
```
I just checked and this is still a problem on julia-1.1.
Let's get https://github.com/JuliaArrays/StaticArrays.jl/pull/514 merged?
StaticArrays has heuristics that determine what code to generate to multiply matrices. There seems to be one heuristic for BlasFloat and one for everything else (i.e. Any). The present Any heuristic makes bad choices for Dual. I think that it would be straightforward to create a better heuristic for ForwardDiff.Dual by including the number of partials in the heuristic. The obvious issue with this is that it would require ForwardDiff to be a dependency of StaticArrays. Is there a good way to get this performance issue fixed?
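For illustration, folding the number of partials into the cutoff could look like the sketch below. All names are hypothetical, and a stand-in Dual keeps the snippet self-contained:

```julia
# Stand-in for ForwardDiff.Dual{Tag,V,N}: one value plus N partials.
struct Dual{Tag,V,N}
    value::V
    partials::NTuple{N,V}
end

# Effective scalars per element: 1 for plain numbers, N + 1 for a Dual.
scalars_per_element(::Type) = 1
scalars_per_element(::Type{Dual{Tag,V,N}}) where {Tag,V,N} = N + 1

# Hypothetical heuristic: unroll only while total scalar work is small.
# The cutoff of 64 is illustrative, not tuned.
should_unroll(m, k, n, ::Type{T}) where {T} =
    m * k * n * scalars_per_element(T) <= 64

should_unroll(4, 4, 4, Float64)                   # true:  64 scalars
should_unroll(4, 4, 4, Dual{Nothing,Float64,26})  # false: 1728 scalars
```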