Autotuning - Githubissues

Please don't be scared by the title and think it's going to take a few days to do :-). It should be done in less than 10 minutes. Here is the plan that @chriselrod and I came up with:

1. Search for a good kernel size.
2. Compute the cache size with an analytical model.
3. Search for a good packing strategy.

[1] can be done by directly calling the packing=(Val(true), Val(true)) macro kernel with different micro_m and micro_n values, and benchmarking it on 400 x 400 and 397 x 397 DGEMM (all other element types can be handled by just rescaling micro_m).
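A minimal sketch of what such a search could look like. Everything here is an assumption made for illustration: `bench` is a hypothetical stand-in for timing the macro kernel (it is not a MaBLAS function), the candidate ranges are arbitrary, and summing the timings of the two problem sizes is just one possible way to combine them.

```julia
# Exhaustive search over candidate micro-kernel sizes.
# `bench(micro_m, micro_n, sz)` is a hypothetical callback that returns the
# time (in seconds) of an sz x sz DGEMM with the given micro-kernel size.
function search_kernel_size(bench, micro_ms, micro_ns; sizes = (400, 397))
    best = (micro_m = first(micro_ms), micro_n = first(micro_ns), seconds = Inf)
    for micro_m in micro_ms, micro_n in micro_ns
        # Assumption: score a candidate by its total time over all sizes.
        t = sum(bench(micro_m, micro_n, sz) for sz in sizes)
        if t < best.seconds
            best = (micro_m = micro_m, micro_n = micro_n, seconds = t)
        end
    end
    return best
end

# Synthetic cost function, used only to exercise the search.
# A real run would time the macro kernel here instead.
fake_bench(m, n, sz) = (abs(m - 24) + abs(n - 9) + 1) * sz / 100

best = search_kernel_size(fake_bench, 8:8:32, 4:12)
println("micro_m = ", best.micro_m, ", micro_n = ", best.micro_n)
```

Including both 400 and 397 in the score means a candidate is judged on a size that divides evenly into many kernel shapes as well as on one that leaves remainder tiles.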

[2] can be done with formulae that depend on the cache properties.
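The issue does not spell out the formulae, so the following is only a sketch of the kind of analytical model commonly used for BLIS-style GEMM, assuming that "cache size" refers to the cache blocking parameters. The names `block_sizes`, `mc`, `kc`, `nc` and the fill fraction `frac` are assumptions, not MaBLAS API.

```julia
# Derive cache block sizes from cache capacities (all sizes in bytes).
# Assumed model: the micro-panels of A and B share L1, a block of A lives
# in L2, and a panel of B lives in L3. `frac` is the assumed usable fraction
# of each cache level.
function block_sizes(::Type{T}, micro_m, micro_n; l1, l2, l3, frac = 0.5) where {T}
    elsz = sizeof(T)
    # kc: a (micro_m x kc) panel of A plus a (kc x micro_n) panel of B fit in L1.
    kc = floor(Int, frac * l1 / (elsz * (micro_m + micro_n)))
    # mc: an (mc x kc) block of A fits in L2, rounded down to a multiple of micro_m.
    mc = floor(Int, frac * l2 / (elsz * kc)) ÷ micro_m * micro_m
    # nc: a (kc x nc) panel of B fits in L3, rounded down to a multiple of micro_n.
    nc = floor(Int, frac * l3 / (elsz * kc)) ÷ micro_n * micro_n
    return (mc = mc, kc = kc, nc = nc)
end

# Example cache capacities (32 KiB L1, 1 MiB L2, 16 MiB L3); these are assumptions.
sizes = block_sizes(Float64, 8, 6; l1 = 32 * 1024, l2 = 1024 * 1024, l3 = 16 * 1024 * 1024)
println(sizes)
```

Because the block sizes follow directly from the cache capacities and the kernel size found in [1], this step needs no benchmarking at all.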

[3] can be done efficiently with bisection, assuming the timings with and without packing cross exactly once.
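A rough sketch of the bisection, assuming the crossing is a matrix size below which skipping packing is faster and above which packing is faster. `packing_wins` is a hypothetical predicate; in a real tuner it would benchmark both variants at size `n` and compare them.

```julia
# Find the smallest size in lo:hi at which packing becomes faster.
# Requires the single-crossing assumption: packing_wins is false for all
# sizes below the crossover and true from there on.
function find_crossover(packing_wins, lo::Int, hi::Int)
    packing_wins(lo) && return lo        # packing already wins at the lower bound
    packing_wins(hi) || return nothing   # packing never wins within the range
    # Invariant: packing loses at `lo` and wins at `hi`.
    while hi - lo > 1
        mid = (lo + hi) ÷ 2
        if packing_wins(mid)
            hi = mid
        else
            lo = mid
        end
    end
    return hi
end

# Synthetic predicate with a known crossover, used only to exercise the search.
crossover = find_crossover(n -> n >= 137, 8, 1024)
println("packing pays off from size ", crossover)
```

Bisection needs only about log2(hi - lo) benchmark comparisons, which is what keeps this step cheap. If the timings cross more than once, the result is just one of the crossings.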

Autotuning is off by default; one can enable it with:

ENV["AUTOTUNE_MABLAS"] = true
] build MaBLAS

YingboMa / MaBLAS.jl

Autotuning #9