Please don't be scared by the title and think it's going to take a few days to do :-). It should be done in less than 10 minutes. Here is the plan @chriselrod and I came up with:
1. Search for a good kernel size.
2. Compute the cache size with an analytical model.
3. Search for a good packing strategy.
[1] can be done by directly calling the `packing=(Val(true), Val(true))` macro kernel with different `micro_m`s and `micro_n`s, and benchmarking the macro kernel on `400 x 400` and `397 x 397` sized DGEMM (all other types can be handled by just rescaling `micro_m`).
[2] can be done with formulae that depend on the cache properties.
[3] can be done efficiently with bisection, assuming there is one and only one crossing.
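As a rough illustration of the bisection idea in [3], here is a minimal sketch. It assumes the single-crossing property stated above: below some problem size packing loses, and at or above it packing wins. The names `find_crossover`, `packing_wins`, and `toy_packing_wins` are hypothetical and not part of this PR; a real `packing_wins` would benchmark the macro kernel with and without packing at size `n`.

```julia
# Hypothetical sketch: bisect for the smallest size at which packing pays off.
# `packing_wins(n)` is an assumed predicate that returns true when the packed
# kernel is faster than the unpacked one at problem size `n`.
function find_crossover(packing_wins, lo::Int, hi::Int)
    packing_wins(lo) && return lo        # packing already wins at the lower bound
    packing_wins(hi) || return nothing   # no crossing inside [lo, hi]
    # Invariant: packing loses at `lo` and wins at `hi`.
    while hi - lo > 1
        mid = lo + (hi - lo) ÷ 2
        if packing_wins(mid)
            hi = mid
        else
            lo = mid
        end
    end
    return hi                            # smallest size at which packing wins
end

# Toy stand-in for a real benchmark, with an arbitrary crossing at n = 137.
toy_packing_wins(n) = n >= 137
crossover = find_crossover(toy_packing_wins, 1, 1024)
```

With one crossing in the interval, this needs only about `log2(hi - lo)` benchmark runs, which is what keeps the search cheap. If the timings are noisy enough to produce more than one apparent crossing, the result depends on which side of the noise each probe lands on.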
The autotuning is off by default, and one can enable it with