YingboMa / MaBLAS.jl


Autotuning #9

Open YingboMa opened 4 years ago

YingboMa commented 4 years ago

Please don't be scared by the title and think it's going to take days to finish :-). It should be done in less than 10 minutes. Here is the plan @chriselrod and I came up with.

  1. search for a good kernel size
  2. compute the cache block sizes with an analytical model
  3. search for a good packing strategy

[1] can be done by directly calling the macro kernel with `packing=(Val(true), Val(true))` for different `micro_m`s and `micro_n`s, and benchmarking it on 400 × 400 and 397 × 397 DGEMM. All other element types can be handled by just rescaling `micro_m`.
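A minimal sketch of step [1] as a plain grid search. `time_kernel(micro_m, micro_n, n)` is a hypothetical callback standing in for benchmarking the packed macro kernel on an n × n DGEMM; it is not a MaBLAS API, and the candidate ranges are assumptions. Summing over both sizes matters because 397 is prime, so no `micro_m > 1` divides it and the edge handling is exercised too. The rescaling helper assumes the same SIMD registers are reused, so a narrower type fits proportionally more elements.

```julia
# Hypothetical grid search over micro-kernel shapes (not MaBLAS's implementation).
# `time_kernel(m, n, size)` should return the runtime of the packed macro kernel.
function search_kernel(time_kernel; micro_ms = 4:4:16, micro_ns = 1:8, sizes = (400, 397))
    best = (micro_m = 0, micro_n = 0, time = Inf)
    for m in micro_ms, n in micro_ns
        # Sum over a divisible size (400) and a prime size (397) so edge cases count too
        t = sum(time_kernel(m, n, s) for s in sizes)
        if t < best.time
            best = (micro_m = m, micro_n = n, time = t)
        end
    end
    return best
end

# Rescale a micro_m tuned for Float64 to element type T, assuming the same
# registers hold sizeof(Float64) ÷ sizeof(T) times as many elements.
rescale_micro_m(micro_m_f64, ::Type{T}) where {T} = micro_m_f64 * sizeof(Float64) ÷ sizeof(T)
```

For example, `rescale_micro_m(8, Float32)` gives 16.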

[2] can be done with a few formulae that depend on the cache properties (e.g. cache sizes).
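A rough sketch of what such formulae might look like, using the simple Goto-style heuristic of filling about half of each cache level rather than the full analytical model planned here. The function name and the default cache sizes (32 KiB L1, 256 KiB L2, 8 MiB L3) are assumptions for illustration only.

```julia
# Hypothetical, simplified blocking model: each packed panel/block fills ~half a cache level.
# Not MaBLAS's formulae; cache sizes are assumed defaults in bytes.
function block_sizes(::Type{T}, micro_m, micro_n;
                     L1 = 32 * 1024, L2 = 256 * 1024, L3 = 8 * 1024^2) where {T}
    s = sizeof(T)
    # kc: a micro_m × kc sliver of A plus a kc × micro_n sliver of B fit in half of L1
    kc = (L1 ÷ 2) ÷ (s * (micro_m + micro_n))
    # mc: the packed mc × kc block of A fits in half of L2, rounded down to a multiple of micro_m
    mc = ((L2 ÷ 2) ÷ (s * kc)) ÷ micro_m * micro_m
    # nc: the packed kc × nc block of B fits in half of L3, rounded down to a multiple of micro_n
    nc = ((L3 ÷ 2) ÷ (s * kc)) ÷ micro_n * micro_n
    return (mc = mc, kc = kc, nc = nc)
end
```

With `block_sizes(Float64, 8, 6)` this gives `(mc = 112, kc = 146, nc = 3588)`. A real model would also account for associativity and line size.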

[3] can be done efficiently with bisection over the matrix size, assuming the timing curves of the packing strategies cross exactly once.
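A minimal bisection sketch for step [3]. `packing_wins(n)` is a hypothetical predicate that would benchmark both strategies at size n and return `true` when packing is faster; the single-crossing assumption means it is `false` below some threshold and `true` from there on.

```julia
# Find the smallest n in [lo, hi] where `packing_wins(n)` is true, assuming the
# predicate switches from false to true exactly once (a single crossing).
function find_crossover(packing_wins, lo::Integer, hi::Integer)
    packing_wins(hi) || return nothing   # packing never wins in this range
    packing_wins(lo) && return lo        # packing already wins at the low end
    # Invariant: packing_wins(lo) == false and packing_wins(hi) == true
    while hi - lo > 1
        mid = (lo + hi) ÷ 2
        if packing_wins(mid)
            hi = mid
        else
            lo = mid
        end
    end
    return hi
end
```

Since each probe means running benchmarks, bisection keeps the cost to about log2(hi - lo) probes, roughly 10 for a 1 to 1000 range.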

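The autotuning is off by default. To enable it, set the environment variable and rebuild the package:

```julia
ENV["AUTOTUNE_MABLAS"] = true  # ENV stores this as the string "true"
using Pkg
Pkg.build("MaBLAS")            # same as `] build MaBLAS` in the Pkg REPL
```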
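A sketch of how a build script might check this flag. The helper `autotune_enabled` is hypothetical and not taken from MaBLAS; it relies only on `get` with a default, so it works on `ENV` or any `Dict{String,String}`, and it treats `"true"` or `"1"` as enabled since environment variables are always strings.

```julia
# Hypothetical build-time check; anything other than "true"/"1" keeps autotuning off.
autotune_enabled(env = ENV) = lowercase(get(env, "AUTOTUNE_MABLAS", "false")) in ("true", "1")
```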