Closed bluss closed 5 years ago
What I am currently unhappy with is that our definitions of `KernelAvx` or `KernelFma` (once the other PR is merged) all assume Sandy Bridge as the optimal architecture by hardcoding 8x4 or 8x8 kernels, respectively. These are however not necessarily optimal for each architecture. For example, if I recall correctly, Haswell has two FMA units (I hope I am using the right word), while Zen only has one. So the optimal kernels for the two would be slightly different.
The same is true for the kernel sizes themselves, which depend on the latency of the various operations and on how many operations can be executed simultaneously.
Does it make sense to introduce compile time flags to prefer certain kernels?
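One way such a preference could be expressed is a Cargo feature checked with `cfg`. A minimal sketch, assuming a hypothetical `prefer-8x8` feature name (invented for illustration, not an existing flag of this crate):

```rust
// Hypothetical sketch: a Cargo feature switching the default kernel
// shape at compile time. The `prefer-8x8` feature name is invented.
#[cfg(feature = "prefer-8x8")]
const MR_NR: (usize, usize) = (8, 8); // FMA-friendly shape
#[cfg(not(feature = "prefer-8x8"))]
const MR_NR: (usize, usize) = (8, 4); // Sandy Bridge-style default

fn main() {
    println!("default kernel shape: {:?}", MR_NR);
}
```

The feature would be declared in `Cargo.toml` under `[features]` and enabled by downstream crates that want the alternate shape.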
@SuperFluffy this PR enables custom MR/NR and all other parameters, so I'm happy with that as a good first step. We can now resolve to any specific GemmKernel at runtime or compile time inside `dgemm_kernel::select`. And we'll move things out to files named by architecture soon as well.
I'd look at whether and how we can detect microarchitecture- and configuration-specific parameters (cache sizes) at build time or runtime. I'm wary of only two things: optimizing for computers we don't have, which doesn't make much sense to me, and going too deep into this — I don't set out to replace BLAS or BLIS, but to get good performance by relatively simple and maintainable means.
Again, I think performance that's enabled by default/automatically is the most interesting, since that's what's actually going to be used most of the time. That said, if we get more advanced options, there will be a need to enable/disable them.
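A minimal sketch of what runtime detection could look like, using std's `is_x86_feature_detected!`. The `KernelParams` struct and the concrete shapes below are illustrative assumptions, not this crate's API:

```rust
// Hypothetical sketch: choose kernel parameters at runtime based on
// detected CPU features. Struct and constants are illustrative.
#[derive(Debug, PartialEq)]
struct KernelParams {
    mr: usize, // rows of the register-blocked micro-kernel
    nr: usize, // columns of the micro-kernel
}

fn select_params() -> KernelParams {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("fma") {
            return KernelParams { mr: 8, nr: 8 }; // e.g. an FMA-tuned shape
        }
        if is_x86_feature_detected!("avx") {
            return KernelParams { mr: 8, nr: 4 }; // e.g. Sandy Bridge-style
        }
    }
    KernelParams { mr: 4, nr: 4 } // generic fallback kernel
}

fn main() {
    println!("selected: {:?}", select_params());
}
```

Cache sizes are harder: there is no std API for them, so they would need `CPUID` queries or OS-specific lookups, which is part of the "going too deep" risk mentioned above.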
This PR is also important for restoring performance on non-x86 since we can get separate settings for the fallback kernels.
This PR seems to give a very small improvement for sgemm avx on my configuration, probably due to amortizing the feature detection, or to changed inlining decisions.
```
name                 63 ns/iter  62 ns/iter  diff ns/iter  diff %
layout_f32_032::ccc       2,025       1,984           -41  -2.02%
layout_f32_032::ccf       2,029       1,989           -40  -1.97%
layout_f32_032::cfc       2,273       2,239           -34  -1.50%
layout_f32_032::cff       2,279       2,247           -32  -1.40%
layout_f32_032::fcc       1,773       1,726           -47  -2.65%
layout_f32_032::fcf       1,772       1,731           -41  -2.31%
layout_f32_032::ffc       2,029       2,040            11   0.54%
layout_f32_032::fff       2,034       2,032            -2  -0.10%
mat_mul_f32::m004           211         210            -1  -0.47%
mat_mul_f32::m006           211         209            -2  -0.95%
mat_mul_f32::m008           164         165             1   0.61%
mat_mul_f32::m012           531         529            -2  -0.38%
mat_mul_f32::m016           488         462           -26  -5.33%
mat_mul_f32::m032         2,060       1,987           -73  -3.54%
mat_mul_f32::m064        13,231      13,076          -155  -1.17%
mat_mul_f32::m127        95,668      93,970        -1,698  -1.77%
```
But the best improvement is that each kernel, including the fallback kernel, can now make its own inlining and size decisions, and they can all optimize better (and autovectorize, where applicable).
Allow separate GemmKernel implementations per feature, so that we can tweak kernel parameters per feature. This also allows us to restore the performance of the fallback kernels.
We introduce a selector trait. When entering, for example, `sgemm`, we pass control to `sgemm_kernel::detect`, which selects the kernel implementation to use. Control then passes back to the gemm module, which launches the `gemm_loop` using the selected kernel.

This approach allows us to insert a cache (ifunc strategy #23) in the gemm module if we want to cache the function pointer to avoid running detection many times. But it's unlikely that we are going to need that.
This is also an improvement since we no longer need to run detection on every kernel invocation! Repeated detection is just an atomic load and compare, so it's pretty cheap, but still.
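If the cached-function-pointer route ever became necessary, it could look roughly like this sketch. All names are invented, and the detection condition is a placeholder; a real version would run CPU feature detection inside `detect_and_cache`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical sketch of the ifunc-style cache: store a function
// pointer in an atomic, so detection runs once and later calls are a
// plain atomic load.
static KERNEL: AtomicUsize = AtomicUsize::new(0);

fn kernel_fallback() -> u32 { 4 } // stand-in for the fallback kernel
fn kernel_avx() -> u32 { 8 }      // stand-in for the AVX kernel

fn detect_and_cache() -> fn() -> u32 {
    // Placeholder for real feature detection.
    let f: fn() -> u32 = if cfg!(target_arch = "x86_64") {
        kernel_avx
    } else {
        kernel_fallback
    };
    KERNEL.store(f as usize, Ordering::Relaxed);
    f
}

fn call_kernel() -> u32 {
    let p = KERNEL.load(Ordering::Relaxed);
    let f = if p == 0 {
        detect_and_cache()
    } else {
        // Safety: the only nonzero values stored are `fn() -> u32`.
        unsafe { std::mem::transmute::<usize, fn() -> u32>(p) }
    };
    f()
}

fn main() {
    println!("kernel width: {}", call_kernel());
}
```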
This change slightly increases the amount of compiled code due to instantiating the gemm loop with multiple kernels, but the compile-time difference is small (about +25% in a crate that already compiles very quickly, less than 3 seconds in release mode on my configuration).
We also add some ad-hoc types for static dimensions, so that the matrix packing function is instantiated appropriately for each kernel width, now that kernel widths are more diverse.
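The static-dimension idea can be sketched like this; the `ConstNum`/`U4`/`U8` names are illustrative, and the real packing code operates on matrix panels rather than a flat slice.

```rust
// Hypothetical sketch of ad-hoc static dimension types: unit types
// carrying a constant, so packing is monomorphized per kernel width
// and the compiler can unroll for the known constant.
trait ConstNum {
    const VALUE: usize;
}

struct U4;
struct U8;
impl ConstNum for U4 { const VALUE: usize = 4; }
impl ConstNum for U8 { const VALUE: usize = 8; }

// Instantiated once per width `W`; edge chunks are zero-padded to a
// full panel of W::VALUE elements.
fn pack<W: ConstNum>(src: &[f32]) -> Vec<f32> {
    src.chunks(W::VALUE)
        .flat_map(|chunk| {
            let mut panel = vec![0.0; W::VALUE];
            panel[..chunk.len()].copy_from_slice(chunk);
            panel
        })
        .collect()
}

fn main() {
    println!("{:?}", pack::<U4>(&[1.0, 2.0, 3.0, 4.0, 5.0]));
}
```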