bluss / matrixmultiply

General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.
https://docs.rs/matrixmultiply/
Apache License 2.0

More flexible kernel selection #37

Closed: bluss closed this 5 years ago

bluss commented 5 years ago

Allow a separate GemmKernel implementation per target feature, so that we can tweak kernel parameters for each feature. This also allows us to restore the performance of the fallback kernels.

We introduce a selector trait: when entering, for example, sgemm, control passes to sgemm_kernel::detect, which selects the kernel implementation to use. Control then passes back to the gemm module, which launches gemm_loop with the selected kernel.
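For concreteness, here is a minimal Rust sketch of that dispatch shape. All names below (GemmKernel, KernelAvx, KernelFallback, reference_tile, and the simplified signatures) are illustrative stand-ins for the structure described above, not the crate's actual API:

```rust
trait GemmKernel {
    /// Microkernel tile size: MR rows by NR columns of C.
    const MR: usize;
    const NR: usize;
    /// Compute one MR x NR tile of C from packed panels of A and B.
    fn kernel_tile(c: &mut [f32], a: &[f32], b: &[f32], k: usize);
}

struct KernelAvx;
struct KernelFallback;

impl GemmKernel for KernelAvx {
    const MR: usize = 8;
    const NR: usize = 8;
    fn kernel_tile(c: &mut [f32], a: &[f32], b: &[f32], k: usize) {
        // stand-in body; a real kernel would use AVX intrinsics behind
        // #[target_feature(enable = "avx")]
        reference_tile(Self::MR, Self::NR, c, a, b, k)
    }
}

impl GemmKernel for KernelFallback {
    const MR: usize = 4;
    const NR: usize = 4;
    fn kernel_tile(c: &mut [f32], a: &[f32], b: &[f32], k: usize) {
        reference_tile(Self::MR, Self::NR, c, a, b, k)
    }
}

/// Shared reference update: C[mr x nr] += A-panel * B-panel.
fn reference_tile(mr: usize, nr: usize, c: &mut [f32], a: &[f32], b: &[f32], k: usize) {
    for p in 0..k {
        for i in 0..mr {
            for j in 0..nr {
                c[i * nr + j] += a[p * mr + i] * b[p * nr + j];
            }
        }
    }
}

/// The `detect` step: select a kernel once, then launch the loop
/// monomorphized for that kernel.
fn sgemm_detect() {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            return gemm_loop::<KernelAvx>();
        }
    }
    gemm_loop::<KernelFallback>();
}

fn gemm_loop<K: GemmKernel>() {
    // pack A into MR-wide and B into NR-wide panels, then call
    // K::kernel_tile over the blocked iteration space (elided here)
}
```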

This approach also lets us insert a cache (the ifunc strategy, #23) in the gemm module, storing the selected function pointer to avoid running detection many times. But it's unlikely that we are going to need that.
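If such a cache is ever wanted, here is one hedged sketch of the idea using modern std; the names and the simplified signature are hypothetical:

```rust
use std::sync::OnceLock;

// Hypothetical, simplified signature; the real entry point takes
// matrix pointers, dimensions, and strides.
type SgemmFn = fn(m: usize, k: usize, n: usize);

/// Run feature detection and return the matching monomorphized loop.
fn select() -> SgemmFn {
    sgemm_fallback // stand-in choice
}

fn sgemm_fallback(_m: usize, _k: usize, _n: usize) { /* ... */ }

static SGEMM: OnceLock<SgemmFn> = OnceLock::new();

pub fn sgemm(m: usize, k: usize, n: usize) {
    // the first call runs `select`; every later call is just a load
    // of the cached function pointer
    (SGEMM.get_or_init(select))(m, k, n)
}
```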

This is also an improvement since we don't need to run detection on every kernel invocation! Repeated detection is just an atomic load and compare, so it's pretty cheap, but still.

This change slightly increases the amount of compiled code, due to instantiating the gemm loop with multiple kernels, but the compile-time difference is relatively small (about +25% in a crate that already compiles very quickly: under 3 seconds in release mode on my configuration).

We also add some ad-hoc types for static dimensions, so that the matrix packing function is instantiated appropriately for each kernel width, now that kernel widths are more diverse. A hedged sketch of the idea follows; the ConstNum/U4/U8 names here are illustrative:
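```rust
trait ConstNum {
    const VALUE: usize;
}

// One zero-sized type per kernel width.
struct U4;
struct U8;

impl ConstNum for U4 { const VALUE: usize = 4; }
impl ConstNum for U8 { const VALUE: usize = 8; }

/// Pack a `rows x W::VALUE` column-major block into row-major panel
/// order. Monomorphizing per width gives the inner loop a compile-time
/// constant trip count, so the compiler can fully unroll it.
fn pack_panel<W: ConstNum>(dst: &mut [f32], src: &[f32], rows: usize) {
    for r in 0..rows {
        for c in 0..W::VALUE {
            dst[r * W::VALUE + c] = src[c * rows + r];
        }
    }
}
```

A call like `pack_panel::<U8>(&mut dst, &src, rows)` then gets a packing loop specialized for an 8-wide kernel. (Const generics would express the same thing directly today; type-level numbers like these were the standard workaround before they stabilized.)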

SuperFluffy commented 5 years ago

What I am currently unhappy with is that our definitions of KernelAvx or KernelFma (once the other PR is merged) all assume Sandy Bridge as the optimal architecture by hardcoding 8x4 or 8x8 kernels, respectively. These are, however, not necessarily optimal for every architecture. For example, if I recall correctly, Haswell has two FMA execution units while Zen only has one, so the optimal kernels for the two would differ slightly.

The same is true for the kernel sizes themselves, which depend on the latency of the various operations and the number of operations that can be executed simultaneously.
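As a rough back-of-the-envelope illustration (the latency/throughput numbers here are my own, not from this thread): to hide an FMA latency of L cycles on a core that can start F vector FMAs per cycle, the microkernel needs at least L × F independent vector accumulators, i.e. (MR / V) × NR ≥ L × F for vector width V. With 8-lane f32 vectors, an 8×8 kernel keeps 8 accumulator vectors: comfortably enough for one FMA unit with a latency of about 5 cycles (needs ≥ 5), but short of the ~10 needed for two such units. That is exactly why the optimal MR/NR differs per microarchitecture.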

Does it make sense to introduce compile time flags to prefer certain kernels?

bluss commented 5 years ago

@SuperFluffy this PR enables custom MR/NR and all other parameters, so I'm happy with that as a good first step. We can now resolve to any specific GemmKernel at runtime or compile time inside dgemm_kernel::select. And we'll soon move things out into files named by architecture as well.

I'd look at whether and how we can detect microarchitecture- and configuration-specific parameters (cache sizes) at build time or runtime. I'm wary of only two things: optimizing for computers we don't have, which doesn't make much sense to me, and going too deep into this. I don't set out to replace BLAS or BLIS, but to get good performance by relatively simple and maintainable means.

Again, I think the most interesting performance work is what's enabled by default/automatically, since that's what's actually going to be used most of the time. That said, if we add more advanced options, there will need to be a way of enabling/disabling them.

This PR is also important for restoring performance on non-x86 since we can get separate settings for the fallback kernels.

bluss commented 5 years ago

This PR seems to give a very small improvement for sgemm avx on my configuration, probably due to amortizing the feature detection, or to changed inlining decisions.

 name                 old ns/iter  new ns/iter  diff ns/iter  diff % 
 layout_f32_032::ccc  2,025       1,984                -41  -2.02% 
 layout_f32_032::ccf  2,029       1,989                -40  -1.97% 
 layout_f32_032::cfc  2,273       2,239                -34  -1.50% 
 layout_f32_032::cff  2,279       2,247                -32  -1.40% 
 layout_f32_032::fcc  1,773       1,726                -47  -2.65% 
 layout_f32_032::fcf  1,772       1,731                -41  -2.31% 
 layout_f32_032::ffc  2,029       2,040                 11   0.54% 
 layout_f32_032::fff  2,034       2,032                 -2  -0.10% 
 mat_mul_f32::m004    211         210                   -1  -0.47% 
 mat_mul_f32::m006    211         209                   -2  -0.95% 
 mat_mul_f32::m008    164         165                    1   0.61% 
 mat_mul_f32::m012    531         529                   -2  -0.38% 
 mat_mul_f32::m016    488         462                  -26  -5.33% 
 mat_mul_f32::m032    2,060       1,987                -73  -3.54% 
 mat_mul_f32::m064    13,231      13,076              -155  -1.17% 
 mat_mul_f32::m127    95,668      93,970            -1,698  -1.77%

But the best improvement is that each kernel, including the fallback kernel, can now make its own inlining and size decisions, and they can all optimize better (and autovectorize, where applicable).
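To illustrate that last point, here is a hedged sketch of a portable fallback microkernel (illustrative, not the crate's actual code): with constant MR/NR trip counts and a local accumulator tile, each monomorphized instantiation gives the autovectorizer straight-line code it can unroll and vectorize on its own:

```rust
const MR: usize = 4;
const NR: usize = 4;

#[inline(always)]
fn fallback_kernel(k: usize, a: &[f32], b: &[f32], c: &mut [f32; MR * NR]) {
    // the accumulator tile can stay in registers across the whole k-loop
    let mut acc = [[0.0f32; NR]; MR];
    for p in 0..k {
        for i in 0..MR {
            for j in 0..NR {
                acc[i][j] += a[p * MR + i] * b[p * NR + j];
            }
        }
    }
    // write the tile back to C
    for i in 0..MR {
        for j in 0..NR {
            c[i * NR + j] += acc[i][j];
        }
    }
}
```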