bluss / matrixmultiply

General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.
https://docs.rs/matrixmultiply/
Apache License 2.0

Use optimal kernel parameters (architectures, matrix layouts) #34

Open SuperFluffy opened 6 years ago

SuperFluffy commented 6 years ago

I am trying to figure out what to use as optimal kernel parameter for different architectures.

For example, it looks like BLIS is using 8x4 for Sandy Bridge, but 8x6 for Haswell. Why? What led them to this setup? Specifically, since operations are usually on 4 doubles at a time, how does the 6 fit in there? Is Haswell able to execute a _mm256 and a _mm operation at the same time?

Furthermore, if we have non-square kernels like for dgemm, is there a scenario where choosing 4x8 over 8x4 is better?

bluss commented 6 years ago

You must also include src/archparam.rs in this

SuperFluffy commented 5 years ago

This discussion is revealing in terms of how to determine optimal kernel parameters: https://github.com/flame/blis/issues/253

In particular, this states:

@VirtualEarth Turn your attention to Eq. 1:

  m_r * n_r >= N_vec * L_vfma * N_vfma

On Intel Haswell/Broadwell/Skylake/Kabylake, Nvec is 4, Lvfma is 5, and Nvfma is 2. Thus, the register blocksize product mr nr must be at least 40 in order to overcome the floating-point latency. 6 x 8 = 48 more than satisfies this, but 8 x 6 would also be fine. The former is biased toward row-oriented SIMD output while the latter is better for column-oriented SIMD output. In BLIS, it almost never actually matters which one you pick because high-level logic will transpose the entire operation and make the output matrix C appear to be row- or column-stored, depending on the SIMD output "preference" of the microkernel.
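The quoted bound is straightforward to check for candidate kernel shapes. A minimal sketch in Rust (the constants are the Haswell/Broadwell/Skylake values from the quote above, hard-coded for illustration):

```rust
/// Check the register-block lower bound m_r * n_r >= N_vec * L_vfma * N_vfma
/// for a candidate micro-kernel shape. Constants are the f64 values quoted
/// above for Intel Haswell/Broadwell/Skylake.
fn hides_fma_latency(mr: usize, nr: usize) -> bool {
    const N_VEC: usize = 4; // f64 lanes per 256-bit vector
    const L_VFMA: usize = 5; // FMA latency in cycles
    const N_VFMA: usize = 2; // FMA units that can issue per cycle
    mr * nr >= N_VEC * L_VFMA * N_VFMA
}

fn main() {
    assert!(hides_fma_latency(6, 8)); // 48 >= 40
    assert!(hides_fma_latency(8, 6)); // same product, column-biased
    assert!(!hides_fma_latency(8, 4)); // 32 < 40: too small for Haswell
    println!("6x8 and 8x6 satisfy the bound; 8x4 does not");
}
```

This also makes the Sandy Bridge vs Haswell difference concrete: 8x4 satisfies Sandy Bridge's smaller latency product but falls short of Haswell's 40.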

Another interesting bit is the choice of 8x6 over 6x8 (or 8x4 over 4x8 for Sandy Bridge, which is our current implementation), which prefers column- vs row-storage in the C matrix. This then ties in with my question here: https://github.com/bluss/matrixmultiply/issues/31

mratsim commented 5 years ago

Hello there,

I'm the author of Arraymancer, a NumPy + machine learning + deep learning library written from scratch in Nim.

In the past month I've been building a new, faster backend, with a specific focus on matrix multiplication, since MKL, BLAS, and BLIS were limiting my optimisation opportunities (like fusing neural-network activations into GEMM, or merging the im2col pass for convolution into the matrix multiplication).

I've done an extensive review of the literature and added a lot of comments in my tiling code.

The most useful papers are:

[1] Anatomy of High-Performance Matrix Multiplication (Revised) Kazushige Goto, Robert A. Van de Geijn

I also keep some extra links that I didn't have time to sort.

Anyway, in terms of performance I have a generic multi-threaded BLAS (float32, float64, int32, int64) that reaches between 97% and 102% of OpenBLAS on my Broadwell (AVX2 and FMA) laptop, depending on whether the run is multithreaded or serial and on float32 vs float64:

Kernel parameters

Panel parameters

Proper tuning of mc and kc is very important as well. Currently I use an arbitrary 768 bytes for mc and 2048 bytes for kc. In my testing I lost up to 35% performance with other values.

There are various constraints on both; the Goto paper goes quite in-depth into them.
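The roles of mc, kc, and nc are easiest to see in the loop nest they block. A schematic, scalar version of the Goto-style structure (following the paper's loop ordering, without the packing and SIMD micro-kernel that real implementations such as matrixmultiply use):

```rust
/// Schematic Goto-style blocked GEMM: C += A * B, row-major, scalar inner
/// loops. mc/kc/nc are the panel sizes discussed above. A real implementation
/// packs the A and B panels and dispatches to a SIMD micro-kernel instead.
fn gemm_blocked(
    m: usize, n: usize, k: usize,
    a: &[f64], b: &[f64], c: &mut [f64],
    mc: usize, kc: usize, nc: usize,
) {
    for jc in (0..n).step_by(nc) {         // nc columns of B per panel
        for pc in (0..k).step_by(kc) {     // kc: B panel sized for cache residency
            for ic in (0..m).step_by(mc) { // mc: A block sized for cache residency
                for i in ic..(ic + mc).min(m) {
                    for p in pc..(pc + kc).min(k) {
                        let aip = a[i * k + p];
                        for j in jc..(jc + nc).min(n) {
                            c[i * n + j] += aip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let (m, n, k) = (5, 7, 6);
    let a: Vec<f64> = (0..m * k).map(|x| x as f64).collect();
    let b: Vec<f64> = (0..k * n).map(|x| (x % 3) as f64).collect();
    let mut c = vec![0.0; m * n];
    gemm_blocked(m, n, k, &a, &b, &mut c, 2, 3, 4);

    // Reference: naive triple loop.
    let mut r = vec![0.0; m * n];
    for i in 0..m {
        for p in 0..k {
            for j in 0..n {
                r[i * n + j] += a[i * k + p] * b[p * n + j];
            }
        }
    }
    assert_eq!(c, r);
    println!("blocked result matches naive reference");
}
```

The cache constraints the paper derives attach to these loops: the kc-by-nc panel of B should stay resident across the ic loop, and the mc-by-kc block of A across the inner loops.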

mratsim commented 5 years ago

I forgot to add: as mentioned in the paper "Automating the Last Mile", you need to choose your SIMD scheme for the C updates.

You can go for shuffle/permute or broadcast, and balance them to a varying degree.

To find the best balance you need to check which ports those instructions can be issued from. You should also interleave them with FMAs to hide data-fetching latencies.

In my code I use full broadcast and no shuffle/permute, mainly because it was simpler to reason about; I didn't test other configurations.
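The broadcast scheme can be illustrated with a portable stand-in for the SIMD registers: below, a `[f64; 4]` plays the role of one 256-bit AVX register, and `broadcast`/`fmadd` stand in for `_mm256_broadcast_sd`/`_mm256_fmadd_pd`. This is a sketch of the idea, not anyone's actual kernel:

```rust
// Illustration of the broadcast scheme for the C update: each element of the
// A micro-panel column is broadcast into all lanes, then FMA'd against a full
// vector of B. A [f64; 4] stands in for one 256-bit AVX register.
type V4 = [f64; 4];

fn broadcast(x: f64) -> V4 {
    [x; 4]
}

fn fmadd(a: V4, b: V4, c: V4) -> V4 {
    [
        a[0] * b[0] + c[0],
        a[1] * b[1] + c[1],
        a[2] * b[2] + c[2],
        a[3] * b[3] + c[3],
    ]
}

/// One rank-1 update of a 4x4 micro-tile of C: C += a_col * b_row.
/// With broadcasts, each accumulator row updates independently, so the
/// FMAs can be interleaved to hide latency, as noted above.
fn rank1_update(c: &mut [V4; 4], a_col: [f64; 4], b_row: V4) {
    for i in 0..4 {
        c[i] = fmadd(broadcast(a_col[i]), b_row, c[i]);
    }
}

fn main() {
    let mut c = [[0.0; 4]; 4];
    rank1_update(&mut c, [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]);
    assert_eq!(c[0], [10.0, 20.0, 30.0, 40.0]);
    assert_eq!(c[3], [40.0, 80.0, 120.0, 160.0]);
    println!("micro-tile updated via broadcast + FMA");
}
```

A shuffle/permute scheme would instead rearrange lanes of the loaded vectors, trading broadcast-port pressure for shuffle-port pressure, which is why the best mix depends on the port layout of the target core.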

bluss commented 5 years ago

@mratsim Wow, cool to hear from you! Thanks for the links to the papers and for sharing your knowledge!

bluss commented 3 years ago

Issue #59 allows tweaking the NC, MC, KC variables easily at compile-time which is one small step and a model for further compile time tweakability.

emmatyping commented 3 years ago

Another idea, which libxsmm uses, is an autotuner such as https://opentuner.org/. OpenTuner automatically evaluates the best parameters for the architecture the code is compiled on.
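OpenTuner drives the search with much smarter strategies, but the core loop of any autotuner is "time each candidate, keep the fastest". A minimal brute-force sketch in Rust (the workload here is a stand-in, not an actual GEMM):

```rust
use std::time::{Duration, Instant};

/// Minimal exhaustive autotuning sweep: time a kernel once for each candidate
/// (mc, kc) pair and return the fastest. A real autotuner like OpenTuner
/// replaces this brute-force loop with a guided search over a larger space
/// and repeats measurements to reduce noise.
fn autotune<F: Fn(usize, usize)>(candidates: &[(usize, usize)], kernel: F) -> (usize, usize) {
    let mut best = (candidates[0], Duration::MAX);
    for &(mc, kc) in candidates {
        let t = Instant::now();
        kernel(mc, kc);
        let dt = t.elapsed();
        if dt < best.1 {
            best = ((mc, kc), dt);
        }
    }
    best.0
}

fn main() {
    // Stand-in workload whose cost scales with the parameters.
    let work = |mc: usize, kc: usize| {
        let mut acc = 0.0f64;
        for i in 0..(mc * kc * 1000) {
            acc += (i as f64).sqrt();
        }
        std::hint::black_box(acc);
    };
    let best = autotune(&[(64, 256), (128, 512), (256, 1024)], work);
    println!("fastest candidate: mc={}, kc={}", best.0, best.1);
}
```

For matrixmultiply this would fit naturally with the compile-time NC/MC/KC tweakability from #59: sweep values in a build step, then bake the winner in.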