Special packing for complex, specialize packing for avx2

bluss / matrixmultiply

General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.

Apache License 2.0

Complex (cgemm, zgemm):

Use a different pack layout for complex microkernels, which puts the real and imaginary parts in separate rows. This enables much better autovectorization for the fallback kernels.
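To illustrate why a split layout helps, here is a minimal sketch, not the code from this PR. The names `C32`, `pack_split` and `kernel_step` are hypothetical, and the exact panel layout used by the crate may differ. It assumes a column-major `mr x kc` panel of interleaved `[re, im]` pairs and repacks it so that, for each `k`, the `mr` real parts are followed by the `mr` imaginary parts.

```rust
/// Hypothetical complex type for this sketch: `[re, im]`.
type C32 = [f32; 2];

/// Repack an `mr x kc` panel of interleaved complex values.
/// Input: element (i, k) is at `a[k * mr + i]` (assumed layout).
/// Output: for each k, `mr` real parts, then `mr` imaginary parts.
fn pack_split(a: &[C32], mr: usize, kc: usize) -> Vec<f32> {
    assert!(mr > 0);
    assert_eq!(a.len(), mr * kc);
    let mut out = Vec::with_capacity(2 * mr * kc);
    for col in a.chunks_exact(mr) {
        out.extend(col.iter().map(|c| c[0])); // row of real parts
        out.extend(col.iter().map(|c| c[1])); // row of imaginary parts
    }
    out
}

/// One rank-1 update step over split data: acc += a * b, with `b` a single
/// complex scalar. Every array access is unit-stride and there is no
/// re/im shuffling, which is the shape autovectorizers handle well.
fn kernel_step(
    a_re: &[f32],
    a_im: &[f32],
    b: C32,
    acc_re: &mut [f32],
    acc_im: &mut [f32],
) {
    let n = a_re.len();
    assert!(a_im.len() == n && acc_re.len() == n && acc_im.len() == n);
    for i in 0..n {
        acc_re[i] += a_re[i] * b[0] - a_im[i] * b[1];
        acc_im[i] += a_re[i] * b[1] + a_im[i] * b[0];
    }
}
```

With interleaved data the same loop would need to separate real and imaginary lanes inside the kernel; moving that work into the packing step means it is done once per panel rather than once per multiply.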

Also enable an Avx2 + Fma autovectorized kernel.

Performance improvements (all kernels are autovectorized for cgemm and zgemm at this time):

AArch64 NEON (Apple M1): cgemm: +60%, zgemm: +10%
Fma + Avx (Intel Tiger Lake): cgemm: +143%
Fma + Avx2 (Intel Tiger Lake): cgemm: +395% (new), zgemm: +77% (new)

Float (sgemm, dgemm):

Now that the kernels can select their own packing functions, instantiate an avx2 version of the general packing function for sgemm and dgemm.
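The general technique can be sketched as follows. This is an illustration under stated assumptions, not the crate's actual code: `pack_impl`, `pack_avx2` and `pack` are hypothetical names, the packing loop is simplified, and selection is shown as a run-time check rather than a per-kernel choice. The idea is that an `#[inline(always)]` generic body gets inlined into a `#[target_feature(enable = "avx2")]` wrapper, so the compiler can generate an AVX2 copy of the same loop.

```rust
/// Simplified general packing loop: copy a `rows x cols` strided matrix
/// into `dst`, column by column.
#[inline(always)]
fn pack_impl(
    dst: &mut [f32],
    src: &[f32],
    rows: usize,
    cols: usize,
    row_stride: usize,
    col_stride: usize,
) {
    let mut d = 0;
    for j in 0..cols {
        for i in 0..rows {
            dst[d] = src[i * row_stride + j * col_stride];
            d += 1;
        }
    }
}

/// Same body, instantiated with AVX2 enabled for code generation.
/// Unsafe because it must only be called on CPUs that support AVX2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn pack_avx2(
    dst: &mut [f32],
    src: &[f32],
    rows: usize,
    cols: usize,
    row_stride: usize,
    col_stride: usize,
) {
    pack_impl(dst, src, rows, cols, row_stride, col_stride)
}

/// Pick the AVX2 instantiation when the CPU supports it, else the fallback.
fn pack(
    dst: &mut [f32],
    src: &[f32],
    rows: usize,
    cols: usize,
    row_stride: usize,
    col_stride: usize,
) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: AVX2 support was just verified at run time.
            return unsafe { pack_avx2(dst, src, rows, cols, row_stride, col_stride) };
        }
    }
    pack_impl(dst, src, rows, cols, row_stride, col_stride)
}
```

Both paths produce identical output; only the generated instructions differ, which is why the change shows up as a speedup without any change in results.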

Packing performance matters most for small matrix multiplications; for bigger sizes it is a vanishingly small part of the runtime.

Avx2 (Intel Tiger Lake): sgemm improves 6-15%, dgemm improves 0-8% depending on input layouts. Tested on M, K, N = 32, i.e. a small matrix.

avx2 packing bench

```
 name                         nobeta-avx-before1 ns/iter  nobeta-avx-after1 ns/iter  diff ns/iter   diff %  speedup
 layout_f32_032::nobeta_ccc   1,257                       1,143                              -114   -9.07%   x 1.10
 layout_f32_032::nobeta_ccf   1,251                       1,140                              -111   -8.87%   x 1.10
 layout_f32_032::nobeta_cfc   1,445                       1,259                              -186  -12.87%   x 1.15
 layout_f32_032::nobeta_cff   1,441                       1,255                              -186  -12.91%   x 1.15
 layout_f32_032::nobeta_fcc   1,080                       1,020                               -60   -5.56%   x 1.06
 layout_f32_032::nobeta_fcf   1,074                       1,018                               -56   -5.21%   x 1.06
 layout_f32_032::nobeta_ffc   1,287                       1,147                              -140  -10.88%   x 1.12
 layout_f32_032::nobeta_fff   1,280                       1,142                              -138  -10.78%   x 1.12
 layout_f64_032::nobeta_ccc   1,761                       1,783                                22    1.25%   x 0.99
 layout_f64_032::nobeta_ccf   1,760                       1,776                                16    0.91%   x 0.99
 layout_f64_032::nobeta_cfc   1,920                       1,839                               -81   -4.22%   x 1.04
 layout_f64_032::nobeta_cff   1,914                       1,830                               -84   -4.39%   x 1.05
 layout_f64_032::nobeta_fcc   1,636                       1,581                               -55   -3.36%   x 1.03
 layout_f64_032::nobeta_fcf   1,632                       1,572                               -60   -3.68%   x 1.04
 layout_f64_032::nobeta_ffc   1,766                       1,634                              -132   -7.47%   x 1.08
 layout_f64_032::nobeta_fff   1,760                       1,627                              -133   -7.56%   x 1.08
```
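For reading the benchmark columns, the arithmetic appears to be as follows: `diff %` is the change relative to the "before" time, and `speedup` is "before" divided by "after". The helper names below are hypothetical; the formulas are inferred from the numbers (for example 1,257 → 1,143 ns/iter gives about -9.07% and x 1.10), not taken from the benchmark tool's documentation.

```rust
/// Relative change in percent; negative means the "after" run is faster.
fn diff_percent(before_ns: f64, after_ns: f64) -> f64 {
    (after_ns - before_ns) / before_ns * 100.0
}

/// Speedup factor; values above 1.0 mean the "after" run is faster.
fn speedup(before_ns: f64, after_ns: f64) -> f64 {
    before_ns / after_ns
}
```

Note that the two columns are not mirror images: a 12.9% reduction in time corresponds to a 1.15x speedup, not 1.129x.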

Special packing for complex, specialize packing for avx2 #75