bluss / matrixmultiply

General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.
https://docs.rs/matrixmultiply/
Apache License 2.0

Use std::arch SIMD and runtime target feature detection #22

Closed · bluss closed 5 years ago

bluss commented 5 years ago

Improve performance out of the box on x86-64 by using target_feature. :1234: :fire:

Implement a fast 8x8 sgemm kernel for x86-64 using std::arch intrinsics, while maintaining portability with other platforms.

  1. Multiple target_feature invocations of the same implementation allow us to compile multiple versions of the code and jump into the best one at runtime. The same SIMD intrinsics or plain Rust code compile to better machine code inside a #[target_feature(enable = "avx")] function than inside an SSE-only one, for example (see the dispatch sketch after this list).

  2. Port a BLIS sgemm 8x8 x86 AVX kernel to Rust intrinsics, using the same memory layout in the main loop (see the kernel fragment after this list).

  3. Use std::alloc for aligned allocation

  4. Runtime target feature detection and target-feature-specific compilation allow us to offer native performance out of the box on x86-64 (no compiler flags needed)

  5. Test-compile to multiple targets (x86-64, x86, aarch64) and run the x86 ones on Travis

  6. In the f32 AVX kernel, schedule loads one iteration ahead; this improves throughput
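
For illustration, a minimal sketch of the dispatch pattern points 1 and 4 describe (function names hypothetical, not the crate's actual internals):

```rust
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn dot_accumulate(a: &[f32], b: &[f32], c: &mut [f32]) {
    if is_x86_feature_detected!("avx") {
        // Safe: AVX support was just verified at runtime.
        unsafe { dot_accumulate_avx(a, b, c) }
    } else {
        dot_accumulate_body(a, b, c)
    }
}

// The same body, compiled again with AVX enabled: codegen can use
// 256-bit registers here even in a generic (no -C target-cpu) build.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn dot_accumulate_avx(a: &[f32], b: &[f32], c: &mut [f32]) {
    dot_accumulate_body(a, b, c)
}

#[inline(always)]
fn dot_accumulate_body(a: &[f32], b: &[f32], c: &mut [f32]) {
    for ((c, &a), &b) in c.iter_mut().zip(a).zip(b) {
        *c += a * b;
    }
}
```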
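
And a rough sketch of the kind of inner-loop step points 2 and 6 describe; the real ported BLIS kernel is unrolled and considerably more involved:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// One rank-1 update of an 8x8 f32 tile held in eight AVX registers:
// broadcast each element of a packed column of A and multiply it
// against a packed row of B. Illustration only.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn tile_update_8x8(a: *const f32, b: *const f32, acc: &mut [__m256; 8]) {
    // Issue the load of B before the dependent arithmetic; the real
    // kernel goes further and loads one iteration ahead so the load
    // overlaps with the previous step's multiplies (point 6).
    let bv = _mm256_loadu_ps(b);
    for i in 0..8 {
        let ai = _mm256_set1_ps(*a.add(i)); // broadcast A element
        // acc[i] += A[i] * B_row, eight lanes at a time (plain AVX
        // has no FMA, so multiply then add, as on Sandy/Ivy Bridge).
        acc[i] = _mm256_add_ps(acc[i], _mm256_mul_ps(ai, bv));
    }
}
```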

Fixes #14

bluss commented 5 years ago

Benchmarks. Comparing a target-feature=+avx build of current master with the "generic" build of this PR, and we still see improvements! This is tested on my AVX-enabled Ivy Bridge laptop.

We see some overhead at the small sizes for sgemm: sgemm now requests a 32-byte-aligned packing buffer from the system allocator, and apparently this goes through posix_memalign and is a bit slow.
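
For reference, a minimal sketch of such an aligned buffer via std::alloc (hypothetical type, not the crate's actual packing-buffer code):

```rust
use std::alloc::{alloc, dealloc, Layout};

// A 32-byte-aligned scratch buffer for packed operands; 32 bytes
// matches AVX's 256-bit registers.
struct PackBuffer {
    ptr: *mut u8,
    layout: Layout,
}

impl PackBuffer {
    fn new(size: usize) -> Self {
        assert!(size > 0);
        let layout = Layout::from_size_align(size, 32).unwrap();
        // On many Unix systems an over-aligned request like this is
        // served via posix_memalign, which is the overhead seen above.
        let ptr = unsafe { alloc(layout) };
        assert!(!ptr.is_null(), "allocation failure");
        PackBuffer { ptr, layout }
    }
}

impl Drop for PackBuffer {
    fn drop(&mut self) {
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```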

before: master native-cpu build
after: this PR generic build

 name               before      after       diff ns/iter   diff % 
 mat_mul_f32::m004  138         231                   93   67.39% 
 mat_mul_f32::m007  169         271                  102   60.36% 
 mat_mul_f32::m008  180         218                   38   21.11% 
 mat_mul_f32::m009  371         489                  118   31.81% 
 mat_mul_f32::m012  458         586                  128   27.95% 
 mat_mul_f32::m016  601         544                  -57   -9.48% 
 mat_mul_f32::m032  2,984       2,451               -533  -17.86% 
 mat_mul_f32::m064  18,254      15,384            -2,870  -15.72% 
 mat_mul_f32::m127  123,472     109,885          -13,587  -11.00% 
 mat_mul_f64::m004  134         130                   -4   -2.99% 
 mat_mul_f64::m007  208         210                    2    0.96% 
 mat_mul_f64::m008  220         230                   10    4.55% 
 mat_mul_f64::m009  417         424                    7    1.68% 
 mat_mul_f64::m012  513         550                   37    7.21% 
 mat_mul_f64::m016  789         827                   38    4.82% 
 mat_mul_f64::m032  4,395       4,347                -48   -1.09% 
 mat_mul_f64::m064  28,604      28,099              -505   -1.77% 
 mat_mul_f64::m127  207,470     199,336           -8,134   -3.92% 

The most exciting part is having these gains in a "generic" build where target feature detection takes care of it. The improvement compared with master's generic build is quite big:

before: master generic build
after: this PR generic build

 name               before      after       diff ns/iter   diff % 
 mat_mul_f32::m127  200,461     109,885          -90,576  -45.18% 
 mat_mul_f64::m127  370,069     199,336         -170,733  -46.14%

bluss commented 5 years ago

Improved benchmarks for sgemm.

Fixed an inflated-allocation-size bug and brought down the overhead on small matrices. Scheduling data loads early gained another 10% speedup on the bigger matrices. Benchmarks are still on an AVX Ivy Bridge laptop.

Incremental improvements from the previous benchmark to the current state of the PR:

before: this PR generic build (previously posted benchmark)
after: this PR generic build

 name               before      after       diff ns/iter   diff % 
 mat_mul_f32::m004  231         191                  -40  -17.32% 
 mat_mul_f32::m008  218         195                  -23  -10.55% 
 mat_mul_f32::m012  586         602                   16    2.73% 
 mat_mul_f32::m016  544         526                  -18   -3.31% 
 mat_mul_f32::m032  2,451       2,259               -192   -7.83% 
 mat_mul_f32::m064  15,384      13,635            -1,749  -11.37% 
 mat_mul_f32::m127  109,885     97,045           -12,840  -11.68%

Benchmarks compared with master:

before: master native-cpu build
after: this PR generic build (the PR at this time)

 name               before      after       diff ns/iter   diff % 
 mat_mul_f32::m004  138         191                   53   38.41% 
 mat_mul_f32::m008  180         195                   15    8.33% 
 mat_mul_f32::m012  458         602                  144   31.44% 
 mat_mul_f32::m016  601         526                  -75  -12.48% 
 mat_mul_f32::m032  2,984       2,259               -725  -24.30% 
 mat_mul_f32::m064  18,254      13,635            -4,619  -25.30% 
 mat_mul_f32::m127  123,472     97,045           -26,427  -21.40%

bluss commented 5 years ago

Benchmarks vs OpenBLAS's single-threaded mode, and the new sgemm (f32) kernel wins, which is pretty shocking! :fire: For the (autovectorized) f64 kernel we "compete". Using libopenblas-base version 0.2.19 from Debian.

test blas_mat_mul_f32::m064 ... bench:      17,496 ns/iter (+/- 322)
test blas_mat_mul_f32::m127 ... bench:     126,223 ns/iter (+/- 4,609)
test blas_mat_mul_f64::m064 ... bench:      25,466 ns/iter (+/- 319)
test blas_mat_mul_f64::m127 ... bench:     188,728 ns/iter (+/- 24,764)
test mat_mul_f32::m064      ... bench:      14,021 ns/iter (+/- 49)
test mat_mul_f32::m127      ... bench:      98,657 ns/iter (+/- 2,477)
test mat_mul_f64::m064      ... bench:      27,382 ns/iter (+/- 2,574)
test mat_mul_f64::m127      ... bench:     204,120 ns/iter (+/- 4,243)

"m064" corresponds to an all row major 64×64×64 matrix product and "m127" to 127×127×127

There are several caveats, and I'm sure an OpenBLAS author would know the causes much better than I do: for example, the choice of blocking strategy versus available cache sizes, or the size of matrix multiplication problem one optimizes for. I'm also not testing on a beefy machine but on a laptop.

bluss commented 5 years ago

:100: