Closed — bluss closed this 5 years ago
Benchmarks: comparing a `target-feature=+avx` build of current master with the "generic" build of this PR, and we still see improvements! This is tested on my AVX-enabled Ivy Bridge laptop.
We see some overhead at the small sizes for sgemm, because sgemm now requests a 32-byte-aligned packing buffer from the system allocator; apparently this goes through `posix_memalign` and is a bit slow.
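For reference, a minimal sketch of the kind of aligned request involved, using `std::alloc` (the buffer size here is illustrative, not the crate's actual packing size):

```rust
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    // Request a packing buffer aligned to 32 bytes (one AVX vector width).
    let size = 256 * std::mem::size_of::<f32>();
    let layout = Layout::from_size_align(size, 32).expect("invalid layout");
    unsafe {
        let ptr = alloc(layout);
        assert!(!ptr.is_null());
        // The returned pointer satisfies the requested alignment.
        assert_eq!(ptr as usize % 32, 0);
        dealloc(ptr, layout);
    }
}
```

On glibc this alignment request is typically serviced via `posix_memalign`, which is where the small-size overhead comes from.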
```
before: master native-cpu build
after:  this PR generic build

name               before ns/iter  after ns/iter  diff ns/iter   diff %
mat_mul_f32::m004             138            231            93   67.39%
mat_mul_f32::m007             169            271           102   60.36%
mat_mul_f32::m008             180            218            38   21.11%
mat_mul_f32::m009             371            489           118   31.81%
mat_mul_f32::m012             458            586           128   27.95%
mat_mul_f32::m016             601            544           -57   -9.48%
mat_mul_f32::m032           2,984          2,451          -533  -17.86%
mat_mul_f32::m064          18,254         15,384        -2,870  -15.72%
mat_mul_f32::m127         123,472        109,885       -13,587  -11.00%
mat_mul_f64::m004             134            130            -4   -2.99%
mat_mul_f64::m007             208            210             2    0.96%
mat_mul_f64::m008             220            230            10    4.55%
mat_mul_f64::m009             417            424             7    1.68%
mat_mul_f64::m012             513            550            37    7.21%
mat_mul_f64::m016             789            827            38    4.82%
mat_mul_f64::m032           4,395          4,347           -48   -1.09%
mat_mul_f64::m064          28,604         28,099          -505   -1.77%
mat_mul_f64::m127         207,470        199,336        -8,134   -3.92%
```
The most exciting part is having these gains in a "generic" build where target feature detection takes care of it. The improvement compared with master's generic build is quite big:
```
before: master generic build
after:  this PR generic build

name               before ns/iter  after ns/iter  diff ns/iter   diff %
mat_mul_f32::m127         200,461        109,885       -90,576  -45.18%
mat_mul_f64::m127         370,069        199,336      -170,733  -46.14%
```
Improved benchmarks for sgemm.
Fixed an inflated allocation size bug and brought down the overhead on small matrices. Scheduling data loads early got another 10% speedup on the bigger matrices. Benchmarks are still on an AVX Ivy Bridge laptop.
Incremental improvements from the previous benchmark to this state of the PR:
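The load-scheduling idea can be shown as a simplified scalar sketch (the real kernel does this with AVX vectors; this toy `dot_pipelined` function is just an illustration of the pattern):

```rust
// Fetch the values for iteration i+1 before doing the arithmetic for
// iteration i, giving the CPU more room to overlap loads with math.
fn dot_pipelined(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    if a.is_empty() {
        return 0.0;
    }
    let (mut xa, mut xb) = (a[0], b[0]);
    let mut acc = 0.0;
    for i in 1..a.len() {
        // Issue the next loads before using the current values.
        let (na, nb) = (a[i], b[i]);
        acc += xa * xb;
        xa = na;
        xb = nb;
    }
    acc + xa * xb
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [1.0f32, 1.0, 1.0, 1.0];
    assert_eq!(dot_pipelined(&a, &b), 10.0);
}
```

In the vectorized kernel the same rotation is done with registers of eight `f32` lanes, so the loads for the next column are in flight while the FMAs for the current one execute.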
```
before: this PR generic build (previously posted benchmark)
after:  this PR generic build

name               before ns/iter  after ns/iter  diff ns/iter   diff %
mat_mul_f32::m004             231            191           -40  -17.32%
mat_mul_f32::m008             218            195           -23  -10.55%
mat_mul_f32::m012             586            602            16    2.73%
mat_mul_f32::m016             544            526           -18   -3.31%
mat_mul_f32::m032           2,451          2,259          -192   -7.83%
mat_mul_f32::m064          15,384         13,635        -1,749  -11.37%
mat_mul_f32::m127         109,885         97,045       -12,840  -11.68%
```
Benchmarks compared with master:
```
before: master native-cpu build
after:  this PR generic build (PR at this time)

name               before ns/iter  after ns/iter  diff ns/iter   diff %
mat_mul_f32::m004             138            191            53   38.41%
mat_mul_f32::m008             180            195            15    8.33%
mat_mul_f32::m012             458            602           144   31.44%
mat_mul_f32::m016             601            526           -75  -12.48%
mat_mul_f32::m032           2,984          2,259          -725  -24.30%
mat_mul_f32::m064          18,254         13,635        -4,619  -25.30%
mat_mul_f32::m127         123,472         97,045       -26,427  -21.40%
```
Benchmarks vs OpenBLAS in single-threaded mode, and the new sgemm (f32) kernel wins, which is pretty shocking! :fire: With the (autovectorized) f64 kernel we merely "compete". Using libopenblas-base version 0.2.19 from Debian.
```
test blas_mat_mul_f32::m064 ... bench:      17,496 ns/iter (+/- 322)
test blas_mat_mul_f32::m127 ... bench:     126,223 ns/iter (+/- 4,609)
test blas_mat_mul_f64::m064 ... bench:      25,466 ns/iter (+/- 319)
test blas_mat_mul_f64::m127 ... bench:     188,728 ns/iter (+/- 24,764)
test mat_mul_f32::m064      ... bench:      14,021 ns/iter (+/- 49)
test mat_mul_f32::m127      ... bench:      98,657 ns/iter (+/- 2,477)
test mat_mul_f64::m064      ... bench:      27,382 ns/iter (+/- 2,574)
test mat_mul_f64::m127      ... bench:     204,120 ns/iter (+/- 4,243)
```
"m064" corresponds to an all row major 64×64×64 matrix product and "m127" to 127×127×127
There should be several caveats, and I'm sure an openBlas author would know the cause much better than me. For example, the choice of blocking strategy vs available cache sizes or the size of matrix multiplication problem you optimize for. I'm not testing on a beefy machine but a laptop.
:100:
Improve performance out of the box on x86-64 by using `target_feature`. :1234: :fire:

- Implement a fast 8×8 sgemm kernel for x86-64 using `std::arch` intrinsics, while maintaining portability with other platforms.
- Multiple `target_feature` invocations of the same implementation allow us to compile several versions of the code and jump into the best one at runtime. The same SIMD intrinsics or Rust code will, for example, compile to better code inside a `target_feature(enable = "avx")` function than inside an SSE one.
- Port a BLIS sgemm 8×8 x86 AVX kernel to Rust intrinsics, using the same memory layout in the main loop.
- Use `std::alloc` for aligned allocation.
- Runtime target feature detection and target-feature-specific compilation allow us to offer native performance out of the box on x86-64 (no compiler flags needed).
- Test-compile to multiple targets (x86-64, x86, aarch64) and run the x86 ones on Travis.
- In the f32 AVX kernel, schedule loads one iteration ahead; this improves throughput.

Fixes #14
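A minimal sketch of the runtime-dispatch pattern described above — the same kernel body compiled twice, once with AVX enabled, with the detected feature picking the version at runtime. Function names here are illustrative, not the crate's actual internals:

```rust
#[inline(always)]
fn kernel_body(a: &[f32], b: &[f32], c: &mut [f32]) {
    // Toy kernel: elementwise multiply-accumulate.
    for ((c, &a), &b) in c.iter_mut().zip(a).zip(b) {
        *c += a * b;
    }
}

// Same code, but compiled in an AVX-enabled context, where it can
// autovectorize to wider instructions.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn kernel_avx(a: &[f32], b: &[f32], c: &mut [f32]) {
    kernel_body(a, b, c)
}

fn kernel(a: &[f32], b: &[f32], c: &mut [f32]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            // Safe: we just verified that AVX is available at runtime.
            return unsafe { kernel_avx(a, b, c) };
        }
    }
    // Portable fallback for other platforms or older CPUs.
    kernel_body(a, b, c)
}

fn main() {
    let a = [1.0f32, 2.0];
    let b = [3.0f32, 4.0];
    let mut c = [0.0f32; 2];
    kernel(&a, &b, &mut c);
    assert_eq!(c, [3.0, 8.0]);
}
```

Because the dispatch happens once at the kernel entry point, no compiler flags (`-C target-cpu=native` or `-C target-feature=+avx`) are needed to get the fast path.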