This looks nice. I think it's important that we set `always_masked` to false here, so that it can reach its true potential and read/write directly to the C matrix. Would be interesting to see benchmarks in comparison with the fallback version.
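To spell out what I mean (a minimal sketch with made-up names, not the crate's actual API): with `always_masked == false`, a full-size block no longer has to round-trip through the stack buffer, and only partial edge blocks still need the mask.

```rust
// Hypothetical sketch of the write-back dispatch; names are illustrative.
const MR: usize = 4; // rows of the register block
const NR: usize = 4; // columns of the register block

/// Copy an MR x NR block of micro-kernel results into C.
unsafe fn write_back(
    always_masked: bool,
    rows: usize,             // valid rows (== MR except at the matrix edge)
    cols: usize,             // valid columns (== NR except at the matrix edge)
    block: &[[f64; NR]; MR], // kernel results; the real kernel fills its own buffer
    c: *mut f64,
    rsc: isize,              // row stride of C
    csc: isize,              // column stride of C
) {
    if !always_masked && rows == MR && cols == NR {
        // Full block and masking not forced: the kernel could have written
        // these values straight into C, with no intermediate buffer at all.
        for i in 0..MR {
            for j in 0..NR {
                *c.offset(i as isize * rsc + j as isize * csc) = block[i][j];
            }
        }
    } else {
        // Edge block: only the rows x cols part exists in C, so the masked
        // buffer is read back selectively.
        for i in 0..rows {
            for j in 0..cols {
                *c.offset(i as isize * rsc + j as isize * csc) = block[i][j];
            }
        }
    }
}
```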
I'd focus on the regular "cargo bench" build without RUSTFLAGS set first, but it's as you want.
Thanks for pushing the code. I had to look at it :)
The latest commit now supports the special case where matrix C is row-major, i.e. has a column stride of 1 (`csc == 1`).
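Roughly, the special case amounts to something like this (a sketch with made-up names, not the exact code in the commit): when `csc == 1`, each row of the C block is contiguous, so one unaligned vector store replaces four strided scalar writes.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Store one row of 4 f64 results into C. Sketch only; names are made up.
/// `row` holds 4 consecutive elements of a C row; `c_row` points at C[i, 0].
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn store_row(row: __m256d, c_row: *mut f64, csc: isize) {
    if csc == 1 {
        // Row-major C: the 4 elements are contiguous, one vector store.
        _mm256_storeu_pd(c_row, row);
    } else {
        // General stride: spill the register, then scatter element by element.
        let mut tmp = [0.0f64; 4];
        _mm256_storeu_pd(tmp.as_mut_ptr(), row);
        for j in 0..4 {
            *c_row.offset(j as isize * csc) = tmp[j];
        }
    }
}
```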
If I see things correctly, the test cases are all for row-major matrices. This means that the code passes the integration tests! It's only lacking unit tests, currently.
On master, there is a more comprehensive test of various layouts.
Do both of the c/f layout cases pay off in benchmarks?
| name | bench_only_colmajor ns/iter | bench_with_rowmajor ns/iter | diff ns/iter | diff % | speedup |
|:--|--:|--:|--:|--:|--:|
| layout_f64_032::ccc | 3,378 | 3,229 | -149 | -4.41% | x 1.05 |
| layout_f64_032::ccf | 3,121 | 3,179 | 58 | 1.86% | x 0.98 |
| layout_f64_032::cfc | 3,561 | 3,406 | -155 | -4.35% | x 1.05 |
| layout_f64_032::cff | 3,323 | 3,364 | 41 | 1.23% | x 0.99 |
| layout_f64_032::fcc | 3,200 | 3,055 | -145 | -4.53% | x 1.05 |
| layout_f64_032::fcf | 2,949 | 2,980 | 31 | 1.05% | x 0.99 |
| layout_f64_032::ffc | 3,367 | 3,215 | -152 | -4.51% | x 1.05 |
| layout_f64_032::fff | 3,119 | 3,153 | 34 | 1.09% | x 0.99 |
| mat_mul_f64::m004 | 161 | 162 | 1 | 0.62% | x 0.99 |
| mat_mul_f64::m006 | 236 | 245 | 9 | 3.81% | x 0.96 |
| mat_mul_f64::m008 | 261 | 251 | -10 | -3.83% | x 1.04 |
| mat_mul_f64::m012 | 501 | 504 | 3 | 0.60% | x 0.99 |
| mat_mul_f64::m016 | 647 | 600 | -47 | -7.26% | x 1.08 |
| mat_mul_f64::m032 | 3,421 | 3,213 | -208 | -6.08% | x 1.06 |
| mat_mul_f64::m064 | 22,352 | 21,392 | -960 | -4.29% | x 1.04 |
| mat_mul_f64::m127 | 163,414 | 157,834 | -5,580 | -3.41% | x 1.04 |
The actual matrix multiplication got a nice little boost, I think.
You can see that all cases where C is c-major (`::**c` in the benches) have a consistent boost of about 4%. I think that's quite cool!
I rebased this on top of master and squashed commits as I saw fit. I think the remaining ones are general enough that they can stand on their own.
Tests are also passing. Do you have any other requests, @bluss?
That's great, all tests passing is of course the goal. Will merge when I get back home.
I have a plan for how to split up kernel selection better, so that we can have separate sizes and parameters for each feature. I'll implement it some evening when I have time.
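Roughly what I have in mind (a simplified sketch; the crate's real `GemmKernel` trait has more methods and parameters than this): one kernel type per feature level, each free to pick its own register block size, with runtime feature detection choosing between them.

```rust
/// Simplified per-feature kernel selection; illustrative names only.
trait Kernel {
    /// Register block dimensions; each implementation picks its own.
    const MR: usize;
    const NR: usize;
    fn name() -> &'static str;
}

struct KernelAvx;      // e.g. a wider block, using AVX + FMA
struct KernelFallback; // autovectorized code, smaller block

impl Kernel for KernelAvx {
    const MR: usize = 8;
    const NR: usize = 4;
    fn name() -> &'static str { "avx" }
}

impl Kernel for KernelFallback {
    const MR: usize = 4;
    const NR: usize = 4;
    fn name() -> &'static str { "fallback" }
}

fn dgemm_dispatch() {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            return run::<KernelAvx>();
        }
    }
    run::<KernelFallback>()
}

fn run<K: Kernel>() {
    // Packing and blocked loops over K::MR x K::NR tiles would go here.
    println!("selected {} kernel ({} x {})", K::name(), K::MR, K::NR);
}
```

This way the fallback keeps its own tuned parameters instead of inheriting whatever the AVX kernel wants.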
I saw that you starred https://github.com/millardjn/matrixmultiply_mt in my GitHub feed.
I like their code organization. It seems very clean at first glance.
Thanks for this contribution. This improves the dgemm AVX kernel's performance by a lot, especially with the exciting FMA addition.
We also regress the fallback kernel's performance by a lot, and we'll have to look at how to solve that. I think separating the GemmKernel implementations for each of them would be good (roughly along the lines of the sketch above). My proof of concept of that seems to have the intended effect (keeping the special-feature performance while restoring fallback performance), so this is safe to merge...
This is a first working version of the dgemm kernel using AVX intrinsics. I cannot yet see a performance difference between the autovectorized version and this one. Bit of a bummer.
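For context, the heart of such a kernel looks roughly like this (a stripped-down 4x4 sketch with made-up names, not this PR's actual code): broadcast one packed element of A, load four packed elements of B, and accumulate with `_mm256_fmadd_pd`.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Stripped-down 4x4 f64 micro-kernel sketch (real kernels use wider blocks).
/// `a` and `b` point at packed panels: 4 values of A and 4 values of B per
/// k step. C is assumed row-major here (csc == 1).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx", enable = "fma")]
unsafe fn kernel_4x4(k: usize, a: *const f64, b: *const f64, c: *mut f64, rsc: isize) {
    // One accumulator register per row of the C block.
    let mut acc = [_mm256_setzero_pd(); 4];
    for l in 0..k {
        // Load B[l, 0..4] once per k step.
        let bv = _mm256_loadu_pd(b.add(4 * l));
        for i in 0..4 {
            // Broadcast A[i, l] and fuse the multiply-add into row i.
            let av = _mm256_broadcast_sd(&*a.add(4 * l + i));
            acc[i] = _mm256_fmadd_pd(av, bv, acc[i]);
        }
    }
    // Write the block back; with csc == 1 each row is one contiguous store.
    for i in 0..4 {
        _mm256_storeu_pd(c.offset(i as isize * rsc), acc[i]);
    }
}
```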
Todo:
- `_mm256_shuffle_pd` for row-major matrices.
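Presumably the idea is to transpose the register block in-register before storing to a row-major C. For the record, the usual 4x4 f64 transpose built from `_mm256_shuffle_pd` plus `_mm256_permute2f128_pd` looks like this (illustrative sketch, not code from this PR):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Transpose a 4x4 f64 block held in four ymm registers: rows in, columns out.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn transpose_4x4(
    r0: __m256d, r1: __m256d, r2: __m256d, r3: __m256d,
) -> (__m256d, __m256d, __m256d, __m256d) {
    // Interleave 64-bit lanes within each 128-bit half.
    let t0 = _mm256_shuffle_pd::<0x0>(r0, r1); // [r0_0, r1_0, r0_2, r1_2]
    let t1 = _mm256_shuffle_pd::<0xF>(r0, r1); // [r0_1, r1_1, r0_3, r1_3]
    let t2 = _mm256_shuffle_pd::<0x0>(r2, r3); // [r2_0, r3_0, r2_2, r3_2]
    let t3 = _mm256_shuffle_pd::<0xF>(r2, r3); // [r2_1, r3_1, r2_3, r3_3]
    // Exchange 128-bit halves to finish the transpose.
    let c0 = _mm256_permute2f128_pd::<0x20>(t0, t2); // column 0
    let c1 = _mm256_permute2f128_pd::<0x20>(t1, t3); // column 1
    let c2 = _mm256_permute2f128_pd::<0x31>(t0, t2); // column 2
    let c3 = _mm256_permute2f128_pd::<0x31>(t1, t3); // column 3
    (c0, c1, c2, c3)
}
```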