This looks nice. I think it's important that we set `always_masked` to false here, so that it can reach its true potential and read/write directly to the C matrix. Would be interesting to see benchmarks in comparison with the fallback version.
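To spell out what I mean (a minimal sketch with made-up names, not the crate's actual API): with `always_masked == false`, a full-size block no longer has to round-trip through the stack buffer, and only partial edge blocks still need the mask.

```rust
// Hypothetical sketch of the write-back dispatch; names are illustrative.
const MR: usize = 4; // rows of the register block
const NR: usize = 4; // columns of the register block

/// Copy an MR x NR block of micro-kernel results into C.
unsafe fn write_back(
    always_masked: bool,
    rows: usize,             // valid rows (== MR except at the matrix edge)
    cols: usize,             // valid columns (== NR except at the matrix edge)
    block: &[[f64; NR]; MR], // kernel results; the real kernel fills its own buffer
    c: *mut f64,
    rsc: isize,              // row stride of C
    csc: isize,              // column stride of C
) {
    if !always_masked && rows == MR && cols == NR {
        // Full block and masking not forced: the kernel could have written
        // these values straight into C, with no intermediate buffer at all.
        for i in 0..MR {
            for j in 0..NR {
                *c.offset(i as isize * rsc + j as isize * csc) = block[i][j];
            }
        }
    } else {
        // Edge block: only the rows x cols part exists in C, so the masked
        // buffer is read back selectively.
        for i in 0..rows {
            for j in 0..cols {
                *c.offset(i as isize * rsc + j as isize * csc) = block[i][j];
            }
        }
    }
}
```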
I'd focus on the regular "cargo bench" build without RUSTFLAGS set first, but it's as you want.
Thanks for pushing the code. I had to look at it :)
The latest commit now supports the special case where matrix C is row-major, i.e. has a column stride of 1 (`csc == 1`).
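Roughly, the special case amounts to something like this (a sketch with made-up names, not the exact code in the commit): when `csc == 1`, each row of the C block is contiguous, so one unaligned vector store replaces four strided scalar writes.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Store one row of 4 f64 results into C. Sketch only; names are made up.
/// `row` holds 4 consecutive elements of a C row; `c_row` points at C[i, 0].
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn store_row(row: __m256d, c_row: *mut f64, csc: isize) {
    if csc == 1 {
        // Row-major C: the 4 elements are contiguous, one vector store.
        _mm256_storeu_pd(c_row, row);
    } else {
        // General stride: spill the register, then scatter element by element.
        let mut tmp = [0.0f64; 4];
        _mm256_storeu_pd(tmp.as_mut_ptr(), row);
        for j in 0..4 {
            *c_row.offset(j as isize * csc) = tmp[j];
        }
    }
}
```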
If I see things correctly, the test cases are all for row-major matrices. This means that the code passes the integration tests! It's only lacking unit tests, currently.
On master, there is a more comprehensive test of various layouts.
Do both of the c/f layout cases pay off in benchmarks?
| name | bench_only_colmajor ns/iter | bench_with_rowmajor ns/iter | diff ns/iter | diff % | speedup |
|:--|--:|--:|--:|--:|--:|
| layout_f64_032::ccc | 3,378 | 3,229 | -149 | -4.41% | x 1.05 |
| layout_f64_032::ccf | 3,121 | 3,179 | 58 | 1.86% | x 0.98 |
| layout_f64_032::cfc | 3,561 | 3,406 | -155 | -4.35% | x 1.05 |
| layout_f64_032::cff | 3,323 | 3,364 | 41 | 1.23% | x 0.99 |
| layout_f64_032::fcc | 3,200 | 3,055 | -145 | -4.53% | x 1.05 |
| layout_f64_032::fcf | 2,949 | 2,980 | 31 | 1.05% | x 0.99 |
| layout_f64_032::ffc | 3,367 | 3,215 | -152 | -4.51% | x 1.05 |
| layout_f64_032::fff | 3,119 | 3,153 | 34 | 1.09% | x 0.99 |
| mat_mul_f64::m004 | 161 | 162 | 1 | 0.62% | x 0.99 |
| mat_mul_f64::m006 | 236 | 245 | 9 | 3.81% | x 0.96 |
| mat_mul_f64::m008 | 261 | 251 | -10 | -3.83% | x 1.04 |
| mat_mul_f64::m012 | 501 | 504 | 3 | 0.60% | x 0.99 |
| mat_mul_f64::m016 | 647 | 600 | -47 | -7.26% | x 1.08 |
| mat_mul_f64::m032 | 3,421 | 3,213 | -208 | -6.08% | x 1.06 |
| mat_mul_f64::m064 | 22,352 | 21,392 | -960 | -4.29% | x 1.04 |
| mat_mul_f64::m127 | 163,414 | 157,834 | -5,580 | -3.41% | x 1.04 |
The actual matrix multiplication got a nice little boost, I think.
You can see that all cases where C is c-major (`::**c` in the benches) have a consistent boost of about 4%. I think that's quite cool!
I rebased this on top of master and squashed commits as I saw fit. I think the remaining ones are general enough that they can stand on their own.
Tests are also passing. Do you have any other requests, @bluss?
That's great, all tests passing is of course the goal. Will merge when I get back home.
I have a plan for how to split up kernel selection better, so that we can have separate sizes and parameters for each feature. I'll implement it some evening when I have time.
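Roughly what I have in mind (a simplified sketch; the crate's real `GemmKernel` trait has more methods and parameters than this): one kernel type per feature level, each free to pick its own register block size, with runtime feature detection choosing between them.

```rust
/// Simplified per-feature kernel selection; illustrative names only.
trait Kernel {
    /// Register block dimensions; each implementation picks its own.
    const MR: usize;
    const NR: usize;
    fn name() -> &'static str;
}

struct KernelAvx;      // e.g. a wider block, using AVX + FMA
struct KernelFallback; // autovectorized code, smaller block

impl Kernel for KernelAvx {
    const MR: usize = 8;
    const NR: usize = 4;
    fn name() -> &'static str { "avx" }
}

impl Kernel for KernelFallback {
    const MR: usize = 4;
    const NR: usize = 4;
    fn name() -> &'static str { "fallback" }
}

fn dgemm_dispatch() {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            return run::<KernelAvx>();
        }
    }
    run::<KernelFallback>()
}

fn run<K: Kernel>() {
    // Packing and blocked loops over K::MR x K::NR tiles would go here.
    println!("selected {} kernel ({} x {})", K::name(), K::MR, K::NR);
}
```

This way the fallback keeps its own tuned parameters instead of inheriting whatever the AVX kernel wants.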
I saw that you starred https://github.com/millardjn/matrixmultiply_mt in my GitHub feed.
I like their code organization. It seems very clean at first glance.
Thanks for this contribution. This improves the dgemm AVX kernel's performance by a lot, especially with the exciting FMA addition.
We also regress the fallback kernel's performance by a lot, and we'll have to look at how to solve that. I think separating the GemmKernel implementations for each of them would be good (roughly along the lines of the sketch above). My proof of concept of that seems to have the intended effect (keeping the special-feature performance while restoring fallback performance), so this is safe to merge...
This is a first working version of the dgemm kernel using AVX intrinsics. I cannot yet see a performance difference between the autovectorized version and this one. Bit of a bummer.
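For context, the heart of such a kernel looks roughly like this (a stripped-down 4x4 sketch with made-up names, not this PR's actual code): broadcast one packed element of A, load four packed elements of B, and accumulate with `_mm256_fmadd_pd`.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Stripped-down 4x4 f64 micro-kernel sketch (real kernels use wider blocks).
/// `a` and `b` point at packed panels: 4 values of A and 4 values of B per
/// k step. C is assumed row-major here (csc == 1).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx", enable = "fma")]
unsafe fn kernel_4x4(k: usize, a: *const f64, b: *const f64, c: *mut f64, rsc: isize) {
    // One accumulator register per row of the C block.
    let mut acc = [_mm256_setzero_pd(); 4];
    for l in 0..k {
        // Load B[l, 0..4] once per k step.
        let bv = _mm256_loadu_pd(b.add(4 * l));
        for i in 0..4 {
            // Broadcast A[i, l] and fuse the multiply-add into row i.
            let av = _mm256_broadcast_sd(&*a.add(4 * l + i));
            acc[i] = _mm256_fmadd_pd(av, bv, acc[i]);
        }
    }
    // Write the block back; with csc == 1 each row is one contiguous store.
    for i in 0..4 {
        _mm256_storeu_pd(c.offset(i as isize * rsc), acc[i]);
    }
}
```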
Todo:
- `_mm256_shuffle_pd` for row-major matrices.
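Presumably the idea is to transpose the register block in-register before storing to a row-major C. For the record, the usual 4x4 f64 transpose built from `_mm256_shuffle_pd` plus `_mm256_permute2f128_pd` looks like this (illustrative sketch, not code from this PR):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Transpose a 4x4 f64 block held in four ymm registers: rows in, columns out.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn transpose_4x4(
    r0: __m256d, r1: __m256d, r2: __m256d, r3: __m256d,
) -> (__m256d, __m256d, __m256d, __m256d) {
    // Interleave 64-bit lanes within each 128-bit half.
    let t0 = _mm256_shuffle_pd::<0x0>(r0, r1); // [r0_0, r1_0, r0_2, r1_2]
    let t1 = _mm256_shuffle_pd::<0xF>(r0, r1); // [r0_1, r1_1, r0_3, r1_3]
    let t2 = _mm256_shuffle_pd::<0x0>(r2, r3); // [r2_0, r3_0, r2_2, r3_2]
    let t3 = _mm256_shuffle_pd::<0xF>(r2, r3); // [r2_1, r3_1, r2_3, r3_3]
    // Exchange 128-bit halves to finish the transpose.
    let c0 = _mm256_permute2f128_pd::<0x20>(t0, t2); // column 0
    let c1 = _mm256_permute2f128_pd::<0x20>(t1, t3); // column 1
    let c2 = _mm256_permute2f128_pd::<0x31>(t0, t2); // column 2
    let c3 = _mm256_permute2f128_pd::<0x31>(t1, t3); // column 3
    (c0, c1, c2, c3)
}
```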