bluss / matrixmultiply

General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.
https://docs.rs/matrixmultiply/
Apache License 2.0

Add sgemm and dgemm asm microkernels from BLIS #3

Closed bluss closed 3 years ago

bluss commented 8 years ago

Add sgemm and dgemm asm microkernels from BLIS

Based on PR #2.

These go into their own small subcrate, so that the build script is easy to skip and the licensing is easier to explain.

The BLIS project uses a BSD 3-clause license.

The asm kernels will be opt-in and require a C compiler to build; the gcc crate handles this fine. Please file a PR if you can patch this to work on more platforms.

The asm implementations improve the f32 case in particular. This can be explained by the f32 kernel jumping from a 4-by-4 microkernel in Rust to an 8-by-8 in asm, while the f64 case uses the same 8-by-4 size for both Rust and asm.

I did experiment with an 8-by-8 microkernel in plain Rust, but it was no improvement over 4-by-4.
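For readers unfamiliar with the term, here is a minimal sketch of what a 4-by-4 microkernel does. It is not the code from this crate or from BLIS: the names `MR`, `NR` and `kernel_4x4`, the packed panel layout, and the use of slices with `usize` strides are all assumptions made for illustration. The kernel accumulates a 4-by-4 block of the product from packed panels of A and B, then writes it into C using general row and column strides.

```rust
// Illustrative only: MR, NR and kernel_4x4 are hypothetical names,
// not the crate's actual API.
const MR: usize = 4; // rows of the C block computed per kernel call
const NR: usize = 4; // columns of the C block computed per kernel call

/// Computes C <- alpha * A * B + beta * C for one MR-by-NR block of C.
///
/// `a` is a packed panel: k groups of MR values (one column of A each).
/// `b` is a packed panel: k groups of NR values (one row of B each).
/// Element (i, j) of the C block lives at index `i * rsc + j * csc`.
fn kernel_4x4(
    k: usize,
    alpha: f32,
    a: &[f32],
    b: &[f32],
    beta: f32,
    c: &mut [f32],
    rsc: usize,
    csc: usize,
) {
    assert!(a.len() >= k * MR && b.len() >= k * NR);

    // Accumulate the block in a local array; an optimized kernel
    // would keep these values in SIMD registers.
    let mut ab = [[0.0f32; NR]; MR];
    for p in 0..k {
        let a_col = &a[p * MR..(p + 1) * MR];
        let b_row = &b[p * NR..(p + 1) * NR];
        // Rank-1 update: ab += a_col * b_row^T
        for i in 0..MR {
            for j in 0..NR {
                ab[i][j] += a_col[i] * b_row[j];
            }
        }
    }

    // Scale and store into C through the general strides.
    for i in 0..MR {
        for j in 0..NR {
            let idx = i * rsc + j * csc;
            c[idx] = alpha * ab[i][j] + beta * c[idx];
        }
    }
}
```

Passing `rsc = 4, csc = 1` writes the block in row-major order and `rsc = 1, csc = 4` in column-major order. An 8-by-8 kernel has the same shape with `MR = NR = 8`, so it needs four times as many accumulators.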

name                           rust.log ns/iter  avxasm.log ns/iter    diff ns/iter   diff %
mat_mul_f32::m127              207,187           135,118                    -72,069  -34.78%
mat_mul_f32::mix128x10000x128  16,396,382        10,856,725              -5,539,657  -33.79%
mat_mul_f64::m016              1,103             1,044                          -59   -5.35%
mat_mul_f64::m064              38,887            35,027                      -3,860   -9.93%
mat_mul_f64::m127              279,235           249,761                    -29,474  -10.56%
mat_mul_f64::mix128x10000x128  22,934,117        20,833,112              -2,101,005   -9.16%

bluss commented 8 years ago

The smaller performance difference for f64 does not mean our Rust kernel is close to optimal; rather, the BLIS asm does not reach what OpenBLAS achieves on the same benchmark problem.

bluss commented 8 years ago

Not sure how to handle licensing. I will probably break these out into their own crate; that's nicer.

bluss commented 5 years ago

This PR is still here for future reference. It is unlikely that we will include the asm kernels; we have already nearly matched them in performance as it is.

bluss commented 3 years ago

outdated