Closed bluss closed 3 years ago
The smaller perf difference for f64 does not mean our Rust kernel is close to optimal; rather, the BLIS asm is not reaching what OpenBLAS can do on the same benchmark problem.
Not sure how to handle licensing. I will probably break these out into their own crate, which is nicer.
This PR is still here for future reference. It is unlikely we will include the asm kernels; we have already nearly matched their performance as it is.
Add sgemm and dgemm asm microkernels from BLIS
Based on PR #2.
These go into their own little subcrate so that we can easily skip having a build script, as well as more easily explain the licensing.
The BLIS project uses a BSD 3-clause license.
The asm kernels will be opt-in and require a C compiler to build; the gcc crate handles this fine. Please file a PR if you can patch this to work on more platforms.
The asm implementations especially improve the f32 case. This can be explained by the jump from a 4-by-4 microkernel to an 8-by-8 one, while the f64 case uses the same 8-by-4 size for both Rust and asm.
I did experiment with an 8-by-8 microkernel in plain Rust, but it was no improvement over 4-by-4.
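
For readers unfamiliar with the microkernel sizes discussed above, here is a minimal sketch of what an MR-by-NR microkernel computes. It is an illustration only, not the code in this PR: the names `microkernel` and `fmas_per_loaded_element` and the packed layout of `a` and `b` are assumptions made for the example.

```rust
/// Hypothetical MR-by-NR microkernel (illustrative, not this crate's code).
/// Computes C += A * B for one MR x NR tile of C.
/// Assumed layout: `a` holds k packed columns of MR elements each,
/// `b` holds k packed rows of NR elements each.
fn microkernel<const MR: usize, const NR: usize>(
    k: usize,
    a: &[f32],
    b: &[f32],
    c: &mut [[f32; NR]; MR],
) {
    assert!(a.len() >= k * MR && b.len() >= k * NR);
    // The accumulator tile is what an optimized kernel tries to keep in registers.
    let mut acc = [[0.0f32; NR]; MR];
    for p in 0..k {
        let a_col = &a[p * MR..(p + 1) * MR];
        let b_row = &b[p * NR..(p + 1) * NR];
        // Rank-1 update: MR * NR multiply-adds from MR + NR loaded elements.
        for i in 0..MR {
            for j in 0..NR {
                acc[i][j] += a_col[i] * b_row[j];
            }
        }
    }
    // Write the accumulated tile back into C.
    for i in 0..MR {
        for j in 0..NR {
            c[i][j] += acc[i][j];
        }
    }
}

/// Multiply-adds per loaded element in one step of the k loop.
fn fmas_per_loaded_element(mr: usize, nr: usize) -> f64 {
    (mr * nr) as f64 / (mr + nr) as f64
}
```

Under this simple count, a 4-by-4 tile does 16 multiply-adds per 8 loaded elements, while an 8-by-8 tile does 64 per 16, so the larger tile does twice as much arithmetic per load. That is one plausible reading of why the 8-by-8 asm kernel helps f32. The larger tile only pays off if the accumulators stay in registers, and whether that happens for a plain Rust 8-by-8 kernel depends on the compiler.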