Open ksshannon opened 6 years ago
Sounds fine to me; it's been a long time since I looked into that BLAS stuff. It would be interesting to see if improvements could be made. It should be coded up so that a replacement pretty much drops in, I think. The matrix-vector multiplication is by far the most computationally intensive function. Don't forget that the matrix is stored in compressed sparse row format.
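For anyone following along, here is a minimal sketch of what a matrix-vector product over compressed sparse row storage looks like. The function name `csr_matvec` and the array names are hypothetical, not the routine in our code, and the sketch assumes every nonzero is stored explicitly. If only one triangle of a symmetric matrix is stored, the loop would need an extra update per entry.

```c
#include <assert.h>

/* y = A * x for an n-row matrix A in compressed sparse row (CSR) form.
 * Hypothetical names for illustration only.
 *   row_ptr: n + 1 offsets; row i owns entries row_ptr[i] .. row_ptr[i+1]-1
 *   col_idx: column index of each stored entry
 *   val:     value of each stored entry
 */
void csr_matvec(int n, const int *row_ptr, const int *col_idx,
                const double *val, const double *x, double *y)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        /* Walk only the stored entries of row i. */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            sum += val[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
}
```

The indirect access `x[col_idx[k]]` is what separates this from a dense `dgemv`, so a standard dense BLAS call would not be a direct replacement for this routine.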
I'm curious whether a hand-tuned BLAS implementation would run faster for us on some hardware (OpenBLAS has architecture- and generation-specific code, I think). I also did some simple testing: we have our own implementation of dcopy (actually two), which the compiler doesn't optimize out until -O3 is set. I propose we introduce ninja_blas.c/h and either use the current internal implementations or allow the user to supply one. The MKL and BLAS implementations will be treated separately.
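A rough sketch of what one entry point in the proposed ninja_blas.c/h could look like. The wrapper name `ninja_dcopy` and the build flag `NINJA_USE_EXTERNAL_BLAS` are hypothetical, and the fallback loop is a stand-in for our internal copy, not the existing code. `cblas_dcopy` is the standard CBLAS routine. The `cblas.h` header name matches OpenBLAS and may differ for other vendors.

```c
#include <assert.h>

#ifdef NINJA_USE_EXTERNAL_BLAS   /* hypothetical build flag */
#include <cblas.h>               /* header name as shipped by OpenBLAS */
#endif

/* Copy n doubles from x to y with unit stride.
 * Hypothetical wrapper: callers use this one name regardless of backend. */
void ninja_dcopy(int n, const double *x, double *y)
{
    assert(n >= 0);
#ifdef NINJA_USE_EXTERNAL_BLAS
    /* User-supplied BLAS: forward to the standard CBLAS routine. */
    cblas_dcopy(n, x, 1, y, 1);
#else
    /* Internal fallback: plain element-wise copy. */
    for (int i = 0; i < n; i++) {
        y[i] = x[i];
    }
#endif
}
```

With a wrapper like this, the two existing dcopy implementations could collapse into one call site, and the backend would be chosen at build time rather than in the solver code.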
/cc @jforthofer