Closed rileyjmurray closed 6 months ago
@rileyjmurray checked and we're using sketch_general
.
Also, looks like we're not using blas::gemm<T>
anywhere in RandLAPACK.
Should we be aware of explicit template specification in any other blas/lapack routines?
@TeachRaccooon sketch_general<T>
versus sketch_general
isn't the issue, it's just blas::gemm<T>
vs blas::gemm
. Good to know that RandLAPACK doesn't use blas::gemm<T>
.
Since RandLAPACK's version of RandBLAS is prior to this commit, I can say that the dense sketching performance was unnecessarily bad. If you update to the latest RandBLAS then dense sketching will be much faster.
Main reason for this PR
Very early versions of RandBLAS were header-only. We moved away from that so RNG template arguments didn't need to appear in sketching functions. However, since then I've gotten a better understanding of how templates work, and I think header-only should be doable. This PR makes just enough changes so RandBLAS is header-only and all existing tests pass. Additional tests might be needed before merging.
Important second thing in this PR
Before merging I ran the
TLS_DenseSkOp
example and saw the performance was awful. It turned out that our implementations of LSKGE3/RSKGE3 were calling GEMM from blas++ asblas::gemm<T>(...)
. The explicit specification of the template argument in this context causes blas++ to dispatch a fully-templated naive implementation of GEMM with three for-loops, which has terrible performance. The fix here to get vendor-BLAS is to just omit the template parameter, and call asblas::gemm(...)
.Ping @TeachRaccooon: the fix above to our implementation of dense sketching has implications for runtime experiments you may have made with low-rank approximation algorithms. Please check if you relied on
sketch_general
with dense sketching operators orlskge3
/rskge3
.P.S., @kaiwenhe7: I wouldn't have been able to find this bug if you hadn't created the TLS example for RandBLAS. Thanks so much for doing that!