BallisticLA / RandBLAS

A header-only C++ library for sketching in randomized linear algebra
https://randblas.readthedocs.io/en/latest/
Other
64 stars 4 forks source link

Switch to header-only implementation *AND* fix major speed issue with dense sketching #79

Closed rileyjmurray closed 6 months ago

rileyjmurray commented 6 months ago

Main reason for this PR

Very early versions of RandBLAS were header-only. We moved away from that so RNG template arguments didn't need to appear in sketching functions. However, since then I've gotten a better understanding of how templates work, and I think header-only should be doable. This PR makes just enough changes so RandBLAS is header-only and all existing tests pass. Additional tests might be needed before merging.

Important second thing in this PR

Before merging I ran the TLS_DenseSkOp example and saw the performance was awful. It turned out that our implementations of LSKGE3/RSKGE3 were calling GEMM from blas++ as blas::gemm<T>(...). The explicit specification of the template argument in this context causes blas++ to dispatch a fully-templated naive implementation of GEMM with three for-loops, which has terrible performance. The fix here to get vendor-BLAS is to just omit the template parameter, and call as blas::gemm(...).

Ping @TeachRaccooon: the fix above to our implementation of dense sketching has implications for runtime experiments you may have made with low-rank approximation algorithms. Please check if you relied on sketch_general with dense sketching operators or lskge3/rskge3.

P.S., @kaiwenhe7: I wouldn't have been able to find this bug if you hadn't created the TLS example for RandBLAS. Thanks so much for doing that!

TeachRaccooon commented 6 months ago

@rileyjmurray checked and we're using sketch_general. Also, looks like we're not using blas::gemm<T> anywhere in RandLAPACK. Should we be aware of explicit template specification in any other blas/lapack routines?

rileyjmurray commented 6 months ago

@TeachRaccooon sketch_general<T> versus sketch_general isn't the issue, it's just blas::gemm<T> vs blas::gemm. Good to know that RandLAPACK doesn't use blas::gemm<T>.

Since RandLAPACK's version of RandBLAS is prior to this commit, I can say that the dense sketching performance was unnecessarily bad. If you update to the latest RandBLAS then dense sketching will be much faster.