demianmnave / CML

The Configurable Math Library
https://github.com/demianmnave/CML
Boost Software License 1.0

CML matrix multiplication is terribly inefficient #20

Open egorodet opened 4 years ago

egorodet commented 4 years ago

CML matrix multiplication is based on a naive algorithm that does not take into account any cache-memory optimizations. As a result, matrix multiplication becomes a bottleneck in any real-world application using CML. It would be nice to implement at least some memory optimizations (https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0). It would also be great to use SIMD intrinsics (https://codereview.stackexchange.com/questions/101144/simd-matrix-multiplication), if that is not too hard to implement in template code.
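For reference, the naive schoolbook product looks roughly like the sketch below (a generic illustration, not CML's actual implementation): with row-major storage the inner loop walks B column-wise, so cache lines are reused poorly once the matrices outgrow the cache.

```cpp
// Generic naive O(n^3) triple loop for row-major square matrices.
// This is an illustration of the "naive algorithm" being discussed,
// not code taken from CML.
#include <cstddef>
#include <vector>

void multiply_naive(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C,
                    std::size_t n)  // all three are n x n, row-major
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            // B is traversed with stride n here, which is the main source
            // of cache misses for large n; blocking/tiling addresses this.
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```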

demianmnave commented 4 years ago

Sincere apologies for not responding to this. For some reason, I do not receive new issue notifications.

Have you observed a slowdown in a real application? If so, by how much, and for what size matrices?

In general, CML is targeted at small, often-changing matrices. Please correct me if I am wrong, but as I recall, a cache-conscious matrix product won't make much difference unless the matrices are larger than the L1 and/or L2 cache. Consequently, I'd imagine that one of the cache-optimized BLAS implementations would be more appropriate outside of typical CML use cases.

Similarly for SIMD, if you've observed a slowdown in a real application, I'd be very interested. I haven't verified recently, but for small, fixed-size vectors and matrices, loops should be compiler-unrolled automatically. Moreover, on Intel x64, SSE is required for floating-point arithmetic, and later Visual Studio releases should auto-parallelize the loops (this may require a special command-line argument, I don't recall at the moment). I can verify this and make sure I'm not doing anything to get in the compiler's way.

Finally, I'm very much interested in these sorts of optimizations, so if you have performance data and/or code that supports your request, that would be fantastic and most welcome. (Of course, pull requests with supporting performance tests would also be welcome.)
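As one illustration of what a supporting performance test could look like (a minimal sketch only; `mul4x4` below is a hypothetical stand-in for whatever product implementation is being measured, not a CML function), a small harness can time a large batch of 4x4 products:

```cpp
// Minimal timing harness sketch (not part of CML): multiply a batch of
// 200,000 4x4 float matrices and report the total time. Any candidate
// implementation can be substituted where mul4x4() is called.
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

using mat4 = std::array<float, 16>;  // row-major 4x4

static void mul4x4(const mat4& a, const mat4& b, mat4& c)
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = 0.f;
            for (int k = 0; k < 4; ++k) s += a[i * 4 + k] * b[k * 4 + j];
            c[i * 4 + j] = s;
        }
}

int main()
{
    const std::size_t n = 200000;  // products per "frame"
    std::vector<mat4> a(n), b(n), c(n);
    for (std::size_t i = 0; i < n; ++i) { a[i].fill(1.f); b[i].fill(2.f); }

    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) mul4x4(a[i], b[i], c[i]);
    const auto t1 = std::chrono::steady_clock::now();

    // Checksum keeps the optimizer from discarding the results.
    float checksum = 0.f;
    for (std::size_t i = 0; i < n; ++i) checksum += c[i][0];

    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("%zu products in %.3f ms (checksum %.1f)\n", n, ms, checksum);
    return 0;
}
```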

Thank you! Demian

egorodet commented 4 years ago

Hello Demian,

I found this performance issue while optimizing a graphics sample that demonstrates parallel rendering and is part of the Methane Kit project. Methane Kit is a modern graphics API abstraction library that uses CML as its base vector math library.

The Asteroids sample demonstrates parallel rendering of up to 50,000 asteroids in real time, with animation updates on each frame. It turned out that the sample is CPU-bound because the animation updates are too slow, even with multi-threaded execution. The AsteroidsArray::Update function updates the parameters of 50,000 asteroids on each frame, including the calculation of the model and model-view-projection matrices, which are produced by 4 multiplications of matrix44f_r per asteroid. That amounts to 200,000 matrix multiplications per AsteroidsArray::Update call, which runs in parallel on 16 threads on my Intel Core i7-7820X and takes 16.2 ms per frame. That is obviously too much for a 60 FPS frame budget if you also want to do some render command encoding besides calculating the animations (the sample was running at 40 FPS). The program was compiled on Windows 10 1909 with VS2019 MSVC 14.26 using the standard CMake Release configuration flags.

To fix this, I've implemented an SSE-optimized template specialization of the matrix product for matrix44f_r; see the modified version of cml/matrix/matrix_product.tpp. With the optimized implementation in place, the same AsteroidsArray::Update call now takes 4.3 ms, an almost 4x speedup for this function. As a result, the sample now runs at 85 FPS with 50,000 asteroids. I believe vector/matrix calculations on the CPU could be accelerated even further by switching to a fully SSE-optimized math library like hlslpp, but it lacks some functionality available in CML, so that is still on hold.
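For readers following along, the general shape of such a specialization is sketched below (an illustration of the usual row-broadcast SSE approach, not the exact code in the modified matrix_product.tpp): each row of the result is computed as a linear combination of the rows of b, so no per-element horizontal sums are needed.

```cpp
// Sketch of an SSE 4x4 single-precision product for row-major matrices.
// This illustrates the technique only; it is not the author's actual
// specialization for cml::matrix44f_r.
#include <xmmintrin.h>  // SSE intrinsics

void mul4x4_sse(const float* a, const float* b, float* out)
{
    // Load the four rows of b once.
    const __m128 b0 = _mm_loadu_ps(b + 0);
    const __m128 b1 = _mm_loadu_ps(b + 4);
    const __m128 b2 = _mm_loadu_ps(b + 8);
    const __m128 b3 = _mm_loadu_ps(b + 12);

    for (int i = 0; i < 4; ++i) {
        // out_row_i = a[i][0]*b_row0 + a[i][1]*b_row1
        //           + a[i][2]*b_row2 + a[i][3]*b_row3
        __m128 row =          _mm_mul_ps(_mm_set1_ps(a[i * 4 + 0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[i * 4 + 1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[i * 4 + 2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[i * 4 + 3]), b3));
        _mm_storeu_ps(out + i * 4, row);
    }
}
```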

I think CML is a great C++ vector math library with extensive functionality, but it lacks the critically needed SSE optimizations to be usable in high-performance, real-time applications. I hope you can add these optimizations in the future!


With Best Regards, Evgeny.

demianmnave commented 4 years ago

Evgeny, thank you very much for the details. I will definitely look into it.

Just out of curiosity, was the slowdown in the 32-bit or 64-bit build of the Asteroids sample (or both)?

Thanks again! Demian