ProjectPhysX / FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
https://youtube.com/@ProjectPhysX
Other
3.77k stars 300 forks source link

MRT Misconception #77

Closed ivan-pi closed 1 year ago

ivan-pi commented 1 year ago

The FAQs state that:

Although MRT can be implemented in an efficient manner with only a single matrix-vector multiplication in registers, leading to identical performance compared to SRT by remaining bandwidth-bound, storing the matrices vastly elongates and over-complicates the code for no real benefit.

This statement does not reflect state of the art in LBM. If you look at Table 4.1 in the article

Hennig, F., Holzer, M., & Rüde, U. (2022). Advanced Automatic Code Generation for Multiple Relaxation-Time Lattice Boltzmann Methods. arXiv preprint arXiv:2211.02435.

it is possible to develop MRT kernels with roughly the same FLOPS count as BGK. The idea is you don't store the full 19-by-19 matrices, but just directly inline the non-zero coefficients, applying (symbolic) common-subexpression elimination along the way.

But I agree the numerical properties of MRT are difficult to explain or generalize as shown by recent theoretical works, e.g. https://arxiv.org/pdf/2104.14217.pdf, and don't necessarily improve accuracy or stability.

Btw, maybe you'll find this work interesting too: https://arxiv.org/pdf/2304.06437.pdf. I have had quite good results with regularized LBM methods (that was using higher order equilibria featuring third order terms). One downside on cache-based CPU's was the (over-)complicated collision, but this recent paper shows how to solve this, by "flipping" the purpose of PDF's and macroscopic variables. It's also considerably simpler to implement than compared too these advanced streaming schemes.

ProjectPhysX commented 1 year ago

Hi @ivan-pi,

thanks for sharing! I previously had implemented MRT by embedding only one 19x19 matrix Q := (M-1·S·M) in the OpenCL code and unrolling the single matrix-vector multiplication. The compiler would then already eliminate the multiplications by 0. The resulting assembly is equivalent to what you would get with code generation. Overall FLops count is significantly larger than SRT, but still well in the bandwidth limit, so there is no performance impact. What I meant in the FAQs was that the matrices M fro the velocity sets have to be hardcoded at least in the C++ part, and this is a few hundred lines extra, for no benefit in the end.

I'll have an eye on the thread-safe LBM variant!

Regards, Moritz