Closed loliverhennigh closed 1 year ago
Hello Oliver,
We will soon release the performance metrics in an accompanying paper. Roughly speaking, when compared to a fully fused LBM kernel in a state-of-the-art C++ benchmark code, our version is approximately 6-7 times slower for lid-driven cavity flow. However, assuming that the BC kernel in the C++ code isn't fused (which is often the case if you want to leverage the complex BCs that are available in XLB), the performance gap narrows to roughly 3-5 times (this is also the case if you compare the performance for periodic BCs, such as the performance test case in Lettuce). While I haven't run tests on V100, preliminary tests suggest our code is significantly faster than Lettuce.
A major advantage is that our code has ~96% scaling efficiency on a single DGX node and maintains respectable scaling even on up to 512 GPUs. As far as I remember, Lettuce wasn't multi-GPU (or multi-node) capable.
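For readers comparing numbers, one common way to read a scaling-efficiency figure like the one above is as measured multi-GPU throughput divided by ideal linear scaling of the single-GPU throughput. The sketch below assumes that definition; the thread does not say whether weak or strong scaling was measured, and the function name and throughput values are hypothetical, not XLB results.

```python
def scaling_efficiency(throughput_single: float,
                       throughput_multi: float,
                       n_devices: int) -> float:
    """Measured multi-device throughput relative to ideal linear scaling.

    Assumed definition: efficiency = T_N / (N * T_1), where T_1 is the
    throughput on one device and T_N the aggregate throughput on N devices.
    """
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    if throughput_single <= 0:
        raise ValueError("throughput_single must be positive")
    return throughput_multi / (n_devices * throughput_single)


# Hypothetical throughputs (arbitrary units), chosen only for illustration.
eff = scaling_efficiency(throughput_single=1000.0,
                         throughput_multi=7680.0,
                         n_devices=8)
print(f"{eff:.0%}")
```

Throughput can be in any unit (for LBM codes, lattice updates per second is typical) as long as the single- and multi-device runs use the same one.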
It's worth noting that there is ongoing work to close this performance gap further by integrating Triton kernels into portions of the code.
Fantastic! This is very exciting to hear. Have you considered using either Taichi Lang or Warp for writing the kernels (https://github.com/NVIDIA/warp)? I have experience with both and found them to be particularly good for things like this. I have an LBM solver implemented in Warp and am getting the same performance as FluidX3D (https://github.com/ProjectPhysX/FluidX3D). Warp also has pretty good Jax integration, I think. I haven't tried implementing LBM in Taichi, but I have an explicit finite volume solver and it appears to be getting SOA performance, although I am less confident of that. I have also tried Triton a bit but found it a little difficult to get working for this kind of work. If you do implement in Triton I will be very interested to see how it goes though :).
Sorry, one more comment: if you are interested in getting the rendering stuff like in FluidX3D running, I would also suggest looking at either Warp or Taichi. Implementing ray marching/tracing is kinda complicated in a tensor-based framework like Jax. I can't imagine implementing it in Triton. Here is a very simple ray marching on the density contours of a FV solver in Taichi: https://www.youtube.com/watch?v=xcZcHbvMe-g.
Hey Oliver.
Thanks for your comments. In fact, we have discussed using Warp specifically for visualization tasks with NVIDIA extensively! (We are collaborating with the NVIDIA JAX team on this project, FYI.) Warp and USD would be quite useful for this purpose, especially when dealing with simulations with multiple billion voxels.
The issue with Warp is the license agreement, which is incompatible with Apache 2.0. I have raised this issue with directors at NVIDIA, but haven't heard back about any progress.
I am happy to chat more about this if you're interested. I have added you on LinkedIn, or pls shoot me an email at mehdi.ataei@autodesk.com.
Any numbers we can see for performance? Been interested to see if a Jax LBM implementation can get close to optimal performance. Lettuce LBM is around 20x slower, for example: https://github.com/lettucecfd/lettuce