ProjectPhysX / FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
https://youtube.com/@ProjectPhysX

Question: can the code be optimized to take advantage of "large caches" on Ada and RDNA2 GPUs? #64

Closed oscarbg closed 1 year ago

oscarbg commented 1 year ago

Hi, I have seen the old ComputeMark benchmark's 3D fluid texture-array test perform more than 50% faster on the 4090 vs the 3090 despite similar memory bandwidth. That benchmark seems bandwidth-bound, because it's a simple 3D-stencil wave simulator, so perhaps your LBM method can't be optimized the same way.. or can it? Just asking if the large caches, from 48MB on the 4070 Ti up to 128MB on RDNA2, can be exploited by localizing memory accesses so they hit the cache more often. It's sad to see the 4090 only marginally faster than the 3090 in FluidX3D.. Thanks..

ProjectPhysX commented 1 year ago

Hi @oscarbg,

this is a super interesting question. In short: unfortunately no.

The lattice Boltzmann algorithm works with density distribution functions (DDFs) fi; these are floating-point numbers that reside in VRAM, typically 19 for each grid cell. At the largest possible grid resolution, they fill almost the entire VRAM, several gigabytes. In every time step, each and every one of these DDFs is read once from VRAM, modified in the LBM collision step, and written back to memory. So LBM plows over the entire VRAM in every single time step. The arithmetic in the collision step is so little that the memory access time completely hides it - the algorithm operates in the bandwidth limit.

Modern GPUs now have ~16GB VRAM and "large" ~128MB of fast L2/L3 cache. FluidX3D through OpenCL already uses this cache to the fullest possible extent: the DDFs that fit into the cache do not have to be written to "slow" VRAM. But 128MB is only 0.8% of 16GB, and for the remaining 99.2% of the DDFs, the slower VRAM has to be used. At the largest possible resolution, even if the L2/L3 cache bandwidth is 5x faster than VRAM, the performance improvement due to cache is ~0.6%, barely measurable. So if the simulation box is large and VRAM demand is significantly larger than cache size, there is no benefit.

It is not possible to repeatedly compute on the same small chunk of VRAM to exploit caching - in this sense, LBM is a non-cacheable algorithm.

However: if the simulation box is tiny, e.g. 128³ cells fitting in the 128MB cache with FP16S/C memory compression, the entire LBM algorithm operates in fast L2/L3 cache, and performance is 5x when L2/L3 bandwidth is 5x the VRAM bandwidth. For cases where very low resolution is sufficient, the cache offers a large speedup, and benchmarks at low resolution show this.

So it depends on the application. In most cases though, you want as large a grid resolution as possible, and then VRAM capacity and bandwidth are all that count, while cache size is irrelevant.

Regards, Moritz