ProjectPhysX / FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
https://youtube.com/@ProjectPhysX

Question: can the code be optimized to take advantage of "large caches" on Ada and RDNA2 GPUs? #64

Closed oscarbg closed 1 year ago

oscarbg commented 1 year ago

Hi, I have seen the old ComputeMark benchmark's 3D fluid texture-array test perform more than 50% faster on the 4090 vs the 3090 despite similar memory bandwidth. That benchmark seems bandwidth-bound, because it's a simple 3D-stencil wave simulator, so perhaps your LBM method can't be optimized the same way.. or can it? Just asking if the large caches, from 48MB on the 4070 Ti up to 128MB on RDNA2, can be exploited by localizing memory accesses so they hit the cache more often. It's sad to see the 4090 only marginally faster than the 3090 in FluidX3D.. Thanks..

ProjectPhysX commented 1 year ago

Hi @oscarbg,

this is a super interesting question. In short: unfortunately no.

The lattice Boltzmann algorithm works with density distribution functions (DDFs) fi; these are floating-point numbers that reside in VRAM, typically 19 for each grid cell. At the largest possible grid resolution, they fill almost the entire VRAM, several gigabytes. In every time step, each and every one of these DDFs is read once from VRAM, modified in the LBM collision step, and written back to memory. So LBM plows over the entire VRAM in every single time step. The arithmetic in the collision step is so little that the memory access time completely hides it - the algorithm operates in the bandwidth limit.

Modern GPUs now have ~16GB VRAM and "large" ~128MB of fast L2/L3 cache. FluidX3D through OpenCL already uses this cache to the fullest possible extent: the DDFs that fit into the cache do not have to be written to "slow" VRAM. But 128MB is only 0.8% of 16GB, and for the remaining 99.2% of the DDFs, the slower VRAM has to be used. At the largest possible resolution, even if the L2/L3 cache bandwidth is 5x faster than VRAM, the performance improvement due to cache is ~0.6%, barely measurable. So if the simulation box is large and VRAM demand is significantly larger than cache size, there is no benefit.

It is not possible to repeatedly compute on the same small chunk of VRAM to exploit caching - in this sense, LBM is a non-cacheable algorithm.

However: if the simulation box is tiny, e.g. 128³ cells fitting in the 128MB cache with FP16S/C memory compression, the entire LBM algorithm operates in fast L2/L3 cache, and performance is 5x when L2/L3 bandwidth is 5x the VRAM bandwidth. For cases where very low resolution is sufficient, the cache offers a large speedup, and benchmarks at low resolution show this.

So it depends on the application. In most cases though, you want as large a grid resolution as possible, and then VRAM capacity and bandwidth are all that count, while cache size is irrelevant.

Regards, Moritz