cholla-hydro / cholla

A GPU-based hydro code
https://github.com/cholla-hydro/cholla/wiki
MIT License

Optimize Characteristic Computations in Refactored Reconstructors #306

Closed bcaddy closed 11 months ago

bcaddy commented 1 year ago

I realized that I was needlessly recomputing the eigenvalues and eigenvectors for the characteristic projections multiple times in the PPMC and PLMC reconstruction kernels. This refactor changes it so that the eigensystem is only computed once and then reused in all of the projections. Overall this resulted in noticeable performance increases on both a V100 system (C-3PO) and on Frontier. The performance numbers are in the table below; all times given are for the entire time step, not just the PPMC kernel. The test was the Brio & Wu shock tube at 256^3 run with PPMC.

| Test | Time (ms/time step) | Registers (scalar/vector on AMD) |
| --- | --- | --- |
| C-3PO initial | 189.1 | 157 |
| C-3PO final | 180.4 | 158 |
| C-3PO improvement | 4.6% | +1 |
| Frontier/Crusher initial | 167.5 | 54/128 |
| Frontier/Crusher final | 144.9 | 54/128 |
| Frontier/Crusher improvement | 13.5% | 0/0 |

With __launch_bounds__(256) specified, performance on the MI250X is about 40% better than baseline.

trey-ornl commented 1 year ago

The CUDA register use tells me this kernel "wants" more than 128 registers. The AMD kernel is probably limited to 128 vector registers by the default launch bounds of 1024 threads, which force an occupancy of 4. You may want to add __launch_bounds__(256) to the kernel definition to see if that helps performance by allowing more register use and avoiding register spills.

bcaddy commented 1 year ago

Doing that increased register usage to 48 scalar and 256 vector registers and reduced the overall time-step time to 100.9 ms, a 40% performance improvement over the pre-refactor code on Frontier; there was no statistically significant change on C-3PO. That's crazy. Is there a way to tell the compiler to choose default launch bounds more cleverly, or do we have to add __launch_bounds__(TPB) to every kernel? Alternatively, is there a smart way to see which kernels need that kind of tuning?

trey-ornl commented 1 year ago

You could compile the HIP code with -Rpass-analysis=kernel-resource-usage and check the compiler output for kernels using 128 VGPRs. Those kernels could probably profit from __launch_bounds__ to access more registers.

bcaddy commented 1 year ago

Doing that, it looks like the only offenders are PLMC_VL, PPMC_VL, and Calculate_Roe_Fluxes_CUDA. We don't use the last one, but I'll add __launch_bounds__ to the other two.