bcaddy closed this issue 11 months ago.
The CUDA register usage tells me this kernel "wants" more than 128 registers. The AMD kernel is probably limited to 128 vector registers because of the default launch bounds of 1024 threads, which forces an occupancy of 4. You may want to add `__launch_bounds__(256)` to the kernel definition to see if that helps performance by allowing more register use and avoiding register spills.
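For reference, a minimal sketch of what that annotation looks like in CUDA/HIP (the kernel name, signature, and body here are illustrative placeholders, not Cholla's actual code):

```cuda
// __launch_bounds__(256) promises the compiler this kernel is never launched
// with more than 256 threads per block, so it can budget more registers per
// thread instead of assuming the 1024-thread default.
__global__ void __launch_bounds__(256)
Example_Reconstruction_Kernel(const double *dev_conserved, double *dev_bounds, int n_cells)
{
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  if (tid < n_cells) {
    // ... per-cell reconstruction work would go here ...
    dev_bounds[tid] = dev_conserved[tid];
  }
}
```

The launch configuration must then respect the promise: launching this kernel with more than 256 threads per block is an error.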
Doing that increased register usage to 48 scalar and 256 vector registers and reduced the overall time-step time to 100.9 ms, a 40% performance improvement over the pre-refactor code on Frontier; there was no statistically significant change on C-3PO. That's crazy. Is there a way to tell the compiler to choose default sizes more cleverly, or do we have to add `__launch_bounds__(TPB)` to every kernel? Alternatively, is there a smart way to see which kernels need that kind of tuning?
You could compile the HIP code with `-Rpass-analysis=kernel-resource-usage` and check the compiler output for kernels using 128 VGPRs. Kernels at that limit could probably profit from `__launch_bounds__` to access more registers.
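Concretely, something like the following (file path and compile flags are illustrative; assumes a ROCm install with `hipcc` on the PATH):

```shell
# -Rpass-analysis=kernel-resource-usage makes the AMDGPU backend emit a remark
# per kernel with its SGPR/VGPR counts, spills, LDS use, and achievable occupancy.
# Remarks go to stderr; -B lets grep show the preceding "Function Name" lines.
hipcc -c -O3 -Rpass-analysis=kernel-resource-usage reconstruction.cu 2>&1 \
  | grep -B2 "VGPRs: 128"
```

Any kernel reported at exactly 128 VGPRs is likely being capped by the default 1024-thread launch bounds rather than by its actual register demand.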
Doing that, it looks like the only offenders are `PLMC_VL`, `PPMC_VL`, and `Calculate_Roe_Fluxes_CUDA`. We don't use the last one, but I'll add the attribute to the other two.
I realized that I was recomputing the eigenvalues for the characteristic projections multiple times for no reason in the PPMC and PLMC reconstruction kernels. This refactor changes it so that the eigenvectors are computed only once and then reused in all the projections. Overall this resulted in noticeable performance increases on both a V100 system (C-3PO) and on Frontier. The performance numbers are in this table; all times given are for the entire timestep, not just the PPMC kernel. The test was the Brio & Wu shock tube at 256^3 run with PPMC.
With `__launch_bounds__(256)` specified, performance is about 40% better than baseline on the MI250X.