bcaddy closed this issue 11 months ago.
The CUDA register usage tells me this kernel "wants" more than 128 registers. The AMD kernel is probably limited to 128 vector registers because of the default launch bounds of 1024 threads, which forces an occupancy of 4. You may want to add `__launch_bounds__(256)` to the kernel definition to see if that helps performance by allowing more register use and avoiding register spills.
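For reference, a minimal sketch of what that annotation looks like in CUDA/HIP (the kernel name, signature, and body here are illustrative placeholders, not Cholla's actual code):

```cuda
// __launch_bounds__(256) promises the compiler this kernel is never launched
// with more than 256 threads per block, so it can budget more registers per
// thread instead of assuming the 1024-thread default.
__global__ void __launch_bounds__(256)
Example_Reconstruction_Kernel(const double *dev_conserved, double *dev_bounds, int n_cells)
{
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  if (tid < n_cells) {
    // ... per-cell reconstruction work would go here ...
    dev_bounds[tid] = dev_conserved[tid];
  }
}
```

The launch configuration must then respect the promise: launching this kernel with more than 256 threads per block is an error.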
Doing that increased register usage to 48 scalar and 256 vector registers and reduced the overall time-step time to 100.9 ms, a 40% performance improvement over the pre-refactor code on Frontier; there was no statistically significant change on C-3PO. That's crazy. Is there a way to tell the compiler to choose default sizes more cleverly, or do we have to add `__launch_bounds__(TPB)` to every kernel? Alternatively, is there a smart way to see which kernels need that kind of tuning?
You could compile the HIP code with `-Rpass-analysis=kernel-resource-usage` and check the compiler output for kernels using 128 VGPRs. Kernels at that limit could probably profit from `__launch_bounds__` to access more registers.
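Concretely, something like the following (file path and compile flags are illustrative; assumes a ROCm install with `hipcc` on the PATH):

```shell
# -Rpass-analysis=kernel-resource-usage makes the AMDGPU backend emit a remark
# per kernel with its SGPR/VGPR counts, spills, LDS use, and achievable occupancy.
# Remarks go to stderr; -B lets grep show the preceding "Function Name" lines.
hipcc -c -O3 -Rpass-analysis=kernel-resource-usage reconstruction.cu 2>&1 \
  | grep -B2 "VGPRs: 128"
```

Any kernel reported at exactly 128 VGPRs is likely being capped by the default 1024-thread launch bounds rather than by its actual register demand.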
Doing that, it looks like the only offenders are `PLMC_VL`, `PPMC_VL`, and `Calculate_Roe_Fluxes_CUDA`. We don't use the last one, but I'll add the attribute to the other two.
I realized that I was recomputing the eigenvalues for the characteristic projections multiple times for no reason in the PPMC and PLMC reconstruction kernels. This refactor changes it so that the eigenvectors are computed only once and then reused in all the projections. Overall this resulted in noticeable performance increases on both a V100 system (C-3PO) and on Frontier. The performance numbers are in this table; all times given are for the entire timestep, not just the PPMC kernel. The test was the Brio & Wu shock tube at 256^3 run with PPMC.
With `__launch_bounds__(256)` specified, performance is about 40% better than baseline on the MI250X.