Open charleskawczynski opened 5 months ago
Here is an Nsight report from this build: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/250 (from https://github.com/CliMA/ClimaAtmos.jl/pull/2846)
Zooming into `ldiv!` and `set_precomputed_quantities!` shows that there are many, many kernel launches in `set_precomputed_quantities!`. So, this is likely due to the loop in `set_diagnostic_edmf_precomputed_quantities_do_integral!`. We can add more NVTX annotations to confirm, but this is what I suspected and it makes sense based on the report.
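To illustrate why a loop over vertical levels can produce this pattern, here is a toy Python model. It is not ClimaAtmos code: `LaunchCounter`, `integrate_per_level`, and `integrate_fused` are hypothetical names, and the "kernels" are plain Python functions. It only shows that when each level issues its own launch, the launch count grows with the number of levels, whereas a single kernel that walks the column launches once and gives the same result.

```python
class LaunchCounter:
    """Hypothetical stand-in for a device: counts how many kernels are launched."""

    def __init__(self):
        self.launches = 0

    def launch(self, kernel, *args):
        self.launches += 1
        return kernel(*args)


def integrate_per_level(dev, source, dz):
    """Cumulative integral with one launch per vertical level."""
    total = 0.0
    out = []
    for s in source:
        # Each level issues its own (tiny) kernel launch.
        total = dev.launch(lambda acc, v: acc + v * dz, total, s)
        out.append(total)
    return out


def integrate_fused(dev, source, dz):
    """Same integral, but the loop over levels lives inside a single kernel."""

    def column_kernel(values):
        total = 0.0
        out = []
        for v in values:
            total += v * dz
            out.append(total)
        return out

    return dev.launch(column_kernel, source)


if __name__ == "__main__":
    nz = 63  # assumed from the "nz63" in the job name
    source = [1.0] * nz
    per_level, fused = LaunchCounter(), LaunchCounter()
    integrate_per_level(per_level, source, 100.0)
    integrate_fused(fused, source, 100.0)
    print(per_level.launches, fused.launches)
```

In the real code each level likely launches several kernels (one per broadcast expression), so the actual count would be a multiple of the number of levels.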
Could we look at the `gpu_hs_rhoe_equil_55km_nz63_0M` job in that build first? Comparing that with the one in the main branch shows a significant slowdown in `set_precomputed_quantities!` due to `get_cloud_fraction`.
Yes, of course. Which two builds are we comparing? Maybe we can add the option to do `set_cloud_fraction!` per stage vs per step/callback, so that we can merge an example into main. I'll try to do this now.
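A rough sketch of what such an option changes, again as a toy Python model rather than ClimaAtmos code. The function `count_cloud_fraction_calls` and the mode strings `"per_stage"` and `"per_step"` are hypothetical; the point is only that a per-step (callback-style) update runs once per step instead of once per stage of the timestepper.

```python
def count_cloud_fraction_calls(n_steps, n_stages, mode):
    """Count cloud-fraction updates for a toy multi-stage timestepper.

    mode = "per_stage": update inside every stage (with the other precomputed quantities).
    mode = "per_step":  update once at the end of each step, like a callback.
    """
    if mode not in ("per_stage", "per_step"):
        raise ValueError(f"unknown mode: {mode!r}")

    calls = 0
    for _ in range(n_steps):
        for _ in range(n_stages):
            # Stand-in for the per-stage precomputed-quantities update.
            if mode == "per_stage":
                calls += 1
        if mode == "per_step":
            # Stand-in for a callback that fires once per step.
            calls += 1
    return calls


if __name__ == "__main__":
    # With a 3-stage scheme, the per-step mode does a third of the updates.
    print(count_cloud_fraction_calls(10, 3, "per_stage"))
    print(count_cloud_fraction_calls(10, 3, "per_step"))
```

The trade-off is that with the per-step mode the cloud fraction seen inside the stages lags by up to one step.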
I'm not working on this, but I did open up the NVTX report with more annotations and can confirm that the issue is the large number of kernel launches in the integral function:
zoomed out:
I need to confirm (Nsight Systems crashed on me), but I think these ranges are on the GPU, in which case 60% of the time spent in `step!` is in `set_diagnostic_edmf_precomputed_quantities_do_integral!` (for a non-radiation `step!`). If not, I at least know that when I clicked on the range, a pretty large number of kernels was highlighted.
Diagnostic EDMF performance is slow, and we need to identify the cause and improve the performance.