CliMA / ClimaAtmos.jl

ClimaAtmos.jl is a library for building atmospheric circulation models that is designed from the outset to leverage data assimilation and machine learning tools. We welcome contributions!
Apache License 2.0

Improve performance of diagnostic edmf #2868

Open charleskawczynski opened 5 months ago

charleskawczynski commented 5 months ago

Diagnostic EDMF performance is slow. We need to identify the cause and improve it.

charleskawczynski commented 5 months ago

Here is an Nsight Systems report:

image

from this build: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/250 (from https://github.com/CliMA/ClimaAtmos.jl/pull/2846)

charleskawczynski commented 5 months ago

Zooming into ldiv! and set_precomputed_quantities!

image

shows that there are many, many kernel launches in set_precomputed_quantities!. So, this is likely due to the loop in set_diagnostic_edmf_precomputed_quantities_do_integral!. We can add more NVTX annotations to confirm, but this is what I suspected and it makes sense based on the report.
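As a rough illustration of why a per-level loop can dominate, here is a back-of-the-envelope cost model. It is only a sketch: the level count, the number of kernels per level, and the per-launch overhead are all assumed values, not measurements from this report, and the "fused" case is a hypothetical restructuring in which the loop over levels runs inside each kernel.

```python
def launch_overhead_seconds(n_levels, kernels_per_level, overhead_per_launch_s):
    """Fixed launch cost when every vertical level issues its own set of kernels."""
    return n_levels * kernels_per_level * overhead_per_launch_s


# All three numbers below are assumptions chosen for illustration.
n_levels = 63            # assumed number of vertical levels
kernels_per_level = 20   # assumed kernels launched per loop iteration
overhead = 10e-6         # assumed fixed cost per kernel launch, in seconds

# Loop over levels on the host: one batch of launches per level.
looped = launch_overhead_seconds(n_levels, kernels_per_level, overhead)

# Hypothetical fused variant: the level loop lives inside the kernels,
# so the batch of launches happens once.
fused = launch_overhead_seconds(1, kernels_per_level, overhead)

# The fixed launch cost scales linearly with the number of levels.
ratio = looped / fused
```

Under these assumptions the fixed launch cost grows linearly with the number of levels, regardless of how little work each kernel does.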

szy21 commented 5 months ago

Could we look at the gpu_hs_rhoe_equil_55km_nz63_0M job in that build first? Comparing that with the one in the main branch shows a significant slowdown in set_precomputed_quantities! due to get_cloud_fraction.

charleskawczynski commented 5 months ago

Yes, of course. Which two builds are we comparing? Maybe we can add the option to do set_cloud_fraction! per stage vs per step/callback, so that we can merge an example into main. I'll try to do this now.

charleskawczynski commented 5 months ago

I'm not working on this, but I did open the NVTX report with more annotations and can confirm that the issue is the large number of kernel launches in the integral function:

image

zoomed out:

image

I need to confirm (Nsight Systems crashed on me), but I think these ranges are on the GPU, in which case 60% of the time spent in step! is in set_diagnostic_edmf_precomputed_quantities_do_integral! (for a non-radiation step!). If not, I at least know that when I clicked on the range, a pretty large number of kernels were highlighted.