szy21 commented 8 months ago

2456 makes all the dycore simulations slower by a factor of two. When looking at it more closely, it seems most of the slowdown is from buoyancy_gradients. As a temporary solution, the mixing length and thus buoyancy gradients calculation is moved to a callback in #2466 without EDMF. With EDMF, this is still called at each timestep. It would be good if we can improve the performance of buoyancy_gradients. cc @charleskawczynski

### Tasks
- [x] Reduce number of times that local geometry is explicitly passed as an argument (https://github.com/CliMA/ClimaAtmos.jl/pull/2367, https://github.com/CliMA/ClimaAtmos.jl/pull/2544)
- [x] Analyze the performance of `cloud_fraction_model_callback!`
- [x] Add cloud diagnostics to flame graph (#2668)
- [x] Hoist out ᶜgradᵥ(ᶠinterp(ϕ)) for both broadcast expressions, and see where that gets us. Done in #2673
- [x] Reduce size of structs to reduce memory throughput https://github.com/CliMA/ClimaAtmos.jl/pull/2680.
- [x] Understand and report why performance was so poor for these kernels (perhaps open issues in ClimaCore). Answer: I think the quad loop is simply compute-intensive (9 SA calls is a lot), and there are no metric terms, so this kernel is relatively simple on the ClimaCore side. For the buoyancy gradients, we could see some improvement by using shared memory (right now, we double the number of loads/stores in the FD operators for 2-point stencils).
- [ ] (optional) Apply RootSolvers/Thermodynamics optimizations (e.g., CliMA/RootSolvers.jl#51) to reduce the arithmetic intensity of those kernels (they are very expensive)
- [ ] Add GPU job that exercises `cloud_fraction_model_callback!`: https://github.com/CliMA/ClimaAtmos.jl/pull/2684
- [x] (optional) Inline all thermo functions to allow for the compiler to perform Common Subexpression Elimination (CSE). On the CPU, it's required that we first fix https://github.com/CliMA/ClimaCore.jl/issues/1602 so that `PhasePartition` is inlined. We already always inline on the GPU, so this will have no impact there.

charleskawczynski commented 7 months ago

Update on this: the performance of buoyancy_gradients is poor, however, the full story is that two kernels are expensive, from calling cloud_fraction_model_callback!:

The NVTX trace, showing the relative performance hit compared to other high level kernels:

And the flame graph, showing what's being called with more granularity:

So, the buoyancy_gradients is about 33% responsible for the cost of adding the call of cloud_fraction_model_callback!, and the quad_loop call is about 63% responsible. There are a few other broadcast expressions, but they are relatively inexpensive, so I'll focus on these two.

I think, to start with, we should simply break up buoyancy_gradients and try to see if/when there's a catastrophic performance improvement. Just eye-balling it, I'd say that we should hoist out ᶜgradᵥ(ᶠinterp(ϕ)) for both broadcast expressions, and see where that gets us. At the very least, we could reuse these precomputed quantities.

We may need to apply some RootSolvers/Thermodynamics optimizations (e.g., https://github.com/CliMA/RootSolvers.jl/pull/51) to reduce the arithmetic intensity of those kernels (they are very expensive).

Also, next, I'll add a gpu job and see what the performance of this kernel looks like on the gpu, since that's probably a more important target to optimize.

cc @szy21, @trontrytel, @tapios

szy21 commented 7 months ago

That's interesting, thanks! I reached the conclusion that buoyancy_gradients is more expensive than quad_loop as when I commented out buoyancy_gradients the time-to-solution is similar to commenting out the entire cloud_fraction_model_callback!, when I first noticed the problem. But maybe that's not an accurate way to estimate it, or maybe we improved the performance of buoyancy_gradients since then.

trontrytel commented 7 months ago

Sounds good. If there is anything easy to optimize in Thermodynamics, I think we should start with it. We are calling each function 9 times when running with quadrature points. So even small improvements should show up

charleskawczynski commented 7 months ago

Here are some updates (cc @szy21):

I've converted the cloud diagnostics job into a gpu job (https://github.com/CliMA/ClimaAtmos.jl/pull/2684), and the kernel is significantly cheaper (our gpu implementation is much better than our cpu implementation). So, one solution is to just switch more jobs over to the gpu.
The quad loop has optimization room in Thermodynamics / RootSolvers (by doing less work), which will improve cpu and gpu performance
The buoyancy gradients kernel may have optimization room in using shared memory for our FD kernels. This will mean fewer reads/writes, which are serial on the cpu, and parallel on the gpu (i.e., improve global memory traffic). Using shared memory for our FD kernels is already on our perf road map: https://github.com/CliMA/ClimaAtmos.jl/issues/2632, so it's not just the buoyancy gradients kernel that will benefit.

szy21 commented 6 months ago

Could you post the time-to-solution for the job with and without cloud diagnostics on GPU? Other than that I'm ok with closing this issue. Thanks for all the work!

charleskawczynski commented 6 months ago

Yep, from this PR (with 1 p100 gpu):

[ Info: solve!: 450.030 s (149709528 allocations: 19.45 GiB)
[ Info: sypd: 5.259918262132573
[ Info: wall_time_per_timestep: 234 milliseconds, 390 microseconds

So, it's actually not bad.

szy21 commented 6 months ago

Yep, from this PR (with 1 p100 gpu):

[ Info: solve!: 450.030 s (149709528 allocations: 19.45 GiB)
[ Info: sypd: 5.259918262132573
[ Info: wall_time_per_timestep: 234 milliseconds, 390 microseconds

So, it's actually not bad.

And how about the one without cloud diagnostics on GPU?

charleskawczynski commented 6 months ago

Good question, I'll convert the other one, too in that PR so that we can compare.

charleskawczynski commented 6 months ago

Without cloud diagnostics, we have:

[ Info: solve!: 442.539 s (150425525 allocations: 19.44 GiB)
[ Info: sypd: 5.348959731633479
[ Info: wall_time_per_timestep: 230 milliseconds, 489 microseconds

cc @szy21

szy21 commented 6 months ago

Great, thanks!

charleskawczynski commented 5 months ago

@tapios mentioned that this is still an issue, so I'm reopening.

charleskawczynski commented 5 months ago

We should get updated numbers.

tapios commented 5 months ago

For cloud diagnostics, the issue is not directly buoyancy gradients, but gradientes of moisture/enthalpy. This may be related to the buoyancy gradient issue though. @szy21 knows more.

szy21 commented 5 months ago

The latest build has 2.03 SYPD for held suarez and 1.47 SYPD for held suarez with cloud diagnostics per stage, so ~30% difference (assuming they are using the same GPU on central). I don't know whether the slowdown is mostly from compute_gm_mixing_length! (which includes buoyancy_gradients) or quad_loop.

charleskawczynski commented 4 months ago

The buoyancy gradients itself is now only 547 μs (xref: https://github.com/CliMA/ClimaAtmos.jl/pull/2951#issuecomment-2077315044). Closing.

CliMA / ClimaAtmos.jl

Improve the performance of buoyancy_gradients #2530