Closed szy21 closed 4 months ago
Update on this: the performance of buoyancy_gradients
is poor, however, the full story is that two kernels are expensive, from calling cloud_fraction_model_callback!
:
The NVTX trace, showing the relative performance hit compared to other high level kernels:
And the flame graph, showing what's being called with more granularity:
So, the buoyancy_gradients
is about 33% responsible for the cost of adding the call of cloud_fraction_model_callback!
, and the quad_loop
call is about 63% responsible. There are a few other broadcast expressions, but they are relatively inexpensive, so I'll focus on these two.
I think, to start with, we should simply break up buoyancy_gradients
and try to see if/when there's a catastrophic performance improvement. Just eye-balling it, I'd say that we should hoist out ᶜgradᵥ(ᶠinterp(ϕ))
for both broadcast expressions, and see where that gets us. At the very least, we could reuse these precomputed quantities.
We may need to apply some RootSolvers/Thermodynamics optimizations (e.g., https://github.com/CliMA/RootSolvers.jl/pull/51) to reduce the arithmetic intensity of those kernels (they are very expensive).
Also, next, I'll add a gpu job and see what the performance of this kernel looks like on the gpu, since that's probably a more important target to optimize.
cc @szy21, @trontrytel, @tapios
That's interesting, thanks! I reached the conclusion that buoyancy_gradients
is more expensive than quad_loop
as when I commented out buoyancy_gradients
the time-to-solution is similar to commenting out the entire cloud_fraction_model_callback!
, when I first noticed the problem. But maybe that's not an accurate way to estimate it, or maybe we improved the performance of buoyancy_gradients
since then.
Sounds good. If there is anything easy to optimize in Thermodynamics, I think we should start with it. We are calling each function 9 times when running with quadrature points. So even small improvements should show up
Here are some updates (cc @szy21):
Could you post the time-to-solution for the job with and without cloud diagnostics on GPU? Other than that I'm ok with closing this issue. Thanks for all the work!
Yep, from this PR (with 1 p100 gpu):
[ Info: solve!: 450.030 s (149709528 allocations: 19.45 GiB)
[ Info: sypd: 5.259918262132573
[ Info: wall_time_per_timestep: 234 milliseconds, 390 microseconds
So, it's actually not bad.
Yep, from this PR (with 1 p100 gpu):
[ Info: solve!: 450.030 s (149709528 allocations: 19.45 GiB) [ Info: sypd: 5.259918262132573 [ Info: wall_time_per_timestep: 234 milliseconds, 390 microseconds
So, it's actually not bad.
And how about the one without cloud diagnostics on GPU?
Good question, I'll convert the other one, too in that PR so that we can compare.
Without cloud diagnostics, we have:
[ Info: solve!: 442.539 s (150425525 allocations: 19.44 GiB)
[ Info: sypd: 5.348959731633479
[ Info: wall_time_per_timestep: 230 milliseconds, 489 microseconds
cc @szy21
Great, thanks!
@tapios mentioned that this is still an issue, so I'm reopening.
We should get updated numbers.
For cloud diagnostics, the issue is not directly buoyancy gradients, but gradientes of moisture/enthalpy. This may be related to the buoyancy gradient issue though. @szy21 knows more.
The latest build has 2.03 SYPD for held suarez and 1.47 SYPD for held suarez with cloud diagnostics per stage, so ~30% difference (assuming they are using the same GPU on central). I don't know whether the slowdown is mostly from compute_gm_mixing_length!
(which includes buoyancy_gradients) or quad_loop
.
The buoyancy gradients itself is now only 547 μs (xref: https://github.com/CliMA/ClimaAtmos.jl/pull/2951#issuecomment-2077315044). Closing.
2456 makes all the dycore simulations slower by a factor of two. When looking at it more closely, it seems most of the slowdown is from buoyancy_gradients. As a temporary solution, the mixing length and thus buoyancy gradients calculation is moved to a callback in #2466 without EDMF. With EDMF, this is still called at each timestep. It would be good if we can improve the performance of buoyancy_gradients. cc @charleskawczynski