Open charleskawczynski opened 2 months ago
I looked at the nvtx report from #2947, and this is what I'm seeing:
Here is a full `step!` (labeled by `ClimaTimeSteppers` on the left), for the big picture view:
The 3 "large" cpu calls to `remaining_tendency!` / `hyperdiffusion_tendency!` are the ones that follow the implicit solve. They only appear large because many kernels are launched during the implicit solve, which pile up in the gpu queue, and there's a `CUDA.synchronize` call in `hyperdiffusion_tendency!`'s DSS call.
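The queuing effect described above can be reproduced in isolation. The following is a minimal sketch, assuming CUDA.jl and a CUDA-capable gpu are available; `launch_many!` and `cheap_call_with_sync!` are hypothetical stand-ins for illustration, not ClimaAtmos functions.

```julia
using CUDA

# Hypothetical stand-in for a phase that launches many kernels
# (e.g. an implicit solve). On a gpu array, each broadcast launches
# one kernel asynchronously and returns to the cpu right away.
function launch_many!(y, x, n_launches)
    for _ in 1:n_launches
        y .= sin.(x) .+ y
    end
    return nothing
end

# Hypothetical stand-in for a later, cheap call that synchronizes.
function cheap_call_with_sync!(y)
    y .+= 1f0
    CUDA.synchronize()  # blocks the cpu until all queued kernels finish
    return nothing
end

if CUDA.functional()
    x = CUDA.rand(Float32, 2^24)
    y = CUDA.zeros(Float32, 2^24)

    # Warm-up so that compilation time is not measured.
    launch_many!(y, x, 1)
    cheap_call_with_sync!(y)

    # cpu-side wall time of each phase.
    t_launch = @elapsed launch_many!(y, x, 200)
    t_sync = @elapsed cheap_call_with_sync!(y)

    println("launch phase (cpu time):          ", round(t_launch * 1e3; digits = 2), " ms")
    println("call with synchronize (cpu time): ", round(t_sync * 1e3; digits = 2), " ms")
end
```

One would typically expect the launch phase to return quickly and the second call to absorb most of the wait, although the exact numbers depend on the hardware and on how many launches the queue can hold before the cpu blocks.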
Let's zoom in to see who is responsible for these kernel launches. I'll click on the cpu functions (labeled `ClimaAtmos`), and the kernels (`99.9% Kernels (named by NVTX)` above) will highlight:
`dss!` (2.7 ms):
`set_precomputed_quantities!` (22.7 ms):
`wfact!` (18.2 ms):
`implicit_tendency!` (7.4 ms):
`ldiv!` (107.4 ms):
The one after that is another call to `set_precomputed_quantities!`.
Recap of total duration of gpu launched kernels:
- `dss!` (2.7 ms)
- `set_precomputed_quantities!` (22.7 ms)
- `wfact!` (18.2 ms)
- `implicit_tendency!` (7.4 ms)
- `ldiv!` (107.4 ms)
- `step!` (1.59 seconds)

Some things are obvious and clear (for a single A100 gpu):
- `ldiv!` is by far the most expensive (107.4 ms).
- Next is `set_precomputed_quantities!`, not far behind is `wfact!`, then `implicit_tendency!`, then `dss!`.
- `buoyancy_gradients` (in `set_precomputed_quantities!`) takes about 8 ms.

Perhaps fortunately, `ldiv!` basically calls `MatrixFields.field_matrix_solve!(A.solver, x, A.matrix, b)`, so the majority of the performance improvements can be done in ClimaCore, hopefully without even requiring changes in ClimaAtmos.
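To put the recap numbers in proportion, here is a small sketch that ranks the listed functions by kernel time and prints each one's share. It uses only the durations quoted above; the variable names are illustrative, and the total covers only the five listed functions (it leaves out the second `set_precomputed_quantities!` call and anything else inside `step!`).

```julia
# Kernel durations (ms) copied from the recap in this issue.
kernel_ms = [
    "dss!" => 2.7,
    "set_precomputed_quantities!" => 22.7,
    "wfact!" => 18.2,
    "implicit_tendency!" => 7.4,
    "ldiv!" => 107.4,
]

# Total over the listed functions only.
total_ms = sum(last, kernel_ms)

# Sort by duration, largest first, and print each function's share.
for (name, ms) in sort(kernel_ms; by = last, rev = true)
    share = round(100 * ms / total_ms; digits = 1)
    println(rpad(name, 30), lpad(ms, 7), " ms  ", lpad(share, 5), " %")
end
println("total of listed kernels: ", round(total_ms; digits = 1), " ms")
```

With these inputs `ldiv!` accounts for roughly two thirds of the listed kernel time, which is why work on the linear solve in ClimaCore looks like the best first target.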
This issue is for tracking the performance of the prognostic implicit EDMF.