CliMA / ClimaAtmos.jl

ClimaAtmos.jl is a library for building atmospheric circulation models that is designed from the outset to leverage data assimilation and machine learning tools. We welcome contributions!
Apache License 2.0

Improve prognostic implicit edmf performance #2950

Open charleskawczynski opened 2 months ago

charleskawczynski commented 2 months ago

This issue is for tracking the performance of prognostic implicit EDMF.

### Tasks
- [ ] Make a reproducer for the kernels discussed here: https://github.com/CliMA/ClimaAtmos.jl/pull/2951#issuecomment-2077315044
- [ ] Implement shared memory for FD kernels, and see if this helps
- [ ] Collect launch statistics/benchmarks based on hard-coded threads/blocks vs `CUDA.launch_configuration`
- [ ] Implement and apply broadcast fusion for similar pointwise expressions
- [ ] Implement and apply broadcast fusion for similar FD expressions
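
To illustrate what the broadcast-fusion tasks above are after, here is a minimal CPU-only sketch in plain Julia. It is not ClimaCore's implementation: the array names and the helpers `unfused_update!` and `fused_update!` are hypothetical, and plain arrays stand in for fields. The idea is that two pointwise updates reading the same inputs can be computed in one pass over the data (one kernel launch on a GPU) instead of two.

```julia
# Hypothetical illustration of fusing two similar pointwise broadcasts.
# Plain Julia arrays stand in for fields; nothing here is ClimaCore API.

# Unfused: two separate broadcasts, i.e. two passes over `a` and `b`
# (on a GPU, two kernel launches).
function unfused_update!(y1, y2, a, b)
    @. y1 = a + 2 * b
    @. y2 = a * b - 1
    return nothing
end

# Fused: one loop computes both results, reading `a` and `b` once
# (on a GPU, this would correspond to a single kernel launch).
function fused_update!(y1, y2, a, b)
    @inbounds for i in eachindex(y1, y2, a, b)
        ai, bi = a[i], b[i]
        y1[i] = ai + 2 * bi
        y2[i] = ai * bi - 1
    end
    return nothing
end

a = collect(1.0:5.0)
b = collect(0.5:0.5:2.5)
y1, y2 = similar(a), similar(a)
z1, z2 = similar(a), similar(a)

unfused_update!(y1, y2, a, b)
fused_update!(z1, z2, a, b)

# Both versions produce identical results; only the number of passes differs.
println(y1 == z1 && y2 == z2)
```

On a GPU the saving would come from fewer kernel launches and fewer reads of the shared inputs; how much that matters for the kernels in this issue is exactly what the benchmarks in the task list would need to show.
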

charleskawczynski commented 2 months ago

I looked at the NVTX report from #2947, and this is what I'm seeing:

Here is a full `step!` (labeled by ClimaTimeSteppers on the left), for the big-picture view:

(Screenshot: 2024-04-24, 9:25:40 AM)

The three "large" CPU calls to `remaining_tendency!` / `hyperdiffusion_tendency!` are the ones that follow the implicit solve. They only appear large because the many kernels launched during the implicit solve pile up in the GPU queue, and the DSS call inside `hyperdiffusion_tendency!` contains a `CUDA.synchronize` call, so the CPU waits there until the queue has drained.

Let's zoom in to see which functions are responsible for these kernel launches. I'll click on the CPU functions (labeled ClimaAtmos), and the corresponding kernels (the "99.9% Kernels" row above, named by NVTX) will highlight:

`dss!` (2.7 ms):

(Screenshot: 2024-04-24, 9:40:36 AM)

`set_precomputed_quantities!` (22.7 ms):

(Screenshot: 2024-04-24, 9:43:46 AM)

`wfact!` (18.2 ms):

(Screenshot: 2024-04-24, 9:44:15 AM)

`implicit_tendency!` (7.4 ms):

(Screenshot: 2024-04-24, 9:44:46 AM)

`ldiv!` (107.4 ms):

(Screenshot: 2024-04-24, 9:45:08 AM)

The call after that is another call to `set_precomputed_quantities!`.

charleskawczynski commented 2 months ago

### Summary

Recap of the total duration of GPU-launched kernels:
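
As a rough tally, the five durations quoted earlier in this thread can be summed to see each function's share of the listed kernel time. This is only arithmetic on those quoted numbers, not a new measurement; the second call to `set_precomputed_quantities!` is left out because no duration is quoted for it, and the variable names are made up for this sketch.

```julia
# Durations (ms) quoted earlier in this thread for the GPU kernels launched
# by each function. The later, second call to set_precomputed_quantities!
# is not included because no duration is quoted for it.
durations_ms = [
    "dss!" => 2.7,
    "set_precomputed_quantities!" => 22.7,
    "wfact!" => 18.2,
    "implicit_tendency!" => 7.4,
    "ldiv!" => 107.4,
]

# Sum the second element of each pair.
total_ms = sum(last, durations_ms)

# Print each function's duration and its percentage of the listed total.
for (name, ms) in durations_ms
    share = round(100 * ms / total_ms; digits = 1)
    println(rpad(name, 30), lpad(ms, 7), " ms  ", lpad(share, 5), " %")
end
println("total: ", round(total_ms; digits = 1), " ms")
```

Under these numbers the listed kernels add up to about 158.4 ms, with `ldiv!` accounting for roughly two thirds of that.
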

Some things are clear (for a single A100 GPU):

Perhaps fortunately, `ldiv!` is essentially a call to `MatrixFields.field_matrix_solve!(A.solver, x, A.matrix, b)`, so most of the performance improvements can be made in ClimaCore, hopefully without requiring any changes in ClimaAtmos.
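
The forwarding pattern described above can be sketched with toy types. Everything below is hypothetical (`ToySolver`, `ToyJacobian`, and `toy_matrix_solve!` are not ClimaAtmos or ClimaCore names, and a small tridiagonal matrix stands in for the real operator); it only shows why speeding up the lower-level solve would not require touching the caller.

```julia
using LinearAlgebra

# Hypothetical stand-ins: none of these names are ClimaAtmos or ClimaCore API.
struct ToySolver end

struct ToyJacobian{S, M}
    solver::S
    matrix::M
end

# The lower-level solve, standing in for the library routine.
# Any speedup made here benefits every caller automatically.
function toy_matrix_solve!(::ToySolver, x, matrix, b)
    x .= matrix \ b
    return x
end

# The model-level ldiv! is only a thin forwarding wrapper.
LinearAlgebra.ldiv!(x::AbstractVector, A::ToyJacobian, b::AbstractVector) =
    toy_matrix_solve!(A.solver, x, A.matrix, b)

# Small tridiagonal system whose exact solution is a vector of ones.
matrix = Tridiagonal([1.0, 1.0], [4.0, 4.0, 4.0], [1.0, 1.0])
A = ToyJacobian(ToySolver(), matrix)
b = [5.0, 6.0, 5.0]
x = similar(b)

ldiv!(x, A, b)
println(x)
```

Because the wrapper only forwards its arguments, replacing the body of `toy_matrix_solve!` with a faster algorithm leaves the `ldiv!` call site unchanged.
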