CliMA / ClimaAtmos.jl

ClimaAtmos.jl is a library for building atmospheric circulation models that is designed from the outset to leverage data assimilation and machine learning tools. We welcome contributions!
Apache License 2.0

Performance roadmap #2632

Open charleskawczynski opened 7 months ago

charleskawczynski commented 7 months ago

This issue is a continuation of CliMA/ClimaAtmos.jl#635, but I'm excluding some items (some have been addressed, and others I've explained in CliMA/ClimaAtmos.jl#635) to reduce the noise.

Memory access patterns

We should make sure that we inline all kernels, use shared/local memory when possible, and ensure we have coalesced reads/writes.

Reducing loads and stores

The primary way to improve performance beyond our current state is to reduce the number of memory loads and stores. One way to do that is by fusing operations, which can allow the compiler to hoist (and eliminate) memory loads/stores. Another way is to explicitly pass less data through broadcast expressions (where possible).
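To make the load/store argument concrete, here is a minimal sketch in plain Julia. It is not ClimaCore code and does not use the `@fuse` macro; `unfused!` and `fused!` are hypothetical names chosen for illustration. The unfused version runs two separate broadcast passes, so each element of `a` and `b` is read twice; the hand-fused loop reads each element once and reuses it.

```julia
# Unfused: two separate broadcast passes ("kernels").
# Per element: 4 loads (a, b read twice) and 2 stores.
function unfused!(y1, y2, a, b)
    @. y1 = a + b
    @. y2 = a * b
    return nothing
end

# Hand-fused: one pass over the data.
# Per element: 2 loads (a, b read once) and 2 stores.
function fused!(y1, y2, a, b)
    @inbounds for i in eachindex(y1, y2, a, b)
        ai, bi = a[i], b[i]   # each input is loaded once and reused
        y1[i] = ai + bi
        y2[i] = ai * bi
    end
    return nothing
end

a = collect(1.0:8.0)
b = collect(8.0:-1.0:1.0)
u1, u2 = similar(a), similar(a)
f1, f2 = similar(a), similar(a)
unfused!(u1, u2, a, b)
fused!(f1, f2, a, b)
```

Both versions produce the same results; the difference is only in how many times the inputs travel through memory, which is what matters for bandwidth-bound kernels.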

There are a few different options / paths to capturing some of this performance that we've left on the table, and each approach has its limitations, pros and cons:

$^1$ It's important to note that one optimization can nullify the other. That is, if we perform two optimizations:

1) eliminate loading X from kernel A and eliminate loading Y from kernel B
2) fuse kernels A and B

we could end up with the same number of loads and stores as if we had performed only optimization 1) or 2) alone.
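A small counting exercise illustrates the footnote. The setup is an assumption made purely for illustration: kernels A and B each initially load the same two arrays, X and Y, a separate kernel loads every one of its inputs, and a fused kernel loads each distinct input once. The helper names `separate` and `fused` are hypothetical.

```julia
# Inputs each (hypothetical) kernel loads before any optimization.
loads_A = Set([:X, :Y])
loads_B = Set([:X, :Y])

# Separate kernels: every kernel loads all of its own inputs.
separate(kernels...) = sum(length, kernels)
# Fused kernel: each distinct input is loaded once.
fused(kernels...) = length(union(kernels...))

# Optimization 1: drop X from kernel A and Y from kernel B.
A_opt = setdiff(loads_A, [:X])
B_opt = setdiff(loads_B, [:Y])

baseline  = separate(loads_A, loads_B)  # no optimization
only_opt1 = separate(A_opt, B_opt)      # optimization 1 alone
only_opt2 = fused(loads_A, loads_B)     # optimization 2 alone
both      = fused(A_opt, B_opt)         # optimizations 1 and 2 together
```

Under these assumptions the baseline costs 4 loads, and each of the three optimized variants costs 2, so applying both optimizations gains nothing over applying either one alone.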

Removing unnecessary work

We can remove unnecessary work, e.g., in the precomputed quantities, or by using a caching system.

Parallelism

There are other optimizations we can perform, which can also have a notable impact. For example, parallelizing work, reducing allocations to reduce the frequency of GC, reducing MPI communication, and emitting more efficient low-level code. Below is a list of some of these items:

Scaling

Minimize the number of DSS calls and GC calls.
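For the GC side, the task list mentions `GC.enable(false)` and `GC.gc()`. Below is a minimal sketch of how that could look in a time-stepping loop. The function `run_with_manual_gc!`, its `step!` argument, and the `gc_every` keyword are hypothetical names, not part of ClimaAtmos. The sketch assumes that a collection requested while the GC is disabled may be skipped, so it re-enables the GC briefly around each manual collection.

```julia
# Hypothetical driver: runs `nsteps` steps with automatic GC disabled and
# collects manually every `gc_every` steps.
function run_with_manual_gc!(step!, state, nsteps; gc_every = 100)
    was_enabled = GC.enable(false)  # returns the previous GC state
    try
        for n in 1:nsteps
            step!(state)
            if n % gc_every == 0
                # Assumption: re-enable briefly so the collection actually runs.
                GC.enable(true)
                GC.gc()
                GC.enable(false)
            end
        end
    finally
        GC.enable(was_enabled)      # restore the original GC state
    end
    return state
end

# Usage with a trivial stand-in for a model step.
state = Ref(0)
run_with_manual_gc!(s -> (s[] += 1), state, 250; gc_every = 100)
```

Because every process collects at the same step count, the pauses should line up across processes instead of occurring at different places, which is the scaling concern raised in the task list. This only helps once allocations are low enough that memory does not grow too much between collections.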

Misc

There are other miscellaneous items, specified in the task list.

### Tasks
- [x] Minimize the number of DSS calls. There are some ungrouped weighted DSS calls that can easily be grouped together. This is extremely low-hanging fruit and should probably be done ASAP: https://github.com/CliMA/ClimaAtmos.jl/pull/2689. We can still group DSS calls in `dss_hyperdiffusion_tendency!` and `dss_tracer_hyperdiffusion_tendency!` if we fuse `T_exp!` and `T_lim!`. Done in https://github.com/CliMA/ClimaAtmos.jl/pull/2758
- [ ] GC pauses can happen at different places on different processes, which hampers scaling efficiency (processes wait for other processes that are running GC). Once we've reduced allocations, we can disable the GC with `GC.enable(false)` and trigger it manually at intervals with `GC.gc()`.
- [ ] Measure performance of new diagnostics. We've added a callbacks flame graph, which shows time spent during diagnostics, but we should use more realistic parameters, like resolution and frequency of different diagnostics.
- [x] Use `*` over `/` where the division is not already being optimized **low impact, very little effort** https://github.com/CliMA/ClimaCore.jl/pull/1496
- [ ] Write custom kernels for highly expensive operations. Hand-written kernels could improve the performance of specific kernels.
- [ ] Improve RRTMGP performance (https://github.com/CliMA/RRTMGP.jl/issues/389 seems important)
- [ ] https://github.com/CliMA/ClimaTimeSteppers.jl/issues/233
- [ ] https://github.com/CliMA/ClimaTimeSteppers.jl/issues/247
- [x] Collect kernel stats for all CUDA kernels on A100 https://github.com/CliMA/ClimaCore.jl/pull/1729, https://github.com/CliMA/ClimaTimeSteppers.jl/pull/277
- [x] Create prototype example to demo performance benefit of `@fuse` (without implementing the macro)
- [x] Reduce number of DSS calls https://github.com/CliMA/ClimaAtmos.jl/pull/2689
- [x] Group dss calls in `dss_hyperdiffusion_tendency!` and `dss_tracer_hyperdiffusion_tendency!` by fusing `T_exp!` and `T_lim!`. Done in https://github.com/CliMA/ClimaAtmos.jl/pull/2758.
- [x] Always inline in ClimaCore's broadcast kernels https://github.com/CliMA/ClimaCore.jl/pull/1647
- [x] Implement similar-space point-wise fused kernels https://github.com/CliMA/ClimaCore.jl/pull/1641
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1739
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1740
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1741
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1742
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1743
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1744
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1745
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1738
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1746
- [ ] https://github.com/CliMA/ClimaTimeSteppers.jl/issues/270
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1747
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1748
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/11
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1753
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1754
- [ ] https://github.com/CliMA/ClimaCore.jl/issues/1910
charleskawczynski commented 6 months ago

I've removed the prototype to reduce the noise in this issue, since we have already developed https://github.com/CliMA/MultiBroadcastFusion.jl, which has performance tests.

I'm pleasantly surprised that the generic/recursive pattern appears (somehow) more performant than the hard-coded one, but I'll take it!

tapios commented 6 months ago

Really nice and helpful. Thank you!