Open charleskawczynski opened 7 months ago
I've removed the prototype to reduce the noise in this issue, since we have already developed https://github.com/CliMA/MultiBroadcastFusion.jl, which has performance tests.
I'm pleasantly surprised that the generic/recursive pattern appears (somehow) more performant than the hard-coded one, but I'll take it!
Really nice and helpful. Thank you!
This issue is a continuation of CliMA/ClimaAtmos.jl#635, but I'm excluding some items (some addressed, and others which I've explained in CliMA/ClimaAtmos.jl#635) to reduce the noise.
Memory access patterns
We should make sure that we inline all kernels, use shared/local memory when possible, and ensure we have coalesced reads/writes.
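As a CPU-side analogue of coalesced access, here is a minimal sketch in plain Julia. The function names are hypothetical and nothing here is taken from the ClimaAtmos code base. Julia arrays are column-major, so keeping the first index in the innermost loop walks memory contiguously; on a GPU the corresponding goal is for neighboring threads to touch neighboring addresses.

```julia
# Hypothetical helper: first (fastest-varying) index in the inner loop,
# so consecutive iterations touch adjacent memory locations.
@inline function sum_contiguous(A::AbstractMatrix)
    s = zero(eltype(A))
    @inbounds for j in axes(A, 2), i in axes(A, 1)  # i varies fastest
        s += A[i, j]
    end
    return s
end

# Hypothetical helper: same result, but the inner loop strides across
# columns, jumping size(A, 1) elements on every iteration.
@inline function sum_strided(A::AbstractMatrix)
    s = zero(eltype(A))
    @inbounds for i in axes(A, 1), j in axes(A, 2)  # j varies fastest
        s += A[i, j]
    end
    return s
end

A = reshape(collect(1.0:12.0), 3, 4)
total_contiguous = sum_contiguous(A)
total_strided = sum_strided(A)
```

Both functions return the same value; only the order in which memory is visited differs, which is what matters for larger arrays.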
Reducing loads and stores
The primary way to improve performance beyond our current state is to reduce the number of memory loads and stores. One way to do that is by fusing operations, which can allow the compiler to hoist (and eliminate) memory loads/stores. Another way is to explicitly pass less data through broadcast expressions (where possible).
There are a few different options / paths to capturing some of this performance that we've left on the table, and each approach has its limitations, pros and cons:
@fuse begin @. a = b; @. c = d end
$^1$ It's important to note that one optimization can nullify the other. That is, if we perform two optimizations, 1) eliminate loading X from kernel A and eliminate loading Y from kernel B, and 2) fuse kernels A and B, we could end up with the same number of loads and stores as if we had performed only optimization 1) or only optimization 2).
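To make the load/store counting concrete, here is a minimal sketch in plain Julia. It does not use MultiBroadcastFusion.jl; the function names and arrays are hypothetical, and the fusion is written by hand as a single loop. Per element, the unfused version performs 4 loads and 2 stores, while the fused version performs 3 loads and 2 stores, because `a[i]` is read once and reused.

```julia
# Unfused: two broadcast kernels, each making its own pass over memory.
function unfused!(y1, y2, a, b, c)
    @. y1 = a + b   # loads a, b; stores y1
    @. y2 = a * c   # loads a again, c; stores y2
    return nothing
end

# Fused by hand: one pass, with the shared operand loaded a single time.
function fused!(y1, y2, a, b, c)
    @inbounds for i in eachindex(y1, y2, a, b, c)
        ai = a[i]            # single load of the shared operand
        y1[i] = ai + b[i]
        y2[i] = ai * c[i]
    end
    return nothing
end

a = collect(1.0:8.0)
b = collect(2.0:9.0)
c = collect(3.0:10.0)
y1 = similar(a); y2 = similar(a)
z1 = similar(a); z2 = similar(a)
unfused!(y1, y2, a, b, c)
fused!(z1, z2, a, b, c)
```

This also illustrates the footnote above: the only saving from fusing here is the shared load of `a`. If the second kernel had first been rewritten so that it no longer needed `a`, fusing the two kernels would not remove any further loads.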
Removing unnecessary work
We can remove unnecessary work, e.g., in precomputed quantities, or by using a caching system.
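One way to picture this is a quantity that several tendency terms need, computed once per step into preallocated storage instead of being recomputed in each term. The sketch below is only illustrative: `PrecomputedCache`, `set_precomputed!`, `tendency_a!`, and `tendency_b!` are hypothetical names, not the ClimaAtmos API.

```julia
# Hypothetical cache holding a quantity shared by several tendency terms.
struct PrecomputedCache{T}
    ke::Vector{T}   # kinetic energy, filled once per step
end

# Compute the shared quantity once, writing into preallocated storage.
function set_precomputed!(cache::PrecomputedCache, u, v)
    @. cache.ke = (u^2 + v^2) / 2
    return nothing
end

# Each tendency reads the cached value instead of recomputing it.
function tendency_a!(out, cache::PrecomputedCache)
    @. out = -cache.ke
    return nothing
end

function tendency_b!(out, cache::PrecomputedCache)
    @. out = 2 * cache.ke
    return nothing
end

u = [3.0, 0.0]
v = [4.0, 2.0]
cache = PrecomputedCache(similar(u))
set_precomputed!(cache, u, v)
ta = similar(u); tb = similar(u)
tendency_a!(ta, cache)
tendency_b!(tb, cache)
```

The trade-off is that a cached quantity costs an extra store and later loads, so caching pays off mainly when the quantity is expensive to compute or reused many times.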
Parallelism
There are other optimizations we can perform, which can also have a notable impact. For example, parallelizing work, reducing allocations to reduce the frequency of GC, reducing MPI communication, and emitting more efficient low-level code. Below is a list of some of these items:
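Of the items mentioned above, reducing allocations is the simplest to sketch. The example below is a minimal illustration with hypothetical function names: the first version returns a fresh array on every call, which adds GC pressure inside a time-stepping loop, while the second writes into a caller-provided buffer.

```julia
# Allocating: builds and returns a new array on every call.
allocating(a, b) = a .+ 2 .* b

# In-place: the fused broadcast writes into `out`, reusing its memory.
function inplace!(out, a, b)
    @. out = a + 2 * b
    return out
end

a = ones(1000)
b = ones(1000)
out = similar(a)
inplace!(out, a, b)
fresh = allocating(a, b)
```

In practice, `@allocated` or a profiler can confirm that the in-place version does not allocate once it has been compiled.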
Scaling
Minimize the number of DSS calls and GC calls.
Misc
There are other miscellaneous items, specified in the task list.