sriharshakandala opened 4 months ago
Here are more things we could explore to improve scaling:
MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED

```sh
# (Optional: Enable GPU managed memory if required.)
# From `man mpi`: This setting will allow MPI to properly
# handle unified memory addresses. This setting has performance
# penalties as MPICH will perform a buffer query on each buffer
# that is handled by MPI.
# If you see runtime errors like
#   (GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE
# make sure this variable is set.
export MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1
```
Maybe with this we no longer need `JULIA_MEMORY_POOL="none"`, which would allow us to use the new allocator.
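If that works out, the relevant part of the job-script environment would reduce to something like the sketch below (assuming `JULIA_MEMORY_POOL` is the variable our launch scripts currently export; whether the allocator then plays well with GPU-aware MPI still needs to be verified):

```sh
# Enable MPICH handling of CUDA managed/unified memory addresses.
export MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1

# Currently set to disable the memory pool; possibly no longer required
# with the setting above, which would let us use the new allocator.
# export JULIA_MEMORY_POOL="none"
```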
From the same page:
Binding MPI ranks to CPU cores can also be an important performance consideration for GPU-enabled codes, and can be done with the --cpu-bind option to mpiexec. For the above example using 2 nodes, 4 MPI ranks per node, and 1 GPU per MPI rank, binding each of the MPI ranks to one of the four separate NUMA domains within a node is likely to be optimal for performance. This could be done as follows:
```sh
mpiexec -n 8 -ppn 4 --cpu-bind verbose,list:0:16:32:48 ./set_gpu_rank ./executable_name
```
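For reference, a hypothetical sketch of what a `set_gpu_rank`-style wrapper typically does; the local-rank environment variable (`PALS_LOCAL_RANKID` here) is an assumption and depends on the launcher:

```sh
#!/bin/bash
# Hypothetical set_gpu_rank-style wrapper: map each local MPI rank to one GPU,
# then exec the real program. The local-rank variable is launcher-dependent.
num_gpus=$(nvidia-smi -L | wc -l)
local_rank=${PALS_LOCAL_RANKID:-0}
export CUDA_VISIBLE_DEVICES=$(( local_rank % num_gpus ))
exec "$@"
```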
We should ensure that we are doing device-to-device communication, including when going across nodes (this might require fiddling with the RDMA protocol).
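One possible way to check this (an assumption, not something we have set up): run a GPU-aware bandwidth benchmark such as the OSU micro-benchmarks, built with CUDA support, between two ranks on different nodes with device buffers on both sides:

```sh
# "D D" requests device (GPU) buffers on both sender and receiver; one rank
# per node exercises the inter-node path. Exact flags depend on the
# benchmark version.
mpiexec -n 2 -ppn 1 ./set_gpu_rank ./osu_bw D D
```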
It would also be good to do at least one profiled run to check why scaling drops off.
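For example, a sketch assuming Nsight Systems is available on the compute nodes and that the launcher exports a per-rank variable such as `PMI_RANK` (the report-name substitution may need adjusting):

```sh
# One Nsight Systems report per rank; %q{PMI_RANK} expands the environment
# variable into the output file name.
mpiexec -n 8 -ppn 4 --cpu-bind verbose,list:0:16:32:48 ./set_gpu_rank \
    nsys profile -o report_rank%q{PMI_RANK} ./executable_name
```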
Compile scaling results for ClimaAtmos.jl on the Derecho supercomputer.