sriharshakandala opened 4 months ago
Here are more things we could explore to improve scaling:
MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED

```sh
# (Optional: Enable GPU managed memory if required.)
# From `man mpi`: This setting will allow MPI to properly
# handle unified memory addresses. This setting has performance
# penalties as MPICH will perform a buffer query on each buffer
# that is handled by MPI.
# If you see runtime errors like
#   (GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE
# make sure this variable is set.
export MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1
```
Maybe with this we no longer need `JULIA_MEMORY_POOL="none"`, which would allow us to use the new allocator.
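If that works out, the relevant part of the job-script environment would reduce to something like the sketch below (assuming `JULIA_MEMORY_POOL` is the variable our launch scripts currently export; whether the allocator then plays well with GPU-aware MPI still needs to be verified):

```sh
# Enable MPICH handling of CUDA managed/unified memory addresses.
export MPICH_GPU_MANAGED_MEMORY_SUPPORT_ENABLED=1

# Currently set to disable the memory pool; possibly no longer required
# with the setting above, which would let us use the new allocator.
# export JULIA_MEMORY_POOL="none"
```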
From the same page:
Binding MPI ranks to CPU cores can also be an important performance consideration for GPU-enabled codes, and can be done with the --cpu-bind option to mpiexec. For the above example using 2 nodes, 4 MPI ranks per node, and 1 GPU per MPI rank, binding each of the MPI ranks to one of the four separate NUMA domains within a node is likely to be optimal for performance. This could be done as follows:
```sh
mpiexec -n 8 -ppn 4 --cpu-bind verbose,list:0:16:32:48 ./set_gpu_rank ./executable_name
```
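For reference, a hypothetical sketch of what a `set_gpu_rank`-style wrapper typically does; the local-rank environment variable (`PALS_LOCAL_RANKID` here) is an assumption and depends on the launcher:

```sh
#!/bin/bash
# Hypothetical set_gpu_rank-style wrapper: map each local MPI rank to one GPU,
# then exec the real program. The local-rank variable is launcher-dependent.
num_gpus=$(nvidia-smi -L | wc -l)
local_rank=${PALS_LOCAL_RANKID:-0}
export CUDA_VISIBLE_DEVICES=$(( local_rank % num_gpus ))
exec "$@"
```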
We should ensure that we are doing device-to-device communication, including when going across nodes (this might require fiddling with the RDMA protocol).
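One possible way to check this (an assumption, not something we have set up): run a GPU-aware bandwidth benchmark such as the OSU micro-benchmarks, built with CUDA support, between two ranks on different nodes with device buffers on both sides:

```sh
# "D D" requests device (GPU) buffers on both sender and receiver; one rank
# per node exercises the inter-node path. Exact flags depend on the
# benchmark version.
mpiexec -n 2 -ppn 1 ./set_gpu_rank ./osu_bw D D
```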
It would also be good to do at least one profiled run to check why scaling drops off.
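For example, a sketch assuming Nsight Systems is available on the compute nodes and that the launcher exports a per-rank variable such as `PMI_RANK` (the report-name substitution may need adjusting):

```sh
# One Nsight Systems report per rank; %q{PMI_RANK} expands the environment
# variable into the output file name.
mpiexec -n 8 -ppn 4 --cpu-bind verbose,list:0:16:32:48 ./set_gpu_rank \
    nsys profile -o report_rank%q{PMI_RANK} ./executable_name
```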
Compile scaling results for ClimaAtmos.jl on the Derecho supercomputer.