This has now been fixed by IMSS.
Note that you need to set the environment variable:
JULIA_CUDA_MEMORY_POOL=none
See #9
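As a rough sketch (not part of ClimaShallowWater; the guard below is just illustrative), the variable has to be in the environment before CUDA.jl initializes, so one option is to fail fast in the driver script if it is missing:

```julia
# Hypothetical startup guard: JULIA_CUDA_MEMORY_POOL must be set before
# CUDA.jl initializes, since memory from the default stream-ordered pool
# is what trips up the CUDA-aware MPI transports.
if get(ENV, "JULIA_CUDA_MEMORY_POOL", "") != "none"
    error("Set JULIA_CUDA_MEMORY_POOL=none before launching Julia (see #9)")
end

using CUDA, MPI
```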
Okay, not quite fixed.
Looking at the profile of a 4-process run (download the nsys.tar.gz from https://buildkite.com/clima/climashallowwater-ci/builds/63#0187388e-2136-4af8-914e-fe2b1dea730b), we can see that most of the time is spent in the tendency evaluation, and that most of that is in the DSS calls.
If we zoom into the DSS, it is clear that it isn't using direct GPU-GPU communication: each send consists of a device-to-host copy ("DtoH memcpy") followed by a stream synchronize, which blocks the MPI_Startall; similarly, each receive consists of a host-to-device copy ("HtoD memcpy") followed by a stream synchronize, which isn't even started until the MPI_Waitall.
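For reference, here's a rough sketch of the difference (not the actual DSS code; buffer sizes, tags, and neighbor ranks are made up). With a CUDA-aware build the CuArray is handed to MPI directly; the staged path is what the profile above shows.

```julia
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
dst = mod(MPI.Comm_rank(comm) + 1, MPI.Comm_size(comm))
src = mod(MPI.Comm_rank(comm) - 1, MPI.Comm_size(comm))

send_d = CUDA.rand(Float64, 1024)   # hypothetical halo buffer on the GPU
recv_d = CUDA.zeros(Float64, 1024)

# Direct GPU-GPU path (requires a CUDA-aware MPI/UCX build):
reqs = [
    MPI.Isend(send_d, comm; dest = dst, tag = 0),
    MPI.Irecv!(recv_d, comm; source = src, tag = 0),
]
MPI.Waitall(reqs)

# Staged path (what the profile shows): a DtoH copy + stream sync before
# the send, and the HtoD copy only happens after the MPI.Waitall.
send_h = Array(send_d)              # DtoH memcpy + synchronize
recv_h = zeros(Float64, 1024)
reqs = [
    MPI.Isend(send_h, comm; dest = dst, tag = 1),
    MPI.Irecv!(recv_h, comm; source = src, tag = 1),
]
MPI.Waitall(reqs)
copyto!(recv_d, recv_h)             # HtoD memcpy
```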
We need to figure out why MPI is not using direct GPU communication. @vchuravy suggested it may have something to do with which devices are visible, so we may want to use srun --gpu-bind=none? We can also look at some of the Open MPI or UCX debug information:
export OMPI_MCA_btl_base_verbose=10
export OMPI_MCA_pml_base_verbose=10
export UCX_LOG_LEVEL=info # can increase to debug/trace
export UCX_PROTO_ENABLE=y
export UCX_PROTO_INFO=y
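It may also be worth checking from Julia whether the loaded MPI library reports CUDA support at all (a quick diagnostic, not a fix):

```julia
using MPI
MPI.Init()
# MPI.has_cuda() asks the underlying library whether it was built with CUDA
# support (e.g. via MPIX_Query_cuda_support for Open MPI); if this returns
# false, host staging like the profile above is expected.
println("MPI library: ", MPI.Get_library_version())
println("CUDA-aware:  ", MPI.has_cuda())
```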
We could launch the MPI communication on a separate thread and use blocking communication (i.e. using Threads.@spawn); see the first sketch below.
We could combine the buffers for both fields (u and h): this would halve the number of operations.
If we're not able to use direct GPU-GPU communication, we should implement double buffering ourselves (i.e. do a single DtoH memcpy for all sends, and a single HtoD memcpy for all receives); see the second sketch below.
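A minimal sketch of the separate-thread idea (hypothetical host buffers; requires julia -t 2 or more, and an MPI build with full thread support):

```julia
using MPI

# Blocking communication on a separate thread needs MPI initialized with
# THREAD_MULTIPLE support.
MPI.Init(threadlevel = :multiple)
comm = MPI.COMM_WORLD
other = mod(MPI.Comm_rank(comm) + 1, MPI.Comm_size(comm))

send_h = rand(Float64, 1024)        # hypothetical host-side halo buffer
recv_h = zeros(Float64, 1024)

comm_task = Threads.@spawn begin
    # Blocking exchange runs here, so the main thread can keep launching
    # kernels for the interior of the domain in the meantime.
    MPI.Sendrecv!(send_h, recv_h, comm; dest = other, source = other)
end

# ... overlap interior computation here ...

wait(comm_task)                     # halo data is now in recv_h
```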
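And a sketch of the combined-buffer / single-staging-copy idea (field names u and h and the packing layout are just for illustration, not the actual DSS buffers):

```julia
using CUDA

# Hypothetical device-side halo slabs for the two prognostic fields u and h.
u_halo_d = CUDA.rand(Float64, 256)
h_halo_d = CUDA.rand(Float64, 256)

# Pack both fields into one contiguous device buffer (one small kernel),
# so the host staging needs only a single DtoH memcpy per neighbor.
stage_d = vcat(u_halo_d, h_halo_d)
stage_h = Array(stage_d)            # single DtoH copy + sync for both fields

# stage_h is the one send buffer handed to MPI; the receiving rank does a
# single HtoD copy and unpacks on the device.
```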
This looks like it might be fixed
That indeed looks much better!
Nice! What was the fix?
I'm not sure: when I downloaded the artifact I referred to above, it seemed to be fixed, so I think I messed up?
It looks like we're not making use of GPU-GPU direct communication. We need to look at the configuration.