CliMA / ClimaShallowWater.jl

Look into issues with CUDA-aware MPI #8

Closed: simonbyrne closed this issue 1 year ago

simonbyrne commented 1 year ago

It looks like we're not making use of direct GPU-GPU communication. We need to look at the configuration.

simonbyrne commented 1 year ago

This has now been fixed by IMSS.

Note that you need to set the environment variable:

JULIA_CUDA_MEMORY_POOL=none

See #9
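
As a hedged aside (not part of ClimaShallowWater itself), the memory-pool setting and the CUDA-awareness of the MPI build can be checked at startup. The environment variable has to be set before CUDA.jl first touches the GPU, so it is best set in the job script; the snippet below is only an illustration:

    # Hedged, illustrative check; none of these names are ClimaShallowWater internals.
    # JULIA_CUDA_MEMORY_POOL must be set before CUDA.jl first touches the GPU,
    # ideally via `export JULIA_CUDA_MEMORY_POOL=none` in the job script.
    ENV["JULIA_CUDA_MEMORY_POOL"] = "none"

    using CUDA
    using MPI

    MPI.Init()
    # MPI.has_cuda() reports whether the MPI library was built with CUDA support;
    # it does not guarantee that GPU-direct paths are actually taken at runtime.
    @info "CUDA-aware MPI build" MPI.has_cuda()
    @info "CUDA memory pool setting" get(ENV, "JULIA_CUDA_MEMORY_POOL", "default")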

simonbyrne commented 1 year ago

Okay, not quite fixed.

simonbyrne commented 1 year ago

Looking at the profile of a 4-process run (download nsys.tar.gz from https://buildkite.com/clima/climashallowwater-ci/builds/63#0187388e-2136-4af8-914e-fe2b1dea730b):

[Screenshot (2023-03-31): Nsight Systems timeline of the 4-process run]

We can see that the runtime is dominated by the tendency evaluation, and that most of that is spent in the DSS calls.

If we zoom into the DSS, it is clear that it isn't using direct GPU-GPU communication:

[Screenshot (2023-03-31): zoomed-in view of the DSS communication]

Each send consists of a device-to-host copy ("DtoH memcpy") followed by a stream synchronize, which blocks the MPI_Startall:

[Screenshot (2023-03-31): send path showing the DtoH memcpy and stream synchronize before MPI_Startall]

Similarly, each receive consists of a host-to-device copy ("HtoD memcpy") followed by a stream synchronize, and the copy isn't even started until the MPI_Waitall.

  1. We need to figure out why MPI is not using direct GPU communication. @vchuravy suggested it may have something to do with which devices are visible, so we may want to try srun --gpu-bind=none. We can also look at some of the Open MPI or UCX debug output:

    export OMPI_MCA_btl_base_verbose=10
    export OMPI_MCA_pml_base_verbose=10
    export UCX_LOG_LEVEL=info # can increase to debug/trace
    export UCX_PROTO_ENABLE=y
    export UCX_PROTO_INFO=y
  2. We could launch the MPI communication on a separate thread (i.e. using Threads.@spawn) and use blocking communication; a minimal sketch is given after this list.

  3. We could combine the buffers for both fields (u and h): this would halve the number of operations.

  4. If we're not able to use direct GPU-GPU communication, we should implement the double buffering ourselves (i.e. do a single DtoH memcpy for all sends, and a single HtoD memcpy for all receives); a sketch of this is also given below.
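
A minimal sketch of option 2, assuming MPI.jl with threading enabled; the buffer and rank names (send_buf, recv_buf, dest, src, comm) are placeholders rather than the actual DSS buffers:

    # Run the blocking MPI calls on a worker thread so the main thread can keep
    # launching GPU work. Requires `julia -t 2` (or more) and MPI initialized
    # with MPI.Init(threadlevel = :multiple).
    using MPI

    function exchange_on_thread(send_buf, recv_buf, dest, src, comm)
        return Threads.@spawn begin
            # Blocking combined send/receive; returns once both directions complete.
            MPI.Sendrecv!(send_buf, recv_buf, comm; dest = dest, source = src)
        end
    end

    # Usage pattern: overlap independent GPU work with the exchange, then wait.
    # task = exchange_on_thread(send_buf, recv_buf, dest, src, comm)
    # ... launch kernels that do not need the halo data ...
    # wait(task)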
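
And a hedged sketch of the double-buffering fallback in option 4; the buffer sizes, neighbor ranks, and index ranges are illustrative placeholders, not the actual ClimaCore DSS data structures:

    # Stage all halo data through single host buffers: one DtoH copy covering
    # every send and one HtoD copy covering every receive, instead of one per
    # neighbor as in the profile above.
    using MPI, CUDA

    function halo_exchange!(d_send::CuVector, d_recv::CuVector,
                            h_send::Vector, h_recv::Vector,
                            comm, neighbors, send_ranges, recv_ranges)
        copyto!(h_send, d_send)   # single DtoH copy for all sends
        CUDA.synchronize()        # ensure device work is done before MPI reads the host buffer

        reqs = MPI.Request[]
        for (rank, sr, rr) in zip(neighbors, send_ranges, recv_ranges)
            push!(reqs, MPI.Irecv!(view(h_recv, rr), comm; source = rank, tag = 0))
            push!(reqs, MPI.Isend(view(h_send, sr), comm; dest = rank, tag = 0))
        end
        MPI.Waitall(reqs)

        copyto!(d_recv, h_recv)   # single HtoD copy for all receives
        return nothing
    end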

simonbyrne commented 1 year ago

This looks like it might be fixed

[Screenshot (2023-04-17): updated Nsight Systems profile of the DSS communication]

vchuravy commented 1 year ago

That looks indeed much better!

charleskawczynski commented 1 year ago

This looks like it might be fixed

Nice! What was the fix?

simonbyrne commented 1 year ago

I'm not sure: when I downloaded the artifact I referred to above, it already appeared to be fixed, so I think I may have messed up earlier.