eth-cscs / COSMA

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
BSD 3-Clause "New" or "Revised" License

Add RCCL and OFFLOAD profiling support for AMD GPUs #101

Closed gsitaram closed 2 years ago

gsitaram commented 2 years ago

This PR adds support for rcclReduceScatter using device buffers, in order to improve the performance of this collective operation across processes on multiple nodes. In addition, support for roctx ranges has been added so that rocprof timelines can show both CPU and GPU activity. This code is more a proof of concept than a final product; more work is needed to make it production ready. I list the missing pieces below:
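As a rough illustration of the two features described above (a sketch, not the PR's actual code; the buffer, communicator, and stream names are assumptions), the pattern is a reduce-scatter that takes device pointers directly, wrapped in a roctx range so the call shows up on the rocprof timeline:

    #include <hip/hip_runtime.h>
    #include <rccl/rccl.h> // RCCL exposes the NCCL-compatible API; header path varies by ROCm version
    #include <roctx.h>     // roctx ranges are picked up by rocprof

    // Sketch: reduce-scatter over device buffers, annotated for rocprof.
    // d_send, d_recv, recv_count, comm and stream are hypothetical names.
    void device_reduce_scatter(const double *d_send, double *d_recv,
                               size_t recv_count, ncclComm_t comm,
                               hipStream_t stream) {
        roctxRangePush("rccl_reduce_scatter");        // CPU-side range marker
        ncclReduceScatter(d_send, d_recv, recv_count, // device pointers throughout,
                          ncclDouble, ncclSum,        // so no host staging copies
                          comm, stream);
        hipStreamSynchronize(stream);                 // sync so the range brackets the GPU work
        roctxRangePop();
    }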

gsitaram commented 2 years ago

Hi @kabicm, what is the status of my PR? Do you intend to merge it at some point? It works for CP2K at scale, and I would like to see this code formally added to your repo.

kabicm commented 2 years ago

Hi Gina (@gsitaram),

You might have noticed that we made a new PR for this: https://github.com/eth-cscs/COSMA/pull/102, which, in addition, brings the following optimizations:

All these changes required substantial refactoring of the code, which is why it was easier for us to implement them in a new PR.

The only thing left to make this PR work for AMD is to fix the build, i.e., to add the CMake scripts for finding hipBLAS and the other HIP-related libraries. The code itself is already AMD-ready.
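For reference, a minimal sketch of what such CMake glue might look like, assuming the standard CMake config packages shipped with ROCm (the cosma target name and the exact imported targets are assumptions and may vary by ROCm version):

    # Sketch only: locate the HIP runtime and hipBLAS through their
    # CMake config packages and link them into the (assumed) cosma target.
    find_package(hip REQUIRED)       # imports hip::host / hip::device
    find_package(hipblas REQUIRED)   # imports roc::hipblas

    target_link_libraries(cosma PRIVATE hip::host roc::hipblas)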

It would be great to talk one of these days, as we are planning to merge this soon.

Cheers, Marko

alazzaro commented 2 years ago

Hello @kabicm, I'm chiming in here to ask whether the new COSMA will be able to keep the buffers on the GPU to enable G2G (GPU-to-GPU) communication. Currently, I'm hacking the library by replacing this call with the following code:

    // first transfer send_pointer to the GPU
    Scalar *d_send_pointer = NULL, *d_receive_pointer = NULL;
    int nranks;
    MPI_Comm_size(comm, &nranks);

    // stage the host data in device buffers
    hipMalloc((void **)&d_send_pointer, nranks * recvcnts[0] * sizeof(Scalar));
    hipMalloc((void **)&d_receive_pointer, recvcnts[0] * sizeof(Scalar));
    hipMemcpy(d_send_pointer, send_pointer, nranks * recvcnts[0] * sizeof(Scalar), hipMemcpyHostToDevice);

    // run the collective on the GPU pointers
    MPI_Reduce_scatter_block(d_send_pointer,
                             d_receive_pointer,
                             recvcnts[0],
                             mpi_type,
                             MPI_SUM,
                             comm);

    // copy the result back to the host and release the staging buffers
    hipMemcpy(receive_pointer, d_receive_pointer, recvcnts[0] * sizeof(Scalar), hipMemcpyDeviceToHost);
    hipFree(d_send_pointer);
    hipFree(d_receive_pointer);

Basically, I copy the data in and out of the device so that the MPI call runs on GPU buffers (which is the winning solution). Clearly, we could avoid at least one of the copies (and one memory allocation) if you provided the data buffers already allocated on the GPU (assuming that is possible). Is this something the new COSMA will have? Do you think I can start to test the new PR with the HIP backend?
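For illustration, the requested variant would reduce the snippet above to something like this, assuming a GPU-aware MPI and that COSMA handed over buffers already resident on the device (the d_* names stand for those hypothetical device buffers):

    // Sketch: with device-resident buffers provided by COSMA and a GPU-aware
    // MPI, the staging allocations and copies above disappear entirely.
    MPI_Reduce_scatter_block(d_send_pointer,    // already on the GPU (hypothetical)
                             d_receive_pointer, // already on the GPU (hypothetical)
                             recvcnts[0],
                             mpi_type,
                             MPI_SUM,
                             comm);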

kabicm commented 2 years ago

These issues have been resolved in https://github.com/eth-cscs/COSMA/pull/102, so we are closing this PR. Thanks @gsitaram for your contribution!