eth-cscs / COSMA

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
BSD 3-Clause "New" or "Revised" License

Add RCCL and OFFLOAD profiling support for AMD GPUs #101

Closed gsitaram closed 2 years ago

gsitaram commented 2 years ago

This PR adds support for rcclReduceScatter using device buffers, in order to improve the performance of this collective operation across processes on multiple nodes. In addition, support for roctx ranges has been added so that rocprof timelines can show both CPU and GPU activity. This code is more a proof of concept than a final product; more work is needed to make it production ready. I list the missing pieces below:
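As a rough illustration of the two features described above (a sketch, not the PR's actual code; the buffer, communicator, and stream names are assumptions), the pattern is a reduce-scatter that takes device pointers directly, wrapped in a roctx range so the call shows up on the rocprof timeline:

    #include <hip/hip_runtime.h>
    #include <rccl/rccl.h> // RCCL exposes the NCCL-compatible API; header path varies by ROCm version
    #include <roctx.h>     // roctx ranges are picked up by rocprof

    // Sketch: reduce-scatter over device buffers, annotated for rocprof.
    // d_send, d_recv, recv_count, comm and stream are hypothetical names.
    void device_reduce_scatter(const double *d_send, double *d_recv,
                               size_t recv_count, ncclComm_t comm,
                               hipStream_t stream) {
        roctxRangePush("rccl_reduce_scatter");        // CPU-side range marker
        ncclReduceScatter(d_send, d_recv, recv_count, // device pointers throughout,
                          ncclDouble, ncclSum,        // so no host staging copies
                          comm, stream);
        hipStreamSynchronize(stream);                 // sync so the range brackets the GPU work
        roctxRangePop();
    }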

gsitaram commented 2 years ago

Hi @kabicm, what is the status of my PR? Do you intend to merge it at some point? It works for CP2K at scale, and I would like to see this code formally added to your repo.

kabicm commented 2 years ago

Hi Gina (@gsitaram),

You might have noticed that we made a new PR for this: https://github.com/eth-cscs/COSMA/pull/102, which, in addition, brings the following optimizations:

All these changes required substantial refactoring of the code, which is why it was easier for us to implement them in a new PR.

The only thing left to make this PR work for AMD is to fix the build, i.e., to add the CMake scripts for finding hipBLAS and the other HIP-related libraries. The code itself is already AMD-ready.
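For reference, a minimal sketch of what such CMake glue might look like, assuming the standard CMake config packages shipped with ROCm (the cosma target name and the exact imported targets are assumptions and may vary by ROCm version):

    # Sketch only: locate the HIP runtime and hipBLAS through their
    # CMake config packages and link them into the (assumed) cosma target.
    find_package(hip REQUIRED)       # imports hip::host / hip::device
    find_package(hipblas REQUIRED)   # imports roc::hipblas

    target_link_libraries(cosma PRIVATE hip::host roc::hipblas)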

It would be great to talk one of these days, as we are planning to merge this soon.

Cheers, Marko

alazzaro commented 2 years ago

Hello @kabicm, I'm chiming in here to ask whether the new COSMA will be able to keep the buffers on the GPU to enable G2G (GPU-to-GPU) communication. Currently, I'm hacking the library by replacing this call with the following code:

    // first transfer send_pointer to the GPU
    Scalar *d_send_pointer = NULL, *d_receive_pointer = NULL;
    int nranks;
    MPI_Comm_size(comm, &nranks);

    // stage the host data in device buffers
    hipMalloc((void **)&d_send_pointer, nranks * recvcnts[0] * sizeof(Scalar));
    hipMalloc((void **)&d_receive_pointer, recvcnts[0] * sizeof(Scalar));
    hipMemcpy(d_send_pointer, send_pointer, nranks * recvcnts[0] * sizeof(Scalar), hipMemcpyHostToDevice);

    // run the collective on the GPU pointers
    MPI_Reduce_scatter_block(d_send_pointer,
                             d_receive_pointer,
                             recvcnts[0],
                             mpi_type,
                             MPI_SUM,
                             comm);

    // copy the result back to the host and release the staging buffers
    hipMemcpy(receive_pointer, d_receive_pointer, recvcnts[0] * sizeof(Scalar), hipMemcpyDeviceToHost);
    hipFree(d_send_pointer);
    hipFree(d_receive_pointer);

Basically, I copy the data in and out of the device so that the MPI call runs on GPU buffers (which is the winning solution). Clearly, we could avoid at least one of the copies (and one memory allocation) if you provided the data buffers already allocated on the GPU (assuming that is possible). Is this something the new COSMA will have? Do you think I can start to test the new PR with the HIP backend?
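For illustration, the requested variant would reduce the snippet above to something like this, assuming a GPU-aware MPI and that COSMA handed over buffers already resident on the device (the d_* names stand for those hypothetical device buffers):

    // Sketch: with device-resident buffers provided by COSMA and a GPU-aware
    // MPI, the staging allocations and copies above disappear entirely.
    MPI_Reduce_scatter_block(d_send_pointer,    // already on the GPU (hypothetical)
                             d_receive_pointer, // already on the GPU (hypothetical)
                             recvcnts[0],
                             mpi_type,
                             MPI_SUM,
                             comm);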

kabicm commented 2 years ago

These issues have been resolved in https://github.com/eth-cscs/COSMA/pull/102, so we are closing this PR. Thanks @gsitaram for your contribution!