alpaka-group / cupla

C++ User interface for the Platform independent Library Alpaka :arrows_clockwise:

Problem with asynchronous peer-to-peer copy #178

Open · ivandrodri opened this issue 4 years ago

ivandrodri commented 4 years ago

Hi, I would like to convert my CUDA code to cupla, and I have some issues with multi-GPU/single-CPU asynchronous P2P copies. This is the error:

error: argument of type "cuplaStream_t" is incompatible with parameter of type "cudaStream_t"

This is a typical P2P copy in my code:

    const int numStreams = 10;
    cudaStream_t streams[numStreams];

    cuplaSetDevice(idGpuI);
    for (size_t i = 0; i < numStreams; ++i)
        cudaStreamCreate(&streams[i]);

    for (size_t loc1 = 0; loc1 < grid1Size * grid1Size * grid1Size; ++loc1)
    {
        cudaMemcpyPeerAsync(&(grid0GpuJ[loc1].grid0Size), idGpuJ,
                            &(grid0GpuI[loc1].grid0Size), idGpuI,
                            sizeof(int), streams[loc1 % numStreams]);
    }

So how do I write this in cupla?

To enable P2P access, this is what I do in CUDA:

#include <cstdio>
#include <string>

inline void enableP2P(int ngpus, std::string info) {
    for (int i = 0; i < ngpus; i++) {
        cuplaSetDevice(i);
        for (int j = 0; j < ngpus; j++) {
            if (i == j) continue;
            int peer_access_available = 0;
            cudaDeviceCanAccessPeer(&peer_access_available, i, j);
            if (peer_access_available) {
                cudaDeviceEnablePeerAccess(j, 0);
                if (info == "info")
                    printf("> GPU%d enabled direct access to GPU%d\n", i, j);
            } else {
                if (info == "info")
                    printf("(%d, %d)\n", i, j);
            }
        }
    }
}

It seems that in cupla, cudaDeviceEnablePeerAccess is handled automatically and cudaDeviceCanAccessPeer has no equivalent, so I think the enableP2P function is no longer necessary, right?

Thanks for any help!

[edited by psychocoderHPC: fixed formatting]

sbastrakov commented 4 years ago

Hello @ivandrodri. This is a good question. I have personally never tried this, but here are my thoughts after looking at the code just now. I am sure @psychocoderHPC knows this, but he is currently on vacation.

I believe alpaka, and therefore cupla, attach every memory buffer to a device. In cupla, when you create a buffer with cuplaMalloc, it is attached to the device that is active at that moment (which can be set via cuplaSetDevice).
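For example, something like this should give you one buffer per device (a minimal, untested sketch; idGpuI, idGpuJ and n are placeholder names):

    // Each cuplaMalloc attaches the new buffer to the currently active device.
    int* bufGpuI = nullptr;
    int* bufGpuJ = nullptr;

    cuplaSetDevice(idGpuI);                          // make GPU I active
    cuplaMalloc((void**)&bufGpuI, n * sizeof(int));  // buffer lives on GPU I

    cuplaSetDevice(idGpuJ);                          // make GPU J active
    cuplaMalloc((void**)&bufGpuJ, n * sizeof(int));  // buffer lives on GPU J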

Then the copy functions like cuplaMemcpyAsync just take two buffers and internally check whether they belong to the same device, taking care of either case. For the CUDA backend, this function should just call either cudaMemcpyAsync or cudaMemcpyPeerAsync. I think you are right about cudaDeviceEnablePeerAccess being automatic; it should not be needed.

So in your case, I think just creating buffers while the right devices are active, and then using the universal copy functions should work.
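Applied to your snippet, it could look roughly like this (an untested sketch reusing your variable names; I assume grid0GpuI and grid0GpuJ were allocated while GPU I and GPU J, respectively, were active; I have also not verified how cupla resolves offset pointers into a buffer, so take the element-wise copy with a grain of salt):

    const int numStreams = 10;
    cuplaStream_t streams[numStreams];

    cuplaSetDevice(idGpuI);
    for (size_t i = 0; i < numStreams; ++i)
        cuplaStreamCreate(&streams[i]);

    for (size_t loc1 = 0; loc1 < grid1Size * grid1Size * grid1Size; ++loc1)
    {
        // No dedicated peer-copy call: the generic async copy should detect
        // that source and destination live on different devices.
        cuplaMemcpyAsync(&(grid0GpuJ[loc1].grid0Size),
                         &(grid0GpuI[loc1].grid0Size),
                         sizeof(int),
                         cuplaMemcpyDeviceToDevice,
                         streams[loc1 % numStreams]);
    }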

Regarding the streams and the asynchronous part (which is orthogonal to peer-to-peer), it mirrors CUDA: create streams via cuplaStreamCreate, store them as type cuplaStream_t, and pass such a variable as the last parameter to cuplaMemcpyAsync.
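The stream lifecycle is the same as in CUDA, just with the cupla prefix (sketch):

    cuplaStream_t stream;            // cupla analogue of cudaStream_t
    cuplaStreamCreate(&stream);      // created for the currently active device

    // ... enqueue work, e.g. cuplaMemcpyAsync(..., stream) ...

    cuplaStreamSynchronize(stream);  // wait for the enqueued copies to finish
    cuplaStreamDestroy(stream);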

ivandrodri commented 4 years ago

Hello @sbastrakov, thanks a lot for the help! I'll try it and let you know.

psychocoderHPC commented 3 years ago

@ivandrodri Sorry for the late response.

Did you solve this issue already? I have never tried peer memory copies myself, but alpaka should do the job transparently for you. A simple cuplaMemcpyAsync should be enough.
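Something like this (sketch with placeholder pointer names):

    // src and dst were allocated with cuplaMalloc on different devices;
    // cupla should dispatch to a peer copy internally.
    cuplaMemcpyAsync(dst, src, numBytes, cuplaMemcpyDeviceToDevice, stream);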