Mellanox / nv_peer_memory


GPUDirect RDMA sometimes misses writing a word to remote #31

Closed: StrikeW closed this issue 6 years ago

StrikeW commented 6 years ago

Hi, my project uses GPUDirect RDMA to send intermediate computation results from GPU memory to remote CPU memory (via RDMA WRITE). First the data is copied into a staging device buffer (zeroed out beforehand) that is registered as an MR; then the buffer and msg_sz are passed to post_send() to send the data to the remote side.

// prepare RDMA buffer for RDMA-WRITE
char *rdma_buf = gmem->buffer(tid);
GPU_ASSERT( cudaMemcpy(rdma_buf, &data_sz, sizeof(uint64_t), cudaMemcpyHostToDevice) ); // header
rdma_buf += sizeof(uint64_t);
GPU_ASSERT( cudaMemcpy(rdma_buf, data, data_sz, cudaMemcpyDeviceToDevice) );    // data
rdma_buf += roundup(data_sz, sizeof(uint64_t));
GPU_ASSERT( cudaMemcpy(rdma_buf, &data_sz, sizeof(uint64_t), cudaMemcpyHostToDevice) );  // footer

My messaging protocol is the same as FaRM's, which uses a ring buffer to store messages; in my case the ring buffer has only one writer and one reader. The structure of a message is [ header | payload | footer ], and the payload size is encoded in both the header and the footer. The problem is that, on the receiver side, some messages arrive with the footer missing: its value is 0.
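For context, here is a minimal sketch of how the receiver detects a complete message in this scheme (poll_message, ring_buf, read_off and payload_out are illustrative names, not my actual code; roundup() is the same helper used in the sender snippet above):

#include <stdint.h>
#include <string.h>

// Receiver-side polling sketch: a message is complete once header and footer agree on the length.
static uint64_t poll_message(volatile char *ring_buf, size_t read_off, char *payload_out)
{
    volatile uint64_t *header = (volatile uint64_t *)(ring_buf + read_off);
    uint64_t len = *header;
    if (len == 0)
        return 0;                                    // nothing written yet

    size_t padded = roundup(len, sizeof(uint64_t));  // payload is padded to 8-byte alignment, as on the sender
    volatile uint64_t *footer =
        (volatile uint64_t *)(ring_buf + read_off + sizeof(uint64_t) + padded);
    if (*footer != len)
        return 0;                                    // payload not fully delivered yet

    memcpy(payload_out, (const char *)(ring_buf + read_off + sizeof(uint64_t)), len);
    *header = 0;                                     // clear the slot for reuse
    *footer = 0;
    return len;
}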

I resorted to ibdump to capture the RDMA traffic and found that the DMA Length in the RETH is correct, but the footer is indeed missing from the last packet! One thing to note: if I copy the data from GPU memory to host memory and then send it via normal RDMA (without GPUDirect), everything is OK. I have no idea why this happens; can you give me some hints?
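For anyone trying to reproduce this, an ibdump invocation along these lines captures the traffic (the device name and port are placeholders for the actual setup):

ibdump -d mlx5_0 -i 1 -w gpu_rdma.pcap   # dump RDMA traffic on port 1 of HCA mlx5_0 into a pcap file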

Setup:

gpaulsen commented 6 years ago

I'm not associated with this project, I just stumbled upon this question.

I believe the answer is that you need to synchronize the host and the GPU. CUDA includes a number of synchronization calls, depending on what granularity you want to synchronize at. The heaviest is cudaDeviceSynchronize(). Adding it after your cudaMemcpy() calls should ensure that the GPU has completed its transfers before your host-side program continues.
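A minimal sketch of what I mean, using the names from your snippet (post_send() stands in for whatever actually posts the RDMA WRITE):

// ... the three cudaMemcpy() calls that stage header, payload, and footer ...
GPU_ASSERT( cudaDeviceSynchronize() );   // block until all outstanding GPU work, including the copies, has completed
post_send(gmem->buffer(tid), msg_sz);    // hand the buffer to the NIC only after the sync has returned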

StrikeW commented 6 years ago

@gpaulsen I use cudaMemcpy() to do the copies between host and device. According to the NVIDIA docs this API synchronizes across all CUDA streams, but I will try your suggestion.

gpaulsen commented 6 years ago

Yes, please do. I believe the docs are wrong for some corner cases, but I don't understand it well enough to comment more.

StrikeW commented 6 years ago

@gpaulsen cudaDeviceSynchronize() does work! I added it after copying the data into the messaging buffer on the GPU, before sending it via RDMA. Thank you.

gpaulsen commented 6 years ago

Note that a device synchronize is the largest (i.e. slowest) hammer for synchronization. Depending on the application, you might be able to get away with just a stream synchronize or an event, but you have to understand your app's needs. Device synchronize is the safest approach if the performance is acceptable.
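For example, if the staging copies were issued asynchronously on a dedicated stream, something along these lines (an untested sketch reusing the names from the snippet above) would wait only for that stream rather than the whole device:

cudaStream_t copy_stream;                          // stream dedicated to the staging copies
GPU_ASSERT( cudaStreamCreate(&copy_stream) );

// issue the header/payload/footer copies on copy_stream instead of the default stream
// (the host-side sources should be pinned with cudaHostAlloc for the copies to be truly asynchronous)
GPU_ASSERT( cudaMemcpyAsync(rdma_buf, &data_sz, sizeof(uint64_t),
                            cudaMemcpyHostToDevice, copy_stream) );
// ... remaining copies on copy_stream ...

GPU_ASSERT( cudaStreamSynchronize(copy_stream) );  // waits only for copy_stream, not every stream on the device
post_send(gmem->buffer(tid), msg_sz);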