NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Data transfer between GPU and Proxy #852

Open · fyf2016 opened this issue 1 year ago

fyf2016 commented 1 year ago

Hi, I'm reading the NCCL source code, but what confuses me is that I don't know when the GPU data is ready and handed to the proxy. If two GPUs communicate in Net mode by sharing host memory, what is the transmission process between GPU->Proxy and Proxy->GPU?

My understanding is that the GPU first maps a piece of host memory, and when its data is ready, it writes the data into that shared host-memory buffer, which is then transmitted to the other GPU over the network. What I don't understand is how the proxy knows that the GPU has data ready, and where in the source code the data is put into the buffer and sent on to the GPU.

sjeaugey commented 1 year ago

The protocol is more or less the same between GPU->GPU and GPU->Proxy. So there is no special code: the proxy acts as the next GPU, checking the head/tail counters in the FIFO, then pushing buffers to the next GPU, again following the same protocol.
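
To make that head/tail handshake concrete, here is a minimal sketch, not NCCL's actual code; the names (SharedFifo, kSlots, progressProduce/progressConsume) are illustrative. The producer, whether a GPU kernel or the proxy acting as "the next GPU", publishes a slot by advancing a tail counter, and the consumer releases it by advancing a head counter.

```cuda
// Illustrative sketch of the head/tail FIFO handshake (not NCCL source).
// The producer publishes a slot by advancing `tail`; the consumer releases
// it by advancing `head`. All names here are hypothetical.
#include <cstdint>

constexpr int kSlots = 8;            // number of in-flight steps in the FIFO

struct SharedFifo {
  volatile uint64_t head;            // advanced by the consumer
  volatile uint64_t tail;            // advanced by the producer
  char* slots[kSlots];               // per-slot staging buffers
  int   sizes[kSlots];               // bytes ready in each slot
};

// Consumer-side progress step: drain any slot the producer has published.
bool progressConsume(SharedFifo* fifo) {
  if (fifo->head == fifo->tail) return false;   // nothing published yet
  int slot = fifo->head % kSlots;
  // ... hand fifo->slots[slot] (fifo->sizes[slot] bytes) to the next hop ...
  fifo->head = fifo->head + 1;                  // release the slot
  return true;
}

// Producer-side progress step: publish the next slot if there is room.
bool progressProduce(SharedFifo* fifo, int bytesReady) {
  if (fifo->tail - fifo->head >= kSlots) return false;  // FIFO full, wait
  int slot = fifo->tail % kSlots;
  fifo->sizes[slot] = bytesReady;
  // ... the data for this step was already written into fifo->slots[slot] ...
  fifo->tail = fifo->tail + 1;                  // make the slot visible
  return true;
}
```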

fyf2016 commented 1 year ago

Suppose I want to communicate between GPU1 and GPU2. I allocate a buffer buff1 on GPU1, which contains the data I want to send. Next, according to NCCL's net communication path, the data in buff1 needs to be transferred to the shared buffer buff0 (which is mapped to CPU memory in host memory). But how does NCCL transfer the data from buff1 to buff0?

sjeaugey commented 1 year ago

That's done through the netSendProgress/netRecvProgress functions, which call the network plugin's isend/irecv functions.
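
For a rough picture of what such a progress function does, here is a hypothetical sketch, not the real code in NCCL's net transport; netIsend/netTest below are stubs standing in for the network plugin's isend/test calls (the real signatures differ). The send-side proxy posts a network send for each slot the GPU has published, then releases the slot back to the GPU once the send completes.

```cuda
// Hypothetical sketch of a send-side proxy progress step (not NCCL's actual
// progress code). netIsend/netTest are stub stand-ins for the network
// plugin's isend/test entry points.
#include <cstdint>

constexpr int kSlots = 8;

struct SendFifo {
  volatile uint64_t head;        // slots fully sent and released back to the GPU
  volatile uint64_t tail;        // slots the GPU has filled and published
  char* slots[kSlots];
  int   sizes[kSlots];
};

// Stubs; real plugin calls are asynchronous and have different signatures.
static bool netIsend(void*, int, void** request) { *request = (void*)1; return true; }
static bool netTest(void*, bool* done) { *done = true; return true; }

struct SendState {
  SendFifo* fifo;
  void* inflight[kSlots];        // outstanding network requests, one per slot
  uint64_t posted = 0;           // slots handed to the network so far
};

void sendProgressStep(SendState* s) {
  SendFifo* f = s->fifo;
  // Post a send for the next slot the GPU has published but we haven't sent yet.
  if (s->posted < f->tail) {
    int slot = s->posted % kSlots;
    if (netIsend(f->slots[slot], f->sizes[slot], &s->inflight[slot])) s->posted++;
  }
  // Poll the oldest outstanding send; once it completes, bump head so the GPU
  // can reuse that slot.
  if (f->head < s->posted) {
    int slot = f->head % kSlots;
    bool done = false;
    if (netTest(s->inflight[slot], &done) && done) f->head = f->head + 1;
  }
}
```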

jiangxiaobin96 commented 1 year ago

When I use the ncclSend function, I define a block of device memory that contains the data to send. If I want to send the data to another GPU, NCCL needs to transfer it to shared memory and then send it from shared memory using netSendProgress. So where does the copy from the user-defined address to the shared-memory address happen?

sjeaugey commented 1 year ago

The user buffer -> network buffer copy happens in the CUDA kernels (src/collectives/device/prims*.h)
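
As a greatly simplified, hypothetical sketch of the send side of that copy (the real primitives in prims*.h are templated over protocols and also handle reductions), a kernel copies one step's chunk from the user buffer into the connection's staging buffer, which may be host-mapped, and then publishes the step by bumping the tail counter the proxy polls:

```cuda
// Greatly simplified, hypothetical sketch of the send-side copy the NCCL
// device primitives perform: copy one chunk from the user buffer into the
// connection's staging buffer, then publish the step by advancing the tail
// counter that the proxy thread polls.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void sendChunkSketch(const float* userBuf,     // user send buffer (device memory)
                                float* connBuf,            // staging buffer (host-mapped or GPU memory)
                                size_t count,              // elements in this chunk
                                volatile uint64_t* tail) { // step counter the proxy polls
  // Single-block sketch so one __syncthreads() is enough before publishing.
  for (size_t i = threadIdx.x; i < count; i += blockDim.x)
    connBuf[i] = userBuf[i];
  __threadfence_system();     // make each thread's stores visible to the CPU/NIC
  __syncthreads();            // wait until every thread has copied and fenced
  if (threadIdx.x == 0)
    *tail = *tail + 1;        // publish this step; the proxy can now post a send
}
```

On the receive side the same kind of copy runs in the other direction, from the staging buffer into the user's receive buffer, once the proxy has signalled that the data has arrived.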

fyf2016 commented 1 year ago

[screenshot of the sendProxyProgress source code]

Why does the sendProxyProgress function first send data from buff to the GPU? Where does the data in this buff come from?

sjeaugey commented 1 year ago

This step provides the buffer address to the GPU, so that the GPU knows where to write its data. It only provides a pointer, not the data.
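
In other words, what is exchanged at this point is connection metadata rather than payload. A hypothetical sketch of the kind of information handed to the GPU:

```cuda
// Hypothetical sketch of the connection info handed to the GPU: a pointer to
// the staging buffer plus the head/tail counters. Only these pointers are
// exchanged up front; the payload itself moves later, step by step.
#include <cstdint>

struct ConnInfoSketch {
  char* buff;                  // where the GPU should write (host-mapped or GPU memory)
  volatile uint64_t* head;     // advanced by the proxy when a slot has been sent
  volatile uint64_t* tail;     // advanced by the GPU when a slot has been filled
};
```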

fyf2016 commented 1 year ago

Is the buffer address the address that the GPU maps to the CPU buffer in host memory? (Assuming the two GPUs use Net mode and communicate through host memory.)

sjeaugey commented 1 year ago

Yes, that is correct. If we use GPU Direct RDMA, that address is in GPU memory, otherwise it is in CPU memory (allocated through cudaHostAlloc).
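
A small illustrative sketch of the two placements, using standard CUDA runtime calls (the buffer size and names here are arbitrary):

```cuda
// Sketch of the two staging-buffer placements described above (illustrative only).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  size_t bytes = 1 << 22;
  void* hostPtr = nullptr;
  void* devView = nullptr;

  // Without GPU Direct RDMA: the staging buffer lives in pinned CPU memory,
  // and the GPU gets a device-side view of it to write into.
  cudaHostAlloc(&hostPtr, bytes, cudaHostAllocMapped);
  cudaHostGetDevicePointer(&devView, hostPtr, 0);
  printf("host-mapped staging buffer: host %p, device view %p\n", hostPtr, devView);

  // With GPU Direct RDMA: the staging buffer is plain GPU memory; the NIC
  // reads/writes it directly and no host copy is involved.
  void* gpuPtr = nullptr;
  cudaMalloc(&gpuPtr, bytes);
  printf("GPU-resident staging buffer: %p\n", gpuPtr);

  cudaFree(gpuPtr);
  cudaFreeHost(hostPtr);
  return 0;
}
```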

fyf2016 commented 1 year ago

OK~ Thanks a lot for helping me resolve my doubts.

fyf2016 commented 1 year ago

Wait a minute, I have one last question: how is the data transferred to the mapped buffer on the GPU? Is it using the StoreLL and readLL commands? Are there other commands involved?

ZhiyiHu1999 commented 2 weeks ago

> Yes, that is correct. If we use GPU Direct RDMA, that address is in GPU memory, otherwise it is in CPU memory (allocated through cudaHostAlloc).

Hello, @sjeaugey. Do sendProxyProgress and recvProxyProgress cover only net mode through host memory (GPU-CPU(proxy)-net-CPU(proxy)-GPU), or do they also cover GPU Direct RDMA (GPU-net-GPU)? Thanks a lot!

sjeaugey commented 2 weeks ago

> Wait a minute, I have one last question: how is the data transferred to the mapped buffer on the GPU? Is it using the StoreLL and readLL commands? Are there other commands involved?

StoreLL is only used for the LL protocol. For the simple protocol, the reduceCopy functions perform loads and stores.

> Do sendProxyProgress and recvProxyProgress cover only net mode through host memory (GPU-CPU(proxy)-net-CPU(proxy)-GPU), or do they also cover GPU Direct RDMA (GPU-net-GPU)?

As mentioned before, the only difference with GPU Direct RDMA is that the buffer is in GPU memory instead of CPU memory. All the code is the same.
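
To make the StoreLL answer above a bit more concrete: an illustrative sketch of the LL ("low latency") idea, not NCCL's actual storeLL/readLL. Each 8-byte line carries 4 bytes of data next to a 4-byte flag, so the reader can tell the data is valid by polling the flag in the line itself rather than a separate counter.

```cuda
// Illustrative sketch of the LL-protocol idea (not NCCL's storeLL/readLL):
// data and flag travel in one 8-byte store, so the receiver knows the data
// is valid as soon as it observes the expected flag value.
#include <cstdint>

union LLLineSketch {
  uint64_t v;
  struct { uint32_t data; uint32_t flag; } s;
};

__device__ void storeLLSketch(volatile uint64_t* dst, uint32_t data, uint32_t flag) {
  LLLineSketch line;
  line.s.data = data;
  line.s.flag = flag;
  *dst = line.v;               // one 8-byte store: data and flag land together
}

__device__ uint32_t readLLSketch(volatile uint64_t* src, uint32_t flag) {
  LLLineSketch line;
  do {
    line.v = *src;             // poll until the expected flag shows up
  } while (line.s.flag != flag);
  return line.s.data;
}
```

In practice the flag encodes something like the current step number, so stale data from an earlier round is never mistaken for fresh data.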

ZhiyiHu1999 commented 5 days ago

Hello, @sjeaugey. Thanks a lot! I have two more questions: (1) To my limited understanding, the CUDA kernel is responsible for moving the data from the user-defined send buffer to the shared buffer, then sendProxyProgress is responsible for moving the data in the shared buffer from the sender to the receiver through the network, and finally the CUDA kernel on the receiver moves the data from the shared buffer to the receive buffer. Is this procedure correct, and if so, what is recvProxyProgress's call to irecv responsible for? In brief, I am not sure where isend/irecv move data from and to.

(2) sendProxyProgress calls isend when a certain amount of work has accumulated in the circular FIFO shared buffer, but how does recvProxyProgress determine when to call irecv?

Thanks a lot for your kind help!