Open fyf2016 opened 1 year ago
The protocol is the same between GPU->GPU and GPU->Proxy, more or less. So there is no special code: the proxy acts as the next GPU, checking head/tail counters in the FIFO, then pushing buffers to the next GPU, which again plays the same protocol.
Suppose I want to communicate between GPU1 and GPU2. I allocate a buffer buff1 on GPU1 containing the data I want to send. Next, according to NCCL's net communication method, I need to transfer the data in buff1 to the shared buffer buff0 (there is a mapping between buff0 and the CPU memory in Host mem). But how does NCCL transfer the data from buff1 to buff0?
That's done through the netSendProgress/netRecvProgress functions calling the network plugin's isend/irecv functions.
When I call the ncclSend function, I provide a block of device memory containing the data to send. To send data to another GPU, NCCL needs to transfer the data to shared memory and send it from there using netSendProgress. So where is the user-provided address translated into the shared-memory address?
The user buffer -> network buffer copy happens in the CUDA kernels (src/collectives/device/prims*.h)
Why is data sent from the buffer to the GPU first in the sendProxyProgress function? Where does the data in this buffer come from?
This step provides the buffer address to the GPU, so that the GPU knows where to write the data. It is only providing a pointer, not the data.
Is the buffer address the address that the GPU maps to the CPU buffer in Host mem? (Assuming the two GPUs use Net mode and communicate through Host mem.)
Yes, that is correct. If we use GPU Direct RDMA, that address is in GPU memory, otherwise it is in CPU memory (allocated through cudaHostAlloc).
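The non-GDR case can be sketched with the CUDA runtime's mapped pinned memory. This is a hedged illustration of the mechanism, not NCCL's allocation code; error handling is omitted for brevity:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
  void *hostBuf, *devPtr;
  size_t size = 1 << 20;

  /* cudaHostAllocMapped makes the pinned host allocation addressable
   * from device code through an aliasing device pointer. */
  cudaHostAlloc(&hostBuf, size, cudaHostAllocMapped);
  cudaHostGetDevicePointer(&devPtr, hostBuf, 0);

  /* Kernels write through devPtr; the proxy thread reads hostBuf and
   * hands the same region to the network for sending. With GPU Direct
   * RDMA, the buffer would instead be a cudaMalloc'd device allocation
   * registered directly with the NIC. */
  printf("host=%p device=%p\n", hostBuf, devPtr);

  cudaFreeHost(hostBuf);
  return 0;
}
```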
OK~ Thanks a lot for helping me to solve my doubts.
Wait a minute, I have one last question: how is the data transferred to the mapped buffer on the GPU? Is it done using the storeLL and readLL functions? Are there other functions involved?
Hello, @sjeaugey. Do sendProxyProgress and recvProxyProgress only cover net mode (GPU-CPU(proxy)-net-CPU(proxy)-GPU), or do they also cover GPU Direct RDMA (GPU-net-GPU)? Thanks a lot!
StoreLL is only used for the LL protocol. For the simple protocol, the reduceCopy functions perform the loads and stores.
As mentioned before, the only difference with GPU Direct RDMA is that the buffer is in GPU memory instead of CPU memory. All the code is the same.
Hello, @sjeaugey. Thanks a lot! I have two more questions:
(1) To my limited understanding, a CUDA kernel moves the data from the user-defined send buffer to the shared buffer between GPUs, then sendProxyProgress moves the data in the shared buffer from the sender to the receiver through the network, and finally a CUDA kernel moves the data from the shared buffer to the receive buffer. Is this procedure correct, and if so, what is recvProxyProgress's call to irecv responsible for? In brief, I am not sure where isend/irecv move data from and to.
(2) sendProxyProgress calls isend when a certain amount of work is posted in the circular FIFO shared buffer, but how does recvProxyProgress decide to call irecv?
Thanks a lot for your kind help!
Hi, I'm reading the NCCL source code, but what confuses me is that I don't know when the GPU data is ready and passed to the proxy. If two GPUs communicate in Net mode by sharing Host mem, what is the transmission process between GPU->Proxy and Proxy->GPU?
My understanding is that the GPU first maps a piece of memory to the CPU memory of Host mem; then, when the GPU has prepared the data, it first writes the data to the shared memory in Host mem, and the data is then transmitted to the other GPU through the network. I don't know how the proxy knows that the GPU has data ready, or where in the source code the data is put into the buffer and sent to the GPU.