NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

simple asynchronous RMA communication primitives in nccl #341

Open nevion opened 4 years ago

nevion commented 4 years ago

Following the premiere of ncclSend/ncclRecv and the discussion of how tricky it currently is to overlap compute and I/O - perhaps there's an 'easy' way out. What if NCCL exposed simple RMA read and write methods? Depending on the implementation, this strategy may not have to be blocking, or resident on the device taking up compute blocks.

What complications are already visible with such methods, for both the local GPU transport and the InfiniBand transport?

Also, stupid question - why does the intra-node p2p case use kernels rather than cudaMemcpy?

sjeaugey commented 4 years ago

We use CUDA kernels for a few reasons.

First, we are running a communication protocol on the GPU (not just copies) to permit a constant pipeline flow. Admittedly, for send/recv (unlike, e.g., rings), we don't need to pipeline much since there is only one step, except when we're sending through the network.

Then, running CUDA kernels lets us communicate in parallel with many different peers using different CUDA blocks (and we could even split blocks in the future to increase our parallelism), whereas a cudaMemcpy operation would only be able to perform one copy to a given peer. So, using CUDA kernels provides many more opportunities for optimization and for communication protocols/techniques.
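
A minimal sketch of that idea (not NCCL's actual kernel; all names are illustrative, and it assumes each peer's receive buffer has already been mapped into this process, e.g. via CUDA IPC): one CUDA block per peer, so copies to several peers proceed in parallel from a single launch.

```
// Illustrative only: one block per peer; threads in the block stride over
// that peer's chunk. peerDst[p] is peer p's mapped receive buffer.
__global__ void multiPeerCopy(float* const* peerDst,   // mapped buffers, one per peer
                              const float* const* src, // local data staged for each peer
                              const size_t* count)     // element count per peer
{
  int peer = blockIdx.x;                    // each block serves one peer
  const float* s = src[peer];
  float* d = peerDst[peer];
  for (size_t i = threadIdx.x; i < count[peer]; i += blockDim.x)
    d[i] = s[i];
}

// launch: multiPeerCopy<<<nPeers, 256, 0, stream>>>(dPeerDst, dSrc, dCount);
```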

Finally, and maybe the main reason, is that NCCL operations are CUDA operations, submitted on a CUDA stream. For intra-node communication, we might want to replace that with a cudaMemcpy on a stream, but then we'd need to have the receive buffer already mapped on the sender side. That means we'd have to wait for the receiver, then use CUDA IPC to map the buffer, which takes a significant amount of time - unless you keep things mapped and manage a cache of mapped remote buffers [note this is what MPI does, and it comes with both advantages and issues]. For inter-node, NIC operations are not CUDA operations, so the NCCL kernel is there to block the CUDA stream until the data is sent (or received) by the NIC.
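
For reference, the intra-node mapping step described above looks roughly like the following with the CUDA IPC runtime API (a sketch across two processes, not NCCL code); cudaIpcOpenMemHandle is the expensive call that would have to be cached.

```
#include <cuda_runtime.h>

// Receiver process: export an IPC handle for its receive buffer (done once).
void exportRecvBuffer(void** recvBuf, cudaIpcMemHandle_t* handle, size_t bytes) {
  cudaMalloc(recvBuf, bytes);
  cudaIpcGetMemHandle(handle, *recvBuf);   // ship 'handle' to the sender out of band
}

// Sender process: map the remote buffer once (expensive), then keep it cached.
void* mapRemoteRecvBuffer(cudaIpcMemHandle_t handle) {
  void* remote = nullptr;
  cudaIpcOpenMemHandle(&remote, handle, cudaIpcMemLazyEnablePeerAccess);
  return remote;
}

// After mapping, a plain stream copy behaves like an RMA write.
void rmaStyleWrite(void* remoteRecvBuf, const void* localSendBuf,
                   size_t bytes, cudaStream_t stream) {
  cudaMemcpyAsync(remoteRecvBuf, localSendBuf, bytes, cudaMemcpyDeviceToDevice, stream);
}
// cudaIpcCloseMemHandle(remoteRecvBuf) when the cached mapping is retired.
```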

Exposing RMA operations would also require a heavy amount of CUDA IPC or NIC registration of GPU buffers to be efficient, plus it would actually be more complex for developers since they would have to manage the synchronization themselves.

All that to say, this would require a very significant amount of work since it would need many things that we don't have in NCCL today and it's not clear how efficient it would be eventually.

nevion commented 4 years ago

For a remote write, doesn't cudaMemcpy already take care of mapping the receive end? I already do intra-node communication like this for halo exchanges and it works without issue. It is arguably the simplest approach, and at least in the simple halo exchange case, the optimization opportunities of kernels seem less available. I suppose an in-place gather would be a potential optimization, though. FWIW, I exchange "receive" streams that other domains may use, and the RMA "write" targets those streams. There is some upfront and post stream synchronization to make the stream side work.
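
A hedged sketch of the intra-node halo "write" described above (names and the event choreography are illustrative, not taken from any existing library): the sender enqueues a peer copy on the receiver's stream, bracketed by events so the receiver knows when its halo region is valid.

```
#include <cuda_runtime.h>

// Sender side of one halo exchange. Assumes peer access is enabled between
// srcDev and dstDev, and that the events were created by the caller.
void haloWrite(int dstDev, void* dstHalo,          // neighbor's device and halo buffer
               int srcDev, const void* srcHalo,    // local device and boundary data
               size_t bytes,
               cudaStream_t recvStream,            // stream owned by the receiving domain
               cudaEvent_t srcReady,               // recorded when srcHalo is computed
               cudaEvent_t haloDone)               // receiver waits on this before reading
{
  cudaStreamWaitEvent(recvStream, srcReady, 0);    // don't copy before the boundary is ready
  cudaMemcpyPeerAsync(dstHalo, dstDev, srcHalo, srcDev, bytes, recvStream);
  cudaEventRecord(haloDone, recvStream);           // receiver's compute stream waits on this
}
```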

On the address-cache side - in the interest of simplicity, by always passing the peer as an argument you could avoid the address encode/decode and cache step entirely. I like the explicitness of this over something more complex. In the InfiniBand case, this would let you pass the rkey around, right? I think the con is that the address of a remote buffer would depend on the transport, so you'd need a function to query what address a memory region has for a given peer. Perhaps this is all half broken, though.

In keeping with verbs memory-region registration, I would expect buffers to be registered with NCCL before they can be used for RMA.
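
Purely hypothetical sketch of what such a registration-plus-RMA surface could look like - none of these ncclRma* symbols exist in NCCL; only ncclComm_t, ncclResult_t, and cudaStream_t are real types. It is shown only to make the proposed flow concrete.

```
#include <nccl.h>
#include <cuda_runtime.h>

// Hypothetical opaque handle describing a registered RMA-capable buffer.
typedef struct ncclRmaHandle* ncclRmaHandle_t;

// Register a local GPU buffer so peers may target it (analogous to ibv_reg_mr).
ncclResult_t ncclRmaRegister(ncclComm_t comm, void* buf, size_t bytes,
                             ncclRmaHandle_t* handle);

// One-sided write into a peer's registered buffer; completion tracked on 'stream'.
ncclResult_t ncclRmaWrite(const void* src, size_t bytes,
                          int peer, ncclRmaHandle_t remoteHandle, size_t remoteOffset,
                          ncclComm_t comm, cudaStream_t stream);

// Deregister when the buffer will no longer be an RMA target.
ncclResult_t ncclRmaDeregister(ncclComm_t comm, ncclRmaHandle_t handle);
```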

Btw, consider InfiniBand bandwidth (12 GB/s, shared, on DGX-1 style cliques) as well as NVLink at < 40 GB/s - time not utilized doesn't come back. Consider further that it is very likely that some computations take longer than others based on data dependency and partitioning. The difference is larger on adaptive grids - tight synchronization of processes is not possible, and it becomes a battle between communication time and synchronization over which one sets the scalability limit.

What this means for halo exchange, where you have many neighbors, is that you want to overlap those transfers as much as possible so that nothing is competing for shared resources when the last peer is ready. The spread of when values become ready ranges from single-digit tens of milliseconds to the low hundreds, and the values exchanged can also be hundreds of megabytes per peer.

I consider it very important that NCCL gets a plan for dealing with staggered asynchronous communications while allowing overlap with compute - something RMA-like seems to be a good way to achieve that, possibly the best. Jumping out of the box for a moment: it seems strange that behaviors given by cudaMemcpy and IBV_WR_RDMA_WRITE/IBV_WR_RDMA_READ are difficult to expose with the sophisticated capabilities of NCCL.
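
For reference, the InfiniBand primitive being referred to is roughly the following plain libibverbs sketch. It assumes the queue pair is already connected, the local buffer was registered with ibv_reg_mr, and the peer's buffer address and rkey were exchanged out of band; the function name is illustrative.

```
#include <infiniband/verbs.h>
#include <stdint.h>

// Post a one-sided RDMA write into a peer's registered buffer.
int rdmaWrite(struct ibv_qp* qp, struct ibv_mr* mr,
              void* localBuf, size_t bytes,
              uint64_t remoteAddr, uint32_t remoteRkey)
{
  struct ibv_sge sge = {
    .addr   = (uintptr_t)localBuf,
    .length = (uint32_t)bytes,
    .lkey   = mr->lkey,
  };
  struct ibv_send_wr wr = {0}, *bad = NULL;
  wr.opcode              = IBV_WR_RDMA_WRITE;   // one-sided: no receiver involvement
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.send_flags          = IBV_SEND_SIGNALED;   // generate a completion to poll for
  wr.wr.rdma.remote_addr = remoteAddr;
  wr.wr.rdma.rkey        = remoteRkey;
  return ibv_post_send(qp, &wr, &bad);          // completion observed via ibv_poll_cq
}
```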

eric-haibin-lin commented 3 years ago

> Exposing RMA operations would also require a heavy amount of CUDA IPC or NIC registration of GPU buffers to be efficient, plus it would actually be more complex for developers since they would have to manage the synchronization themselves.

@sjeaugey this might not be relevant to the main thread, but does GDR itself work with GPU memory handles acquired via cudaIpcOpenMemHandle from another process?