Joeyzhouqihui opened this issue 2 years ago
There is a bit of overlap, but NVSHMEM is more like a lower-level API, providing put/get (load/store, one-sided) semantics at the CUDA level, while NCCL provides two-sided, bulk operations, launched at the CPU level.
It's easier to use NCCL since you can just add NCCL operations as part of your CUDA stream flow, and synchronization is taken care of by the two-sided semantics.
Using NVSHMEM is trickier as it happens inside your CUDA code and you have to write the synchronization code yourself (GPU A writes data to GPU B, sync, GPU B reads the data).
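For illustration, here is a minimal sketch of the two styles, assuming the NCCL communicator, CUDA stream, and NVSHMEM symmetric buffers are set up elsewhere; the buffer and function names are placeholders, not code from either library's examples.

```cuda
#include <cuda_runtime.h>
#include <nccl.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// NCCL style: a two-sided, bulk collective enqueued on a CUDA stream from the host.
// Synchronization between ranks is handled by the collective itself.
void nccl_allreduce_example(const float *sendbuf, float *recvbuf, size_t count,
                            ncclComm_t comm, cudaStream_t stream) {
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);  // completes like any other stream work
}

// NVSHMEM style: a one-sided put issued from inside a kernel.
// The sender orders and signals explicitly; the receiver waits on the signal.
__global__ void nvshmem_put_example(float *dst, const float *src, size_t n,
                                    uint64_t *flag, int peer) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        nvshmem_float_put(dst, src, n, peer);                   // write into the peer's symmetric buffer
        nvshmem_fence();                                        // order the put before the signal
        nvshmemx_signal_op(flag, 1, NVSHMEM_SIGNAL_SET, peer);  // tell the peer the data is ready
    }
}
// On the receiving PE, a kernel would call
// nvshmem_signal_wait_until(flag, NVSHMEM_CMP_EQ, 1) before reading dst.
```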
Thank you for your reply!
After reading the NVSHMEM documentation, I notice that its API seems to require direct communication between each pair of GPUs. Does that mean that when two GPUs are connected by a 2-hop NVLink path (gpu1 and gpu3), NCCL can still utilize the NVLink while NVSHMEM cannot?
gpu1 <--nvlink--> gpu2 <--nvlink--> gpu3
NVSHMEM, just like NCCL, does magic to permit GPUs to communicate even when CUDA P2P memory sharing isn't available. So if NVLink isn't available, it will find something else, like sysmem. The API will always work. Whether or not it will perform as well as (or better than) NCCL is something I cannot comment on.
@sjeaugey NVSHMEM can now leverage GPUDirect-KI; is it also enabled by NCCL?
> After reading the NVSHMEM documentation, I notice that its API seems to require direct communication between each pair of GPUs. Does that mean that when two GPUs are connected by a 2-hop NVLink path (gpu1 and gpu3), NCCL can still utilize the NVLink while NVSHMEM cannot?
> gpu1 <--nvlink--> gpu2 <--nvlink--> gpu3
My experience from explicitly experimenting with this scenario on a DGX V100 is that GPU 1 and GPU 3 would communicate via the NIC, given that they have a P2P relationship that way, which aligns with NVSHMEM's documentation. If NVSHMEM does not detect a remotely accessible connection, say the machine has not configured GPUDirect RDMA or does not have a NIC, then GPU 1 and GPU 3 would be unable to communicate by default, modulo any software-defined routing like the example below.
The flexibility, the beauty of it really, of NVSHMEM is that you could write code to, for example, do GPU-based forwarding in that scenario, so that whenever GPU 1 wants to talk to GPU 3, the traffic goes over the 2-hop NVLink connection rather than the NIC. Of course, NCCL already does this heavy lifting, I believe, through its ring algorithms; in the DGX topology, for example, there are actually three parallel rings encompassing all 8 GPUs.
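A minimal sketch of what such GPU-based forwarding could look like with NVSHMEM's device API, under assumptions not spelled out in this thread: PEs 0, 1 and 2 map to gpu1, gpu2 and gpu3; `buf` and `ready` are symmetric objects allocated with `nvshmem_malloc`; and the kernel is launched on all three PEs. The names and single-thread control flow are illustrative only.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Relay a payload PE0 -> PE1 -> PE2 with one-sided puts, so the PE0-to-PE2
// traffic rides two NVLink hops instead of falling back to the NIC.
__global__ void forward_two_hops(float *buf, uint64_t *ready, size_t n) {
    if (blockIdx.x != 0 || threadIdx.x != 0) return;
    int me = nvshmem_my_pe();

    if (me == 0) {
        // Hop 1: push the payload to the intermediate PE and signal it.
        nvshmem_float_put(buf, buf, n, 1);
        nvshmem_fence();
        nvshmemx_signal_op(ready, 1, NVSHMEM_SIGNAL_SET, 1);
    } else if (me == 1) {
        // Wait for hop 1, then relay the same payload to the final PE.
        nvshmem_signal_wait_until(ready, NVSHMEM_CMP_EQ, 1);
        nvshmem_float_put(buf, buf, n, 2);
        nvshmem_fence();
        nvshmemx_signal_op(ready, 1, NVSHMEM_SIGNAL_SET, 2);
    } else if (me == 2) {
        // The destination PE just waits until the relayed data has landed.
        nvshmem_signal_wait_until(ready, NVSHMEM_CMP_EQ, 1);
    }
}
```

NCCL's ring algorithms give you this kind of multi-hop routing transparently, which is a big part of why it is the easier choice whenever a collective matches your communication pattern.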
I am a heavy user of NCCL, but I recently became aware of a new toolkit named NVSHMEM, which allows different GPU devices to communicate directly with each other using one-sided, RDMA-like verbs. I am wondering whether the functionality of the two tools overlaps. Could you please give me some guidance on when I should use NCCL instead of NVSHMEM? In other words, under what scenarios will NCCL outperform NVSHMEM?