NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Does NCCL support inter-node communication through NVSwitch and NVLink? #1321

Open shanleo1986 opened 1 week ago

shanleo1986 commented 1 week ago

Hi Dear developer,

Does NCCL support inter-node communication through NVSwitch 3.0 and NVLink 4.0? Or NVSwitch 4.0 + NVLink 5.0? If NCCL supports it, can you kindly point out the differences between inter-node communication through IB and inter-node communication through NVSwitch+NVLink? For example, is there a NET transport? How does a GPU on one node create a P2P transport (or some other kind of transport) with a GPU on another node?

Thank you.

shanleo1986 commented 1 week ago

If two hosts with 8 GPUs each are connected with NVLinks, is there some difference in the topology search compared with two hosts connected with IB? I think the two hosts will still search their local topology and generate the local channels, then connect with each other. But now they are not connected through an IB NET card, so there is no NET in the channel, right? Do the two hosts connect with the P2P transport in this case? @AddyLaddy, can you give me any hint?

Thank you.

shanleo1986 commented 1 week ago

After reading the issues https://github.com/NVIDIA/nccl/issues/1159 and https://github.com/NVIDIA/nccl/issues/1286, I think I can draw a conclusion. With MNNVL support, NCCL still searches the local topology following the previous implementation. Since there is no IB NET, each node creates its local channels, then checks whether MNNVL is supported between the nodes; if it is, NCCL sets up the P2P transport between the nodes.

Please correct me if there is any mistake or something missed.

Thanks.

kiskra-nvidia commented 1 week ago

Just to clarify:

You are correct that topology discovery in NCCL is node-local. However, for MNNVL support we added what we call "topo fusion", where nodes belonging to the same MNNVL clique exchange XML topology data with each other (using allgather) and merge those individual node topologies into a clique-level topology (a clique, BTW, is what we call a group of nodes interconnected using NVLinks). After the merge, to the rest of the NCCL code an MNNVL clique looks like a single node with a lot of GPUs.
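For intuition, the exchange pattern behind topo fusion can be sketched as below. This is only a simplified illustration, not NCCL's code: NCCL performs the allgather over its internal bootstrap network during initialization, whereas the sketch uses MPI_Allgather purely to show the pattern, and the XML content is a placeholder.

```c
/* Illustration of the exchange pattern behind "topo fusion" (not NCCL's code):
 * every node contributes its locally-detected topology blob and receives
 * everyone else's, after which each node can merge them into a single
 * clique-level view. MPI_Allgather stands in for NCCL's bootstrap allgather. */
#include <mpi.h>
#include <stdio.h>

#define TOPO_BYTES 256          /* real XML topologies are much larger */
#define MAX_NODES  16           /* arbitrary limit for this sketch     */

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if (nranks > MAX_NODES) MPI_Abort(MPI_COMM_WORLD, 1);

    /* Stand-in for the node-local XML produced by topology detection. */
    char local[TOPO_BYTES];
    snprintf(local, sizeof(local),
             "<system> node %d: local GPUs / NVLinks ... </system>", rank);

    /* After this call, every rank holds every node's topology blob. */
    char all[MAX_NODES][TOPO_BYTES];
    MPI_Allgather(local, TOPO_BYTES, MPI_CHAR,
                  all,   TOPO_BYTES, MPI_CHAR, MPI_COMM_WORLD);

    /* Here NCCL would merge the blobs into one fused, clique-level topology. */
    if (rank == 0)
        for (int n = 0; n < nranks; n++)
            printf("node %d contributed: %s\n", n, all[n]);

    MPI_Finalize();
    return 0;
}
```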

As to how the GPUs from different nodes can communicate with each other, @AddyLaddy already explained that in #1159: the cuMem API supports "fabric handles", which make it possible to share memory between GPUs as if those GPUs were on the same node. This only works if those GPUs can talk to each other using NVLink; it doesn't work over IB.

shanleo1986 commented 6 days ago

Hi @kiskra-nvidia, thank you for your response, this is very helpful to me. I still have several other questions. (1) I found the "topo fusion" source code you mentioned. If two identical nodes belong to the same MNNVL clique, is there a busid conflict inside the final XML topo file? Suppose the same GPU busid appears on each node. How does NCCL deal with this case?

(2) How does the FM distinguish different GPUs on different nodes? I checked the Fabric Manager user guide, which says FM uses the GPU Physical ID to uniquely identify each GPU. Please correct me if I am wrong. https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

(3) Can you help explain this a bit more, or is there any documentation that explains it? How does NCCL share memory between GPUs using fabric handles? Thank you a lot!

the cuMem API supports "fabric handles", which make it possible to share memory between GPUs as if those GPUs were on the same node

Thank you!

kiskra-nvidia commented 5 days ago

(1) I found the "topo fusion" source code you mentioned. If two identical nodes belong to the same MNNVL clique, is there a busid conflict inside the final XML topo file? Suppose the same GPU busid appears on each node. How does NCCL deal with this case?

Your knowledge of NCCL is quite impressive 😃. Indeed, we have had to adjust the code in various places to avoid using busid in places where it could cause problems with MNNVL. Our internal identifiers now include a systemid in high bits:

https://github.com/NVIDIA/nccl/blob/178b6b759074597777ce13438efb0e0ba625e429/src/graph/topo.h#L106-L108
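For illustration only (the real macros are in the topo.h lines linked above; the names and bit layout below are a simplified stand-in, not NCCL's exact definitions), the idea of tagging node-local ids with a per-node system id looks roughly like this:

```c
/* Simplified illustration of packing a per-node system id into the high bits
 * of an internal topology id, so that identical PCI bus ids on different
 * nodes no longer collide after topo fusion. Not NCCL's exact macros. */
#include <stdint.h>
#include <stdio.h>

#define TOPO_ID(systemid, localid)  (((int64_t)(systemid) << 56) | (int64_t)(localid))
#define TOPO_SYSTEM_ID(id)          ((id) >> 56)
#define TOPO_LOCAL_ID(id)           ((id) & 0x00ffffffffffffffLL)

int main(void)
{
    int64_t busid    = 0x0000000007000000LL;   /* same GPU bus id on both nodes */
    int64_t gpuNode0 = TOPO_ID(0, busid);       /* node 0's copy of that GPU     */
    int64_t gpuNode1 = TOPO_ID(1, busid);       /* node 1's copy of that GPU     */

    printf("node0 id=%llx node1 id=%llx distinct=%d\n",
           (long long)gpuNode0, (long long)gpuNode1, gpuNode0 != gpuNode1);
    printf("local id recovered from node1's id: %llx\n",
           (long long)TOPO_LOCAL_ID(gpuNode1));
    return 0;
}
```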

Those ids are not exposed in the XML topo file though. In the topo files, "rank" is unique across all GPU nodes. There is no single unique attribute for NET nodes; however, CPU nodes now include a host_hash attribute which is unique per OS instance, so combining CPU's host_hash and NET's dev gives a unique id.
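As a purely illustrative sketch (the helper below and the particular way the two values are combined are my own choice, not NCCL code), a globally unique key for a NET node in the fused topology could be derived from those two attributes like this:

```c
/* Hypothetical illustration: NET nodes have no single globally unique
 * attribute in the fused XML, but the owning CPU's host_hash is unique per
 * OS instance and the NET node's dev index is unique within a node, so
 * combining the two yields a unique key. The mixing function is arbitrary. */
#include <stdint.h>

static inline uint64_t netKey(uint64_t cpuHostHash, int netDev)
{
    return cpuHostHash ^ (uint64_t)(netDev + 1);
}
```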

(2) How does the FM distinguish different GPUs on different nodes? I checked the Fabric Manager user guide, which says FM uses the GPU Physical ID to uniquely identify each GPU. Please correct me if I am wrong. https://docs.nvidia.com/datacenter/tesla/pdf/fabric-manager-user-guide.pdf

Sorry, I don't know the answer to this one.

(3) Can you help explain this a bit more, or is there any documentation that explains it? How does NCCL share memory between GPUs using fabric handles? Thank you a lot!

It's really quite transparent. It works with cuMem (not with the legacy CUDA IPC). For regular intra-node P2P, we use CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR handles which are file descriptors that we pass between NCCL processes via UNIX domain sockets. For MNNVL, instead, we change the file handle type to CU_MEM_HANDLE_TYPE_FABRIC, and handles are then opaque 64-byte objects; see:

https://github.com/NVIDIA/nccl/blob/master/src/include/p2p.h
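For a concrete picture of what that path looks like at the CUDA driver API level, here is a minimal standalone sketch (not NCCL's code; error checking is omitted, and shipping the 64-byte handle between the two processes, e.g. over a socket, is assumed rather than shown):

```c
/* Minimal sketch of sharing GPU memory via cuMem fabric handles (not NCCL's
 * code). The exporter allocates memory with CU_MEM_HANDLE_TYPE_FABRIC and
 * exports an opaque 64-byte handle; the importer (possibly on another node
 * in the same NVLink clique) imports it, reserves a VA range, maps it, and
 * enables access. */
#include <cuda.h>
#include <string.h>

/* Exporter side: allocate device memory and export a fabric handle. */
CUmemFabricHandle exportBuffer(int dev, size_t size, CUmemGenericAllocationHandle *outHandle)
{
    CUmemAllocationProp prop;
    memset(&prop, 0, sizeof(prop));
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;  /* instead of POSIX_FILE_DESCRIPTOR */

    size_t gran;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size = ((size + gran - 1) / gran) * gran;                /* round up to allocation granularity */

    cuMemCreate(outHandle, size, &prop, 0);

    CUmemFabricHandle fh;                                    /* opaque 64-byte object */
    cuMemExportToShareableHandle(&fh, *outHandle, CU_MEM_HANDLE_TYPE_FABRIC, 0);
    return fh;                                               /* send these bytes to the peer */
}

/* Importer side: import and map the peer's allocation.
 * 'size' must match the exporter's granularity-aligned size. */
CUdeviceptr importBuffer(CUmemFabricHandle fh, int dev, size_t size)
{
    CUmemGenericAllocationHandle handle;
    cuMemImportFromShareableHandle(&handle, &fh, CU_MEM_HANDLE_TYPE_FABRIC);

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, size, 0, 0, 0);
    cuMemMap(ptr, size, 0, handle, 0);

    CUmemAccessDesc access;
    memset(&access, 0, sizeof(access));
    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access.location.id = dev;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, size, &access, 1);
    return ptr;                                              /* peer memory, usable as if local */
}
```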

shanleo1986 commented 5 days ago

Hi @kiskra-nvidia, many thanks to you. I have another three questions; if you can kindly give me some response, it will be appreciated.

(1) Does the usage of NCCL with mpirun change when MNNVL is supported? Let's say there are two nodes connected with NVLink, with 8 GPUs on each node.

One process per GPU, using two nodes: mpirun -np 16 -h host1:8,host2:8 all_reduce_perf -g 1 -n 20 -b 1M -e 1G -f 2

One process per GPU, using only one node in the host list: is this command supported? Since one node will handle all of the GPUs as if they were all local, I am not sure whether this can work. mpirun -np 16 -h host1:16 all_reduce_perf -g 1 -n 20 -b 1M -e 1G -f 2

One process handling 8 GPUs, using two nodes: mpirun -np 2 -h host1:1,host2:1 all_reduce_perf -g 8 -n 20 -b 1M -e 1G -f 2

(2) In which way does the Fabric Manager service manage all the nodes? Does it use the Ethernet network and sockets to communicate across processes?

(3) NVSwitch 3.0 already supported nodes connected through NVLinks, but at that time NCCL had no MNNVL feature; how did NCCL support nodes connected through NVLinks back then?

Thank you for your time and kind support.