Open shanleo2024 opened 2 months ago
Yes, MNNVL systems do push various NCCL limits and, as larger systems become available, we continue to make changes to NCCL to support them. I think 32 GPUs is what the currently available public release is known to support.
The number of channels is not related to the number of GPUs in the NVLink domain. Hopper-generation GPUs have 18 NVLink channels between the GPU and NVSwitch, and the number of NCCL channels is directly related to that. In MNNVL systems each GPU is still connected to an NVSwitch, so there is no difference in the number of channels used.
Hi @AddyLaddy, I am not asking how many channels NVLS uses, but about the maximum value of the up count. I noticed that in the kernel, NCCL scatters the input buffer using nvls->up, and the maximum count of up is 32, so at most 32 sends are possible. If there are 40 GPUs (with MNNVL), can only 32 GPUs take part in NVLink SHARP at the same time? How are the other 8 handled?
Thank you.
Hi @kiskra-nvidia, got it. Maybe this is a limitation of the current release. Looking forward to NCCL's support for larger systems. Thank you.
Hi developers, I have a question about MNNVL and NVLS. Suppose the whole topology supports MNNVL, for example 5 nodes in total with 8 GPUs each. The total GPU count is then 40, which is larger than the MAXCHANNELS value of 32 defined by NCCL. I see that the up count defined in struct ncclNvls is set to 32 in NCCL 2.22. Does this mean NVLS supports at most 32 GPUs? If so, what happens to the other GPUs in the MNNVL case?
Thank you.
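
For reference, the arithmetic behind the question can be sketched as below. This is only an illustration, not NCCL code: the function name and the UP_CAPACITY constant are hypothetical, and the value 32 is simply the limit described in the question for NCCL 2.22.

```python
# Illustrative arithmetic only; this is not NCCL code.
# UP_CAPACITY is a hypothetical name for the 32-entry limit on the
# `up` count in struct ncclNvls, as described in the question.
UP_CAPACITY = 32


def nvls_domain_overflow(nodes: int, gpus_per_node: int,
                         capacity: int = UP_CAPACITY) -> int:
    """Return how many GPUs in the NVLink domain exceed the fixed capacity."""
    total_gpus = nodes * gpus_per_node
    return max(0, total_gpus - capacity)


# The example from the question: 5 nodes with 8 GPUs each.
print(nvls_domain_overflow(nodes=5, gpus_per_node=8))  # 8
```

With 4 nodes of 8 GPUs (32 in total) the same function returns 0, which matches the 32-GPU figure mentioned in the reply above.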