Some question about NVLS and MNNVL.

shanleo2024 commented 2 months ago

Hi dear developers, I have a question about MNNVL and NVLS. Suppose the whole topology supports MNNVL, for example 5 nodes with 8 GPUs each. The total GPU count is then 40, which is larger than the MAXCHANNELS value of 32 defined by NCCL. I noticed that the size of the up array in struct ncclNvls is set to 32 (NCCL_MAX_NVLS_ARITY) in NCCL 2.22. Does that mean NVLS supports at most 32 GPUs? If so, what happens to the other GPUs in the MNNVL case?

#define NCCL_MAX_NVLS_ARITY 32
#define NCCL_MAX_NVLS_TREE_ARITY 3
struct ncclNvls {
  int out;
  int nHeads;   // Number of parallel N<->1<->net operations we'll do in parallel; size of up/down
  int headRank; // Index in 0..nHeads-1 I am the head rank of. -1 if I'm not a head rank (no local NIC)
  int up[NCCL_MAX_NVLS_ARITY];
  int down;
  int treeUp;
  int treeDown[NCCL_MAX_NVLS_TREE_ARITY];
  int node;
  int nNodes;
};

Thank you.
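
The arithmetic behind the question can be written out as a small check. This is only a sketch of the concern being raised, not NCCL code: mnnvlGpuCount and fitsInUpArray are hypothetical helpers, and the assumption that every GPU in the MNNVL domain needs its own up slot is the premise of the question, not something the thread confirms.

```c
#include <assert.h>

/* Value quoted in the question from NCCL 2.22. */
#define NCCL_MAX_NVLS_ARITY 32

/* Hypothetical helper, not part of NCCL: total GPUs in one MNNVL domain. */
int mnnvlGpuCount(int nNodes, int gpusPerNode) {
  assert(nNodes > 0 && gpusPerNode > 0);
  return nNodes * gpusPerNode;
}

/* Hypothetical helper, not part of NCCL: returns 1 if one up slot per GPU
 * would fit in an array of NCCL_MAX_NVLS_ARITY entries, 0 otherwise.
 * Assumes one slot per GPU, which is the premise of the question. */
int fitsInUpArray(int nGpus) {
  return nGpus <= NCCL_MAX_NVLS_ARITY;
}
```

With the numbers from the question, mnnvlGpuCount(5, 8) gives 40 and fitsInUpArray(40) gives 0, while a 4-node domain of 32 GPUs would still fit.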

kiskra-nvidia commented 2 months ago

Yes, MNNVL systems do push various NCCL limits and, as larger systems become available, we continue to make changes to NCCL to support them. I think 32 GPUs is what the currently available public release is known to support.

AddyLaddy commented 2 months ago

The number of channels is not related to the number of GPUs in the NVLink domain. Hopper generation GPUs have 18 NVLink channels between the GPU and NVSwitch and the number of NCCL channels is directly related to that. In MNNVL systems each GPU is still connected to an NVSwitch so there is no difference in the number of channels used.

shanleo2024 commented 2 months ago

Hi @AddyLaddy , I did not mean the number of channels used by NVLS, but the maximum size of the up array. I noticed that in the kernel NCCL scatters the input buffer using nvls->up. The array holds at most 32 entries, so at most 32 sends are possible. But if there are 40 GPUs (with MNNVL), can only 32 GPUs take part in NVLink SHARP at the same time? How are the other 8 handled?

Thank you.
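
To make the concern concrete, here is a toy model of a scatter over a fixed-size up array. It does not reproduce NCCL's kernel. struct toyNvls, toyFillUp and toyChunkSize are hypothetical names invented for illustration, and the even split of the buffer across up entries is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Value quoted in the question from NCCL 2.22. */
#define NCCL_MAX_NVLS_ARITY 32

/* Toy stand-in for the up[] member of struct ncclNvls; not the real struct. */
struct toyNvls {
  int nHeads;                   /* number of valid entries in up[] */
  int up[NCCL_MAX_NVLS_ARITY];  /* peer ranks the buffer is scattered to */
};

/* Hypothetical: fill up[] with peer ranks 0..nPeers-1, stopping at the
 * array capacity. Returns how many peers did not fit. */
int toyFillUp(struct toyNvls *nvls, int nPeers) {
  assert(nvls != NULL && nPeers >= 0);
  int n = nPeers < NCCL_MAX_NVLS_ARITY ? nPeers : NCCL_MAX_NVLS_ARITY;
  for (int i = 0; i < n; i++) nvls->up[i] = i;
  nvls->nHeads = n;
  return nPeers - n;
}

/* Hypothetical: split count elements into one chunk per up[] entry and
 * return the size of chunk number head. Any remainder is spread over the
 * first chunks, one extra element each. */
size_t toyChunkSize(const struct toyNvls *nvls, size_t count, int head) {
  assert(nvls != NULL && head >= 0 && head < nvls->nHeads);
  size_t base = count / (size_t)nvls->nHeads;
  size_t rem = count % (size_t)nvls->nHeads;
  return base + ((size_t)head < rem ? 1 : 0);
}
```

In this model, toyFillUp with 40 peers fills all 32 slots and reports 8 peers left over. Those 8 are the GPUs the question is about.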

shanleo2024 commented 2 months ago

Hi @kiskra-nvidia , Got it, maybe this is a limitation of the current release. Looking forward to NCCL's support for larger systems. Thank you.

NVIDIA / nccl

Some question about NVLS and MNNVL. #1429