Closed YJHMITWEB closed 2 weeks ago
Indeed, the nvidia-smi topology matrix does not tell you whether the 12 NVLinks are direct GPU-to-GPU connections or go to an NVSwitch connecting all GPUs.
But the NCCL topology does show that all GPUs are connected to 6 NVSwitches, which is expected: you're using a DGX A100, which is supposed to have all 8 GPUs connected through NVSwitch.
NVLS, a.k.a. NVLink SHARP, is only supported on Hopper (H100) and later, and is therefore disabled here on A100.
Thank you @sjeaugey , this is clear!!
Hi, I am trying to figure out the topology on the DGX A100 40GB node that I have access to.
First, I use `nvidia-smi topo -m` to check the links:
As shown above, each pair of GPUs is connected via `NV12`, which means there are 12 NVLinks. From what I understand, this means that GPU0, for example, has 12*(8-1)=84 NVLinks in total.
However, if I dump the NCCL topo file using `NCCL_TOPO_DUMP_FILE=topo_file`, it shows:
And I found that in fact, every GPU is connected to the same 6 targets:
To each target, there are 2 NVLinks, which means that, for GPU0 for example, the `NV12` is actually shared among its paths to all 7 other GPUs.
I further checked `comm->nvlsSupport` after `ncclNvlsInit(comm)`, and it is `0`.
If these 8 GPUs are not directly connected to each other and there is no NVSwitch, then what are the 6 targets shown in the dumped topology file?
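For anyone wanting to repeat the check on their own dump, here is a rough Python sketch that counts the `nvlink` entries under each `gpu` element of a topology XML file. It is only an illustration under assumptions: `make_sample` and `summarize_nvlinks` are hypothetical helpers, the sample XML is synthetic (made-up bus IDs, not the file from this issue), and treating `tclass="0x068000"` as "NVSwitch" is an assumption about how the dump labels switch targets.

```python
import xml.etree.ElementTree as ET

# Assumption: NVSwitch targets appear with this PCI class in the dump.
NVSWITCH_PCI_CLASS = "0x068000"


def make_sample(num_gpus=2, num_switches=6, links_per_switch=2):
    """Build a synthetic XML string shaped like a topology dump (made-up bus IDs)."""
    links = "".join(
        f'<nvlink target="0000:c{i}:00.0" count="{links_per_switch}" '
        f'tclass="{NVSWITCH_PCI_CLASS}"/>'
        for i in range(num_switches)
    )
    gpus = "".join(f'<gpu dev="{d}" sm="80">{links}</gpu>' for d in range(num_gpus))
    return f'<system version="1">{gpus}</system>'


def summarize_nvlinks(xml_text):
    """Return, per GPU dev index, the target count, total links, and target kind."""
    root = ET.fromstring(xml_text)
    summary = {}
    for gpu in root.iter("gpu"):
        links = [
            (link.get("target"), int(link.get("count")), link.get("tclass"))
            for link in gpu.iter("nvlink")
        ]
        summary[int(gpu.get("dev"))] = {
            "targets": len(links),
            "total_links": sum(count for _, count, _ in links),
            "all_switches": all(tc == NVSWITCH_PCI_CLASS for _, _, tc in links),
        }
    return summary


if __name__ == "__main__":
    for dev, info in sorted(summarize_nvlinks(make_sample()).items()):
        print(dev, info)
```

To run it on a real dump, pass the contents of the file written via `NCCL_TOPO_DUMP_FILE` to `summarize_nvlinks` instead of `make_sample()`. With 6 targets at 2 links each, the per-GPU total is 6*2=12, which matches the `NV12` reported by `nvidia-smi topo -m` rather than 12*(8-1)=84.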