Hi, I have some questions regarding NVLink bandwidth on the Hopper GPU and would appreciate your insights. NCCL appears to treat the per-link NVLink bandwidth of a Hopper GPU as 20.6 GB/s: https://github.com/NVIDIA/nccl/blob/178b6b759074597777ce13438efb0e0ba625e429/src/graph/topo.h#L17
For the Hopper GPU, NCCL uses write operations to transfer data. From my analysis with Nsight, the bandwidth consumed by request protocol data appears to be 1/8 of the user-data bandwidth, and the bandwidth consumed by response protocol data is negligible. The effective bandwidth should therefore theoretically be about 8/9 of the raw NVLink bandwidth.

I queried the per-link NVLink bandwidth with "nvidia-smi nvlink -s"; the result is about 26.5 GB/s, not the 25 GB/s reported for Ampere GPUs. Consequently, the theoretical effective bandwidth per link on Hopper should be around 26.5 * 8/9 ≈ 23.6 GB/s. However, in actual testing the bandwidth is approximately 20.6 GB/s, which matches NCCL's constant.

Which part of my understanding above is incorrect? Thank you very much for your time and assistance!
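For reference, here is a minimal sketch of the arithmetic behind my estimate. It assumes, as described above, that each 8 units of user data are accompanied by 1 unit of request protocol data on the link, and that response traffic is negligible; the 26.5 GB/s figure is the per-link value reported by "nvidia-smi nvlink -s" on my Hopper GPU.

```python
# Sketch of the protocol-overhead arithmetic described above.
# Assumption: for every 8 units of user data, 1 unit of request
# protocol data shares the link; response traffic is negligible.

raw_link_bw = 26.5        # GB/s per link, as reported by `nvidia-smi nvlink -s`
request_overhead = 1 / 8  # request protocol data relative to user data

# Fraction of the link carrying user data: 8 parts out of (8 + 1) = 8/9.
user_fraction = 1 / (1 + request_overhead)

theoretical_effective_bw = raw_link_bw * user_fraction
print(f"user fraction: {user_fraction:.4f}")                             # 0.8889
print(f"theoretical effective bw: {theoretical_effective_bw:.1f} GB/s")  # 23.6

measured_bw = 20.6  # GB/s, the value NCCL assumes and what my tests show
print(f"unexplained gap: {theoretical_effective_bw - measured_bw:.1f} GB/s")
```

The ~3 GB/s gap between 23.6 and 20.6 is exactly the part my model above fails to account for.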