Performance differences in HGX A800 single-server N-GPU nccl-tests results

Problem description: On an HGX A800, we ran the NCCL tests on a single machine with N GPUs and found that the performance bottleneck is in NVLink in every case. However, performance differs significantly between the 2-GPU, 4-GPU, and 8-GPU runs. What is the reason for this?

When msgSize=4G, the HGX A800 single-machine nccl-tests results for 2, 4, and 8 GPUs are 143 GB/s, 156 GB/s, and 156 GB/s, respectively. When msgSize=256M, they are 130 GB/s, 145 GB/s, and 151 GB/s, respectively.
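For context when reading these figures: the issue does not say which collective or which output column was recorded. Assuming they are the bus bandwidth (busbw) column of all_reduce_perf, nccl-tests derives busbw from the algorithm bandwidth with the factor 2(n-1)/n, which is intended to make results comparable across GPU counts. A minimal sketch of that conversion follows; the function name is ours for illustration and is not part of nccl-tests.

```python
def allreduce_busbw_factor(n_ranks: int) -> float:
    """Factor applied to all-reduce algbw to obtain busbw: 2(n-1)/n.

    Assumption: the numbers in this issue come from all_reduce_perf;
    other collectives use different factors.
    """
    if n_ranks < 2:
        raise ValueError("need at least 2 ranks")
    return 2 * (n_ranks - 1) / n_ranks


# Show the factor for the three GPU counts discussed in the issue.
for n in (2, 4, 8):
    print(f"{n} GPUs: busbw = algbw * {allreduce_busbw_factor(n):.2f}")
```

If that assumption holds, the busbw values for 2, 4, and 8 GPUs would be expected to be close to each other when the same link is the limit, which is why the gap reported here stands out.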

For comparison, we ran the same single-machine multi-GPU nccl-tests on an HGX H800. When msgSize=4G, there is no performance difference between the 2-GPU, 4-GPU, and 8-GPU runs. When msgSize=256M, the results are 150 GB/s, 157 GB/s, and 160 GB/s, respectively.
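To put the gaps in proportion, the reported figures can be expressed as a shortfall relative to the 8-GPU result. This is only arithmetic on the numbers quoted above; the H800 msgSize=4G case is left out because no figures were given for it, and the helper name is ours for illustration.

```python
# Reported bandwidth in GB/s for 2, 4 and 8 GPUs, copied from the issue text.
reported = {
    ("A800", "4G"): (143, 156, 156),
    ("A800", "256M"): (130, 145, 151),
    ("H800", "256M"): (150, 157, 160),
}


def shortfall_vs_8gpu(values):
    """Percent by which the 2- and 4-GPU results fall below the 8-GPU result."""
    two, four, eight = values
    return tuple(100 * (eight - v) / eight for v in (two, four))


for (system, msg_size), values in reported.items():
    gap2, gap4 = shortfall_vs_8gpu(values)
    print(f"{system} msgSize={msg_size}: 2 GPUs -{gap2:.1f}%, 4 GPUs -{gap4:.1f}%")
```

On these numbers the 2-GPU A800 run is roughly 8% below the 8-GPU run at msgSize=4G and roughly 14% below at msgSize=256M, while the H800 2-GPU run is about 6% below at msgSize=256M.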

For detailed test results, please refer to the attachment. Thanks a lot!

Attachment: Differences problems in performance data of HGX A800 single server N GPUs nccl testing.docx

NVIDIA/nccl-tests #210