Problem description:
We ran single-machine N-GPU NCCL tests on HGX A800 and found that the performance bottleneck is NVLink in every case.
However, performance differs significantly between the single-machine 2-GPU, 4-GPU, and 8-GPU runs.
What is the reason for this?
With msgSize=4G, the HGX A800 single-machine nccl-tests results for 2/4/8 GPUs are 143 GB/s, 156 GB/s, and 156 GB/s, respectively.
With msgSize=256M, the HGX A800 single-machine nccl-tests results for 2/4/8 GPUs are 130 GB/s, 145 GB/s, and 151 GB/s, respectively.
For comparison, we also ran single-machine multi-GPU nccl-tests on HGX H800.
With msgSize=4G, HGX H800 shows no performance difference between the 2-GPU, 4-GPU, and 8-GPU runs.
With msgSize=256M, HGX H800 reaches 150 GB/s, 157 GB/s, and 160 GB/s for 2/4/8 GPUs, respectively.
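To make the gaps easier to compare, here is a small Python sketch that expresses each result as a percentage shortfall relative to the 8-GPU figure. The bandwidth numbers are the ones quoted above; the `results` layout and the `gap_vs_8gpu` helper are illustrative names only, and the sketch assumes all figures are the same nccl-tests metric.

```python
# Reported bandwidths in GB/s, copied from the figures above, keyed by GPU count.
# H800 at msgSize=4G is omitted because no numbers were quoted for it.
results = {
    ("A800", "4G"): {2: 143, 4: 156, 8: 156},
    ("A800", "256M"): {2: 130, 4: 145, 8: 151},
    ("H800", "256M"): {2: 150, 4: 157, 8: 160},
}


def gap_vs_8gpu(bw_by_gpus):
    """Return each GPU count's percentage shortfall relative to the 8-GPU figure."""
    ref = bw_by_gpus[8]
    return {n: round(100.0 * (ref - bw) / ref, 1) for n, bw in bw_by_gpus.items()}


for (platform, size), bw in results.items():
    print(platform, size, gap_vs_8gpu(bw))
```

By this measure the 2-GPU run on A800 falls roughly 8% short of the 8-GPU run at msgSize=4G and roughly 14% short at msgSize=256M, while on H800 the 2-GPU shortfall at msgSize=256M is roughly 6%.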
For detailed test results, please refer to the attachment. Thanks a lot!
Attachment: Differences problems in performance data of HGX A800 single server N GPUs nccl testing.docx