NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Only ~783 GB/s out of theoretical 900 GB/s on HGX H100 SXM NVLink4 #1264

Open OrenLeung opened 4 months ago

OrenLeung commented 4 months ago

Hi! I am running the NVIDIA-provided p2p bandwidth test and only achieved a bidirectional bandwidth of 749 GB/s out of the marketed theoretical 900 GB/s, and a unidirectional bandwidth of 380 GB/s out of the theoretical 450 GB/s, on H100 SXM NVLink4. I see that @stas00 was only able to achieve 376 GB/s too (stas results).

749 out of 900 means that even in the best case of this p2p test, only about 83% of the marketed theoretical peak bandwidth was achieved.
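
As a quick sanity check of those percentages (a trivial host-side C++ snippet; 749.03 and 380.77 GB/s are entries taken from the matrices below):

```cpp
// Sanity-check the efficiency figures quoted above.
#include <cstdio>

int main() {
    // Representative best-case entries from the matrices below (GB/s).
    const double bidi = 749.03, bidi_peak = 900.0;
    const double uni  = 380.77, uni_peak  = 450.0;
    std::printf("bidirectional:  %.1f%% of peak\n", 100.0 * bidi / bidi_peak); // ~83.2%
    std::printf("unidirectional: %.1f%% of peak\n", 100.0 * uni  / uni_peak);  // ~84.6%
}
```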

The reprod script, full output, and full setup are provided below for convenience. Please let me know if this is expected or if I am missing something.

Bidirectional results (749 GB/s out of 900 GB/s)

```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 2573.14 742.21 744.11 749.03 740.60 742.12 742.30 741.20 
     1 744.49 2578.12 741.71 742.24 741.40 741.98 743.15 741.97 
     2 740.63 773.25 2569.51 744.03 741.82 741.31 752.89 749.18 
     3 739.42 742.34 772.86 2574.53 742.26 741.05 741.21 741.51 
     4 748.38 742.66 740.10 741.47 2573.67 741.88 742.54 741.77 
     5 748.36 741.19 740.85 741.14 740.29 2578.38 744.75 742.73 
     6 748.86 741.66 741.87 743.75 739.95 741.62 2572.61 741.75 
     7 748.57 741.45 741.02 743.01 741.86 741.53 740.83 2576.72 
```

Unidirectional results (380 GB/s out of 450 GB/s)

```
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 2494.76 371.02 376.65 376.60 376.82 375.59 375.94 375.60 
     1 376.53 2528.57 376.11 376.37 376.18 377.11 375.93 376.86 
     2 368.36 393.03 2514.46 378.81 376.32 375.96 376.23 376.12 
     3 381.44 375.28 392.30 2519.65 376.47 376.62 375.99 380.77 
     4 379.53 375.94 375.42 392.11 2510.29 375.40 376.02 375.61 
     5 378.61 376.63 377.58 376.30 376.09 2520.54 376.35 375.54 
     6 379.78 376.04 375.99 376.17 376.50 376.45 2519.53 375.25 
     7 380.27 376.79 375.69 375.63 376.25 376.38 376.12 2519.91 
```

Full Results Logs

```
$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 80GB HBM3, pciBusID: 18, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA H100 80GB HBM3, pciBusID: 2a, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA H100 80GB HBM3, pciBusID: 3a, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA H100 80GB HBM3, pciBusID: 5d, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA H100 80GB HBM3, pciBusID: 9a, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA H100 80GB HBM3, pciBusID: ab, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA H100 80GB HBM3, pciBusID: ba, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA H100 80GB HBM3, pciBusID: db, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
   D\D     0     1     2     3     4     5     6     7
     0     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1
     2     1     1     1     1     1     1     1     1
     3     1     1     1     1     1     1     1     1
     4     1     1     1     1     1     1     1     1
     5     1     1     1     1     1     1     1     1
     6     1     1     1     1     1     1     1     1
     7     1     1     1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 2474.76 37.19 37.20 36.91 37.06 37.16 37.28 37.24 
     1 36.36 2512.44 37.35 36.32 36.95 37.12 36.74 36.30 
     2 36.34 36.27 2498.88 36.46 37.51 37.54 36.37 37.23 
     3 36.74 37.12 37.23 2499.38 37.00 37.86 36.69 37.02 
     4 36.87 37.10 37.36 37.14 2499.63 37.26 37.25 37.47 
     5 37.45 37.52 37.02 37.56 37.07 2514.71 38.05 37.26 
     6 36.79 36.40 37.49 37.41 37.31 37.10 2504.76 37.38 
     7 37.02 37.38 37.24 37.09 37.92 37.46 37.38 2503.38 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 2494.76 371.02 376.65 376.60 376.82 375.59 375.94 375.60 
     1 376.53 2528.57 376.11 376.37 376.18 377.11 375.93 376.86 
     2 368.36 393.03 2514.46 378.81 376.32 375.96 376.23 376.12 
     3 381.44 375.28 392.30 2519.65 376.47 376.62 375.99 380.77 
     4 379.53 375.94 375.42 392.11 2510.29 375.40 376.02 375.61 
     5 378.61 376.63 377.58 376.30 376.09 2520.54 376.35 375.54 
     6 379.78 376.04 375.99 376.17 376.50 376.45 2519.53 375.25 
     7 380.27 376.79 375.69 375.63 376.25 376.38 376.12 2519.91 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 2573.47 44.34 44.26 44.08 50.61 51.42 50.92 50.89 
     1 45.51 2579.38 45.63 45.12 51.87 52.10 51.28 51.96 
     2 43.98 44.81 2576.59 44.28 51.03 51.37 50.54 51.37 
     3 43.96 44.77 44.60 2579.18 51.01 50.91 51.17 50.50 
     4 50.88 51.46 50.95 50.70 2580.71 51.44 51.62 51.43 
     5 51.21 50.97 51.35 50.77 51.18 2577.78 51.51 51.32 
     6 50.89 50.99 50.82 50.91 52.17 51.24 2578.98 51.73 
     7 50.84 51.48 51.18 51.18 51.38 51.69 51.49 2579.91 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 2573.14 742.21 744.11 749.03 740.60 742.12 742.30 741.20 
     1 744.49 2578.12 741.71 742.24 741.40 741.98 743.15 741.97 
     2 740.63 773.25 2569.51 744.03 741.82 741.31 752.89 749.18 
     3 739.42 742.34 772.86 2574.53 742.26 741.05 741.21 741.51 
     4 748.38 742.66 740.10 741.47 2573.67 741.88 742.54 741.77 
     5 748.36 741.19 740.85 741.14 740.29 2578.38 744.75 742.73 
     6 748.86 741.66 741.87 743.75 739.95 741.62 2572.61 741.75 
     7 748.57 741.45 741.02 743.01 741.86 741.53 740.83 2576.72 
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5      6      7 
     0   2.38  13.56  12.75  12.92  13.35  13.31  15.03  19.14 
     1  14.85   2.14  13.06  13.67  13.83  14.08  13.77  15.11 
     2  13.41  12.83   2.31  13.39  18.86  20.05  21.71  13.84 
     3  12.52  13.17  13.22   2.18  13.42  14.29  15.51  14.92 
     4  12.89  13.60  13.19  13.11   2.32  12.82  21.12  21.12 
     5  12.78  12.91  12.67  12.34  21.11   2.16  12.70  12.69 
     6  12.65  14.00  12.62  12.84  21.27  21.35   2.22  12.96 
     7  12.75  13.16  13.49  12.82  12.78  12.78  21.37   2.13 
   CPU     0      1      2      3      4      5      6      7 
     0   2.29   6.90   6.73   6.79   6.23   6.29   6.36   6.20 
     1   6.74   2.27   6.77   6.92   6.28   6.48   6.39   6.31 
     2   6.76   6.80   2.11   6.89   6.28   6.44   6.34   6.17 
     3   6.65   6.79   6.69   2.14   6.40   6.40   6.41   6.22 
     4   6.37   6.49   6.36   6.43   2.04   6.57   6.52   6.35 
     5   6.97   7.00   6.85   6.97   6.05   2.02   6.10   5.97 
     6   6.40   6.52   6.40   6.49   6.02   6.00   2.00   5.94 
     7   6.39   6.44   6.34   6.45   6.02   6.02   5.94   1.98 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7 
     0   2.36   3.29   2.26   2.77   2.77   2.77   2.25   2.81 
     1   2.27   2.11   2.27   2.31   2.26   2.80   2.25   2.83 
     2   2.27   2.77   2.33   2.78   2.78   2.75   2.78   2.28 
     3   3.31   2.79   2.83   2.19   2.30   2.80   2.80   2.80 
     4   2.93   2.40   2.95   2.89   2.34   2.89   2.90   2.93 
     5   2.38   2.33   2.33   2.33   2.36   2.14   2.34   2.32 
     6   2.94   2.87   2.91   2.34   2.87   2.95   2.24   2.92 
     7   2.91   2.33   2.89   2.88   2.88   2.88   2.87   2.12 
   CPU     0      1      2      3      4      5      6      7 
     0   2.24   1.79   1.77   1.77   1.78   1.78   1.79   1.76 
     1   1.86   2.23   1.82   1.82   1.79   1.83   1.81   1.80 
     2   1.85   1.81   2.22   1.80   1.80   1.85   1.80   1.78 
     3   1.89   1.82   1.81   2.21   1.85   1.79   1.81   1.81 
     4   1.69   1.66   1.65   1.66   1.99   1.64   1.69   1.67 
     5   1.73   1.75   1.67   1.68   1.69   2.02   1.69   1.69 
     6   1.73   1.68   1.70   1.74   1.70   1.72   2.07   1.71 
     7   1.79   1.68   1.76   1.68   1.68   1.70   1.69   2.03 
```

Reprod

```bash
git clone https://github.com/NVIDIA/cuda-samples
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
```
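
For context, the core of what this sample measures looks roughly like the sketch below (a minimal sketch, not the actual cuda-samples source; the GPU pair, buffer size, and iteration count are arbitrary assumptions). Build with `nvcc`:

```cpp
// Minimal sketch of a unidirectional GPU-to-GPU P2P bandwidth measurement:
// enable peer access, then time repeated cudaMemcpyPeerAsync() calls with
// CUDA events. Not the actual p2pBandwidthLatencyTest implementation.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int src = 0, dst = 1;          // GPU pair under test (assumption)
    const size_t bytes = 256ull << 20;   // 256 MiB per copy (assumption)
    const int iters = 100;

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, src, dst);
    if (!canAccess) {
        std::printf("No P2P path between GPU%d and GPU%d\n", src, dst);
        return 1;
    }

    // Enable peer access in both directions and allocate one buffer per GPU.
    void *srcBuf = nullptr, *dstBuf = nullptr;
    cudaSetDevice(dst);
    cudaDeviceEnablePeerAccess(src, 0);
    cudaMalloc(&dstBuf, bytes);
    cudaSetDevice(src);
    cudaDeviceEnablePeerAccess(dst, 0);
    cudaMalloc(&srcBuf, bytes);

    // Time a burst of peer-to-peer copies on the source GPU's default stream.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(dstBuf, dst, srcBuf, src, bytes, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    const double gbps = static_cast<double>(bytes) * iters / (ms * 1e-3) / 1e9;
    std::printf("GPU%d -> GPU%d: %.1f GB/s\n", src, dst, gbps);
}
```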

Setup

```
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```


## nvidia-smi topology (NV18 between every pair of GPUs within the node)
```bash
nvidia-smi topo -m
    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-55,112-167    0       N/A
GPU1    NV18     X  NV18    NV18    NV18    NV18    NV18    NV18    SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS 0-55,112-167    0       N/A
GPU2    NV18    NV18     X  NV18    NV18    NV18    NV18    NV18    SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS 0-55,112-167    0       N/A
GPU3    NV18    NV18    NV18     X  NV18    NV18    NV18    NV18    SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS 0-55,112-167    0       N/A
GPU4    NV18    NV18    NV18    NV18     X  NV18    NV18    NV18    SYS SYS SYS SYS SYS SYS PIX SYS SYS SYS 56-111,168-223  1       N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X  NV18    NV18    SYS SYS SYS SYS SYS SYS SYS PIX SYS SYS 56-111,168-223  1       N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X  NV18    SYS SYS SYS SYS SYS SYS SYS SYS PIX SYS 56-111,168-223  1       N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X  SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX 56-111,168-223  1       N/A
NIC0    PIX SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS SYS SYS SYS SYS SYS SYS             
NIC1    SYS PIX SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS SYS SYS SYS SYS SYS             
NIC2    SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS SYS SYS SYS SYS             
NIC3    SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  PXB SYS SYS SYS SYS SYS             
NIC4    SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB  X  SYS SYS SYS SYS SYS             
NIC5    SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS SYS             
NIC6    SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS             
NIC7    SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  SYS SYS             
NIC8    SYS SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  SYS             
NIC9    SYS SYS SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS  X              

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
```
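
(For reference, and assuming the standard HGX H100 SXM figures: NV18 means each GPU pair is connected through a bonded set of 18 NVLink4 links, and at ~25 GB/s per direction per link that works out to 18 × 25 = 450 GB/s per direction, i.e. the 900 GB/s bidirectional number the results above are being compared against.)
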
AddyLaddy commented 4 months ago

This is not a NCCL issue. I suggest you contact your vendor or Nvidia technical sales representative.

OrenLeung commented 4 months ago

> This is not a NCCL issue. I suggest you contact your vendor or Nvidia technical sales representative.

thanks.

On the bright side, I am now at 783 GB/s by using the max shape, but I am still missing ~100 GB/s:

```bash
./p2pBandwidthLatencyTest --numElems=10474830000
```

```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 2575.02 783.54 783.90 783.87 783.59 783.83 783.73 783.72 
     1 783.58 2662.83 783.63 783.72 783.69 783.73 783.56 783.67 
     2 783.85 783.69 2661.70 783.89 783.81 783.82 783.67 783.78 
     3 783.66 783.68 783.73 2662.10 783.86 783.94 783.80 783.65 
     4 783.80 783.70 783.64 783.83 2662.24 783.83 783.76 783.88 
     5 783.73 783.67 783.87 783.91 783.60 2662.16 783.55 783.84 
     6 783.61 783.69 783.71 783.69 783.69 783.76 2663.57 783.61 
     7 783.87 783.58 783.76 783.67 783.91 783.94 783.66 2662.34
```
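
One way to see why the larger transfer size helps (a simple back-of-the-envelope model, not from this thread): if each copy pays a fixed setup/launch overhead t0 on top of the wire time size/peak, the measured bandwidth is size / (size/peak + t0), which only approaches the link peak as the transfer grows. A toy illustration with assumed numbers:

```cpp
// Toy model (assumption): measured bandwidth vs. transfer size when each copy
// pays a fixed overhead t0 in addition to the wire time size/peak.
#include <cstdio>

int main() {
    const double peak = 900e9;   // assumed bidirectional peak, bytes/s
    const double t0   = 10e-6;   // assumed fixed per-copy overhead, seconds
    const double sizes[] = {40e6, 1e9, 40e9};  // example transfer sizes, bytes
    for (double s : sizes) {
        const double bw = s / (s / peak + t0);
        std::printf("%12.0f bytes -> %6.1f GB/s\n", s, bw / 1e9);
    }
}
```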