Open chinthysl opened 6 months ago
./nccl-tests/build/all_reduce_perf -b 1G -e 8G -f 2 -g 2
H100 PCIe
... NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
...
[1] NCCL INFO Channel 00/0 : 1[21000] -> 0[1000] via P2P/direct pointer
...
#                                                             out-of-place                       in-place
#       size         count      type   redop    root      time   algbw   busbw #wrong      time   algbw   busbw #wrong
#        (B)    (elements)                                (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
  1073741824     268435456     float     sum      -1    9306.7  115.37  115.37      0    9291.5  115.56  115.56      0
  2147483648     536870912     float     sum      -1     18535  115.86  115.86      0     18524  115.93  115.93      0
  4294967296    1073741824     float     sum      -1     36943  116.26  116.26      0     36917  116.34  116.34      0
  8589934592    2147483648     float     sum      -1     73849  116.32  116.32      0     73767  116.45  116.45      0
A100 SXM4
[1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] -1/-1/-1->1->0 [7] -1/-1/-1->1->0 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] -1/-1/-1->1->0 [19] -1/-1/-1->1->0 [20] -1/-1/-1->1->0 [21] -1/-1/-1->1->0 [22] -1/-1/-1->1->0 [23] -1/-1/-1->1->0
...
[1] NCCL INFO Channel 00/0 : 1[cb000] -> 0[c8000] via P2P/direct pointer/read
...
#                                                             out-of-place                       in-place
#       size         count      type   redop    root      time   algbw   busbw #wrong      time   algbw   busbw #wrong
#        (B)    (elements)                                (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
  1073741824     268435456     float     sum      -1    5219.4  205.72  205.72      0    5227.2  205.41  205.41      0
  2147483648     536870912     float     sum      -1     10046  213.77  213.77      0     10045  213.78  213.78      0
  4294967296    1073741824     float     sum      -1     19513  220.11  220.11      0     19552  219.67  219.67      0
  8589934592    2147483648     float     sum      -1     38564  222.75  222.75      0     38588  222.60  222.60      0
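For anyone cross-checking the numbers, the `algbw` and `busbw` columns above can be recomputed from `size` and `time`. This is a minimal sketch, assuming the definitions documented for nccl-tests (`algbw = size / time`, and for all-reduce `busbw = algbw * 2*(n-1)/n`). The function name `allreduce_bandwidth` and the `DATASHEET_GBPS` constant are made up for illustration, and the percentage is only a ratio against the datasheet figure quoted in this post, not a statement about what the link should reach.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s using the nccl-tests definitions."""
    # Algorithm bandwidth: bytes moved per second, reported in GB/s (1e9 bytes).
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # All-reduce bus bandwidth correction factor: 2 * (n - 1) / n.
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Out-of-place rows for the 8 GiB message, copied from the tables above.
rows = {
    "H100 PCIe": (8589934592, 73849),
    "A100 SXM4": (8589934592, 38564),
}

# Assumption: the 600 GB/s datasheet figure quoted in this post.
DATASHEET_GBPS = 600.0

for name, (size, time_us) in rows.items():
    algbw, busbw = allreduce_bandwidth(size, time_us, n_ranks=2)
    share = busbw / DATASHEET_GBPS
    print(f"{name}: algbw={algbw:.2f} GB/s, busbw={busbw:.2f} GB/s "
          f"({share:.0%} of {DATASHEET_GBPS:.0f} GB/s)")
```

With `-g 2` the correction factor is 1, which is why `algbw` and `busbw` are identical in both tables. Recomputed values can differ from the printed ones in the last digit because the `time` column is rounded.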
The GPU datasheets list 600 GB/s of bandwidth for both the H100 PCIe and the A100 SXM4. Below are my p2pBandwidthLatencyTest results.
H100 PCIe
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1676.95  336.21  104.50   76.25   71.94   70.57   65.92   64.72
     1  336.77 1221.14   73.51   73.59   66.87   66.53   65.66   70.42
     2  105.07   75.61 1674.17  336.95   64.95   64.59   68.98   72.20
     3   75.84   73.63  337.74 1221.57   65.84   69.05   65.77   66.25
     4   71.79   65.49   66.65   64.98 1676.61  339.44   75.36   76.10
     5   69.16   66.69   65.28   70.19  339.69 1218.43   73.64   74.15
     6   66.41   65.72   71.44   65.11   75.44   73.29 1220.27  362.06
     7   66.96   69.93   72.22   66.47   75.84   73.93  360.87 1220.78
A100 SXM4
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1562.50  423.16  425.59  424.77  424.77  423.97  425.12  424.63
     1  424.20 1553.95  424.65  425.24  424.66  425.34  424.66  425.56
     2  423.39  425.24 1548.56  424.66  425.00  425.24  425.47  423.28
     3  426.68  427.07  427.13 1562.50  430.04  429.44  429.44  430.15
     4  426.60  426.68  427.34  432.31 1601.74  517.65  517.48  519.19
     5  426.48  426.30  426.89  431.91  517.65 1593.57  517.65  518.33
     6  426.44  426.91  426.90  430.63  518.85  519.35 1600.10  519.89
     7  426.40  426.96  427.42  428.36  519.08  519.02  519.01 1596.83
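To make the difference between the two matrices easier to see, the sketch below lists, for each GPU, the peers whose bidirectional bandwidth exceeds a cutoff. It uses the first matrix above (the one with the ~336 GB/s pairs). The helper names `parse_matrix` and `fast_peers` and the 200 GB/s threshold are arbitrary choices for illustration, not anything defined by the test tool.

```python
# Rows of the first matrix above, with the leading GPU index on each line.
FIRST_MATRIX = """\
0 1676.95 336.21 104.50 76.25 71.94 70.57 65.92 64.72
1 336.77 1221.14 73.51 73.59 66.87 66.53 65.66 70.42
2 105.07 75.61 1674.17 336.95 64.95 64.59 68.98 72.20
3 75.84 73.63 337.74 1221.57 65.84 69.05 65.77 66.25
4 71.79 65.49 66.65 64.98 1676.61 339.44 75.36 76.10
5 69.16 66.69 65.28 70.19 339.69 1218.43 73.64 74.15
6 66.41 65.72 71.44 65.11 75.44 73.29 1220.27 362.06
7 66.96 69.93 72.22 66.47 75.84 73.93 360.87 1220.78
"""


def parse_matrix(text):
    """Parse whitespace-separated rows, dropping the leading GPU index."""
    matrix = []
    for line in text.strip().splitlines():
        fields = line.split()
        matrix.append([float(value) for value in fields[1:]])
    return matrix


def fast_peers(matrix, threshold_gbps):
    """Map each GPU to the peers whose bandwidth exceeds threshold_gbps."""
    return {
        i: [j for j, bw in enumerate(row) if j != i and bw > threshold_gbps]
        for i, row in enumerate(matrix)
    }


# Assumption: 200 GB/s is an arbitrary cutoff chosen for this illustration.
peers = fast_peers(parse_matrix(FIRST_MATRIX), threshold_gbps=200.0)
for gpu, fast in peers.items():
    print(f"GPU {gpu}: fast peers {fast}")
```

With this cutoff, each GPU in the first matrix has exactly one fast peer (0 with 1, 2 with 3, 4 with 5, 6 with 7), while every off-diagonal entry in the second matrix is above 400 GB/s.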