NVIDIA / nccl-tests

NCCL Tests

hypercube out-of-bound errors with single-proc + `gpus-per-thread=4`, not with multi-proc + `gpus-per-thread=1` #190

Open robogast opened 7 months ago

robogast commented 7 months ago

Tested with:

I tried:

Outputs:

single-node, single-process -> fail ❌
```bash
$ srun -N 1 -n 1 --gpus-per-node 4 -p gpu bash -c "module load 2023 NCCL/2.19.3-GCCcore-12.3.0-CUDA-12.1.1 OpenMPI/4.1.5-GCC-12.3.0; export OMPI_MCA_pml=ucx; export NCCL_P2P_DIRECT_DISABLE=1; ./build/hypercube_perf --nthreads 1 --ngpus 4 --datatype bfloat16 --minbytes 8 --maxbytes 8G --stepfactor 2"
[gcn63.local.snellius.surf.nl:3158360] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3158360] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3158360] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3158360] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
# nThread 1 nGpus 4 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3158360 on gcn63 device  0 [0x31] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 3158360 on gcn63 device  1 [0x32] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 3158360 on gcn63 device  2 [0xca] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 3158360 on gcn63 device  3 [0xe3] NVIDIA A100-SXM4-40GB
#
#                                                         out-of-place                            in-place
#       size       count      type  redop root     time   algbw   busbw      #wrong     time   algbw   busbw      #wrong
#        (B)  (elements)                           (us)  (GB/s)  (GB/s)                 (us)  (GB/s)  (GB/s)
           0           0  bfloat16          -1    22.64    0.00    0.00           0    21.24    0.00    0.00           0
           0           0  bfloat16          -1    22.07    0.00    0.00           0    21.44    0.00    0.00           0
          32           4  bfloat16          -1    35.16    0.00    0.00           8    21.60    0.00    0.00           8
          64           8  bfloat16          -1    35.77    0.00    0.00          16    21.86    0.00    0.00          16
         128          16  bfloat16          -1    35.35    0.00    0.00          32    21.71    0.00    0.00          32
         256          32  bfloat16          -1    35.41    0.01    0.01          64    21.86    0.01    0.01          64
         512          64  bfloat16          -1    35.42    0.01    0.01         128    21.48    0.02    0.02         128
        1024         128  bfloat16          -1    35.66    0.02    0.02         256    21.54    0.04    0.04         256
        2048         256  bfloat16          -1    35.52    0.04    0.04         512    21.43    0.07    0.07         512
        4096         512  bfloat16          -1    35.36    0.09    0.09        1024    21.80    0.14    0.14        1024
        8192        1024  bfloat16          -1    35.26    0.17    0.17        2048    21.96    0.28    0.28        2048
       16384        2048  bfloat16          -1    35.74    0.34    0.34        4861    21.87    0.56    0.56        4287
       32768        4096  bfloat16          -1    35.38    0.69    0.69        8192    21.33    1.15    1.15        8192
       65536        8192  bfloat16          -1    35.80    1.37    1.37       20861    22.18    2.22    2.22       22528
      131072       16384  bfloat16          -1    35.93    2.74    2.74       65532    21.96    4.48    4.48       65020
      262144       32768  bfloat16          -1    36.83    5.34    5.34      131068    23.86    8.24    8.24      131068
      524288       65536  bfloat16          -1    39.81    9.88    9.88      262136    26.38   14.90   14.90      262136
     1048576      131072  bfloat16          -1    47.17   16.67   16.67      524280    34.45   22.83   22.83      524280
     2097152      262144  bfloat16          -1    54.37   28.93   28.93  1.0227e+06    41.21   38.17   38.17 1.00676e+06
     4194304      524288  bfloat16          -1    71.91   43.75   43.75 2.07766e+06    66.48   47.32   47.32  2.0399e+06
     8388608     1048576  bfloat16          -1    120.3   52.29   52.29 4.15315e+06    112.8   55.77   55.77 4.17081e+06
    16777216     2097152  bfloat16          -1    182.6   68.91   68.91 7.47879e+06    169.3   74.34   74.34 7.29921e+06
    33554432     4194304  bfloat16          -1    301.5   83.46   83.46 9.95825e+06    285.0   88.31   88.31 8.48794e+06
    67108864     8388608  bfloat16          -1    546.2   92.15   92.15 1.66993e+07    515.2   97.68   97.68  1.6777e+07
   134217728    16777216  bfloat16          -1   1040.4   96.75   96.75 3.12366e+07    974.8  103.27  103.27 3.23955e+07
   268435456    33554432  bfloat16          -1   2020.9   99.62   99.62 6.71078e+07   1896.1  106.18  106.18  5.1133e+07
   536870912    67108864  bfloat16          -1   3945.0  102.07  102.07 1.17161e+08   3717.1  108.33  108.33 1.24887e+08
  1073741824   134217728  bfloat16          -1   7788.9  103.39  103.39 1.66208e+08   7340.7  109.70  109.70 1.59109e+08
  2147483648   268435456  bfloat16          -1    15476  104.07  104.07 2.90276e+08    14612  110.23  110.23 2.87382e+08
  4294967296   536870912  bfloat16          -1    30821  104.52  104.52 5.64102e+08    29108  110.66  110.66 5.55923e+08
  8589934592  1073741824  bfloat16          -1    61457  104.83  104.83 1.04678e+09    58151  110.79  110.79  1.0145e+09
# Out of bounds values : 58 FAILED
# Avg bus bandwidth    : 37.7068
#
```

single-node, multi-process -> success ✅
```bash
$ srun -N 1 -n 4 --gpus-per-node 4 -p gpu bash -c "module load 2023 NCCL/2.19.3-GCCcore-12.3.0-CUDA-12.1.1 OpenMPI/4.1.5-GCC-12.3.0; export OMPI_MCA_pml=ucx; export NCCL_P2P_DIRECT_DISABLE=1; ./build/hypercube_perf --nthreads 1 --ngpus 1 --datatype bfloat16 --minbytes 8 --maxbytes 8G --stepfactor 2"
[gcn63.local.snellius.surf.nl:3160580] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3160580] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3160580] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3160580] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3160584] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3160584] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3160584] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3160584] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3160582] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3160582] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3160582] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3160582] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3160586] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3160586] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3160586] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3160586] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3160582 on gcn63 device  0 [0x31] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 3160584 on gcn63 device  1 [0x32] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 3160580 on gcn63 device  2 [0xca] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 3160586 on gcn63 device  3 [0xe3] NVIDIA A100-SXM4-40GB
#
#                                                         out-of-place                            in-place
#       size       count      type  redop root     time   algbw   busbw      #wrong     time   algbw   busbw      #wrong
#        (B)  (elements)                           (us)  (GB/s)  (GB/s)                 (us)  (GB/s)  (GB/s)
           0           0  bfloat16          -1    16.22    0.00    0.00           0    16.32    0.00    0.00           0
           0           0  bfloat16          -1    16.30    0.00    0.00           0    16.31    0.00    0.00           0
          32           4  bfloat16          -1    21.14    0.00    0.00           0    18.20    0.00    0.00           0
          64           8  bfloat16          -1    21.04    0.00    0.00           0    18.33    0.00    0.00           0
         128          16  bfloat16          -1    21.00    0.00    0.00           0    18.28    0.01    0.01           0
         256          32  bfloat16          -1    21.11    0.01    0.01           0    18.36    0.01    0.01           0
         512          64  bfloat16          -1    21.11    0.02    0.02           0    18.12    0.02    0.02           0
        1024         128  bfloat16          -1    21.21    0.04    0.04           0    18.20    0.04    0.04           0
        2048         256  bfloat16          -1    21.11    0.07    0.07           0    18.41    0.08    0.08           0
        4096         512  bfloat16          -1    21.43    0.14    0.14           0    18.70    0.16    0.16           0
        8192        1024  bfloat16          -1    22.16    0.28    0.28           0    19.30    0.32    0.32           0
       16384        2048  bfloat16          -1    23.46    0.52    0.52           0    20.78    0.59    0.59           0
       32768        4096  bfloat16          -1    26.49    0.93    0.93           0    23.47    1.05    1.05           0
       65536        8192  bfloat16          -1    32.49    1.51    1.51           0    29.91    1.64    1.64           0
      131072       16384  bfloat16          -1    39.03    2.52    2.52           0    36.42    2.70    2.70           0
      262144       32768  bfloat16          -1    43.57    4.51    4.51           0    40.49    4.86    4.86           0
      524288       65536  bfloat16          -1    45.61    8.62    8.62           0    42.52    9.25    9.25           0
     1048576      131072  bfloat16          -1    51.09   15.39   15.39           0    49.47   15.90   15.90           0
     2097152      262144  bfloat16          -1    68.77   22.87   22.87           0    65.83   23.89   23.89           0
     4194304      524288  bfloat16          -1    106.4   29.56   29.56           0    99.68   31.56   31.56           0
     8388608     1048576  bfloat16          -1    170.6   36.87   36.87           0    166.6   37.76   37.76           0
    16777216     2097152  bfloat16          -1    271.0   46.43   46.43           0    261.0   48.21   48.21           0
    33554432     4194304  bfloat16          -1    442.2   56.91   56.91           0    426.0   59.08   59.08           0
    67108864     8388608  bfloat16          -1    788.5   63.83   63.83           0    758.3   66.38   66.38           0
   134217728    16777216  bfloat16          -1   1486.7   67.71   67.71           0   1421.4   70.82   70.82           0
   268435456    33554432  bfloat16          -1   2873.2   70.07   70.07           0   2754.3   73.10   73.10           0
   536870912    67108864  bfloat16          -1   5649.0   71.28   71.28           0   5404.6   74.50   74.50           0
  1073741824   134217728  bfloat16          -1    11157   72.18   72.18           0    10714   75.16   75.16           0
  2147483648   268435456  bfloat16          -1    22219   72.49   72.49           0    21333   75.50   75.50           0
  4294967296   536870912  bfloat16          -1    44273   72.76   72.76           0    42578   75.65   75.65           0
  8589934592  1073741824  bfloat16          -1    88266   72.99   72.99           0    84896   75.89   75.89           0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 26.0428
#
```

multi-node, single-process -> fail ❌
```bash
$ srun -N 2 -n 2 --gpus-per-node 4 -p gpu bash -c "module load 2023 NCCL/2.19.3-GCCcore-12.3.0-CUDA-12.1.1 OpenMPI/4.1.5-GCC-12.3.0; export OMPI_MCA_pml=ucx; export NCCL_P2P_DIRECT_DISABLE=1; ./build/hypercube_perf --nthreads 1 --ngpus 4 --datatype bfloat16 --minbytes 8 --maxbytes 8G --stepfactor 2"
[gcn63.local.snellius.surf.nl:3159061] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3159061] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3159061] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3159061] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn67.local.snellius.surf.nl:599624] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn67.local.snellius.surf.nl:599624] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn67.local.snellius.surf.nl:599624] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn67.local.snellius.surf.nl:599624] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
# nThread 1 nGpus 4 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3159061 on gcn63 device  0 [0x31] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 3159061 on gcn63 device  1 [0x32] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 3159061 on gcn63 device  2 [0xca] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 3159061 on gcn63 device  3 [0xe3] NVIDIA A100-SXM4-40GB
#  Rank  4 Group  0 Pid 599624 on gcn67 device  0 [0x31] NVIDIA A100-SXM4-40GB
#  Rank  5 Group  0 Pid 599624 on gcn67 device  1 [0x32] NVIDIA A100-SXM4-40GB
#  Rank  6 Group  0 Pid 599624 on gcn67 device  2 [0xca] NVIDIA A100-SXM4-40GB
#  Rank  7 Group  0 Pid 599624 on gcn67 device  3 [0xe3] NVIDIA A100-SXM4-40GB
#
#                                                         out-of-place                            in-place
#       size       count      type  redop root     time   algbw   busbw      #wrong     time   algbw   busbw      #wrong
#        (B)  (elements)                           (us)  (GB/s)  (GB/s)                 (us)  (GB/s)  (GB/s)
           0           0  bfloat16          -1    44.09    0.00    0.00           0    43.07    0.00    0.00           0
           0           0  bfloat16          -1    43.58    0.00    0.00           0    43.43    0.00    0.00           0
           0           0  bfloat16          -1    43.23    0.00    0.00           0    43.53    0.00    0.00           0
          64           4  bfloat16          -1    64.38    0.00    0.00          67    54.53    0.00    0.00          60
         128           8  bfloat16          -1    64.59    0.00    0.00         120    54.60    0.00    0.00         104
         256          16  bfloat16          -1    64.15    0.00    0.00         288    54.81    0.00    0.00         208
         512          32  bfloat16          -1    64.09    0.01    0.01         544    54.71    0.01    0.01         480
        1024          64  bfloat16          -1    64.31    0.01    0.01         960    54.65    0.02    0.02         832
        2048         128  bfloat16          -1    64.76    0.03    0.03        1984    55.23    0.03    0.03        1920
        4096         256  bfloat16          -1    64.88    0.06    0.06        4288    56.19    0.06    0.06        3968
        8192         512  bfloat16          -1    64.36    0.11    0.11        7808    55.15    0.13    0.13        8192
       16384        1024  bfloat16          -1    67.77    0.21    0.21       16895    57.38    0.25    0.25       15872
       32768        2048  bfloat16          -1    71.42    0.40    0.40       36320    60.34    0.48    0.48       36830
       65536        4096  bfloat16          -1    82.18    0.70    0.70       88320    71.65    0.80    0.80       90496
      131072        8192  bfloat16          -1    90.79    1.26    1.26      225706    79.77    1.44    1.44      225214
      262144       16384  bfloat16          -1    106.5    2.15    2.15      371545    97.65    2.35    2.35      371785
      524288       32768  bfloat16          -1    137.0    3.35    3.35      858982    127.8    3.59    3.59      866085
     1048576       65536  bfloat16          -1    219.1    4.19    4.19 1.59566e+06    206.0    4.45    4.45  1.5949e+06
     2097152      131072  bfloat16          -1    379.9    4.83    4.83  2.9037e+06    367.7    4.99    4.99 2.86306e+06
     4194304      262144  bfloat16          -1    601.5    6.10    6.10 4.53285e+06    589.8    6.22    6.22 4.54885e+06
     8388608      524288  bfloat16          -1   1151.3    6.38    6.38  7.5383e+06   1137.2    6.45    6.45 7.71987e+06
    16777216     1048576  bfloat16          -1   2254.9    6.51    6.51  1.3962e+07   2231.3    6.58    6.58 1.43663e+07
    33554432     2097152  bfloat16          -1   4457.8    6.59    6.59 2.67527e+07   4438.2    6.62    6.62 2.69578e+07
    67108864     4194304  bfloat16          -1   8869.6    6.62    6.62 5.16585e+07   8813.1    6.66    6.66 4.93189e+07
   134217728     8388608  bfloat16          -1    17653    6.65    6.65 9.75524e+07    17575    6.68    6.68 9.30762e+07
   268435456    16777216  bfloat16          -1    35233    6.67    6.67 1.66879e+08    35149    6.68    6.68 1.68206e+08
   536870912    33554432  bfloat16          -1    70527    6.66    6.66 3.19063e+08    70279    6.68    6.68 3.76867e+08
  1073741824    67108864  bfloat16          -1   140822    6.67    6.67  5.7185e+08   140300    6.70    6.70 6.86224e+08
  2147483648   134217728  bfloat16          -1   282321    6.66    6.66 1.12415e+09   281136    6.68    6.68 9.94191e+08
  4294967296   268435456  bfloat16          -1   563883    6.66    6.66 2.56208e+09   562339    6.68    6.68 2.18811e+09
  8589934592   536870912  bfloat16          -1  1128687    6.66    6.66 3.86578e+09  1123981    6.69    6.69 4.08524e+09
# Out of bounds values : 112 FAILED
# Avg bus bandwidth    : 3.13032
#
```

multi-node, multi-process -> success ✅
```bash
$ srun -N 2 -n 8 --gpus-per-node 4 -p gpu bash -c "module load 2023 NCCL/2.19.3-GCCcore-12.3.0-CUDA-12.1.1 OpenMPI/4.1.5-GCC-12.3.0; export OMPI_MCA_pml=ucx; export NCCL_P2P_DIRECT_DISABLE=1; ./build/hypercube_perf --nthreads 1 --ngpus 1 --datatype bfloat16 --minbytes 8 --maxbytes 8G --stepfactor 2"
[gcn67.local.snellius.surf.nl:600229] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn67.local.snellius.surf.nl:600229] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn67.local.snellius.surf.nl:600229] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn67.local.snellius.surf.nl:600225] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn67.local.snellius.surf.nl:600225] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn67.local.snellius.surf.nl:600225] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn67.local.snellius.surf.nl:600225] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn67.local.snellius.surf.nl:600228] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn67.local.snellius.surf.nl:600228] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn67.local.snellius.surf.nl:600228] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn67.local.snellius.surf.nl:600228] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn67.local.snellius.surf.nl:600229] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn67.local.snellius.surf.nl:600231] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn67.local.snellius.surf.nl:600231] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn67.local.snellius.surf.nl:600231] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn67.local.snellius.surf.nl:600231] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3159777] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3159777] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3159779] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3159779] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3159779] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3159779] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3159777] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3159777] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3159783] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3159783] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3159782] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3159782] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3159783] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3159783] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
[gcn63.local.snellius.surf.nl:3159782] PMIX ERROR: BAD-PARAM in file base/bfrop_base_unpack.c at line 692
[gcn63.local.snellius.surf.nl:3159782] PMIX ERROR: BAD-PARAM in file dstore_base.c at line 2225
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3159777 on gcn63 device  0 [0x31] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid 3159783 on gcn63 device  1 [0x32] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid 3159779 on gcn63 device  2 [0xca] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid 3159782 on gcn63 device  3 [0xe3] NVIDIA A100-SXM4-40GB
#  Rank  4 Group  0 Pid 600231 on gcn67 device  0 [0x31] NVIDIA A100-SXM4-40GB
#  Rank  5 Group  0 Pid 600228 on gcn67 device  1 [0x32] NVIDIA A100-SXM4-40GB
#  Rank  6 Group  0 Pid 600229 on gcn67 device  2 [0xca] NVIDIA A100-SXM4-40GB
#  Rank  7 Group  0 Pid 600225 on gcn67 device  3 [0xe3] NVIDIA A100-SXM4-40GB
#
#                                                         out-of-place                            in-place
#       size       count      type  redop root     time   algbw   busbw      #wrong     time   algbw   busbw      #wrong
#        (B)  (elements)                           (us)  (GB/s)  (GB/s)                 (us)  (GB/s)  (GB/s)
           0           0  bfloat16          -1    98.95    0.00    0.00           0    98.72    0.00    0.00           0
           0           0  bfloat16          -1    98.91    0.00    0.00           0    98.87    0.00    0.00           0
           0           0  bfloat16          -1    98.94    0.00    0.00           0    98.84    0.00    0.00           0
          64           4  bfloat16          -1    128.5    0.00    0.00           0    117.5    0.00    0.00           0
         128           8  bfloat16          -1    126.9    0.00    0.00           0    117.6    0.00    0.00           0
         256          16  bfloat16          -1    128.5    0.00    0.00           0    118.4    0.00    0.00           0
         512          32  bfloat16          -1    126.9    0.00    0.00           0    117.9    0.00    0.00           0
        1024          64  bfloat16          -1    128.0    0.01    0.01           0    117.8    0.01    0.01           0
        2048         128  bfloat16          -1    126.2    0.01    0.01           0    118.5    0.02    0.02           0
        4096         256  bfloat16          -1    128.9    0.03    0.03           0    119.6    0.03    0.03           0
        8192         512  bfloat16          -1    129.0    0.06    0.06           0    121.0    0.06    0.06           0
       16384        1024  bfloat16          -1    135.4    0.11    0.11           0    127.4    0.11    0.11           0
       32768        2048  bfloat16          -1    147.7    0.19    0.19           0    141.0    0.20    0.20           0
       65536        4096  bfloat16          -1    169.4    0.34    0.34           0    160.7    0.36    0.36           0
      131072        8192  bfloat16          -1    190.5    0.60    0.60           0    183.2    0.63    0.63           0
      262144       16384  bfloat16          -1    247.2    0.93    0.93           0    235.8    0.97    0.97           0
      524288       32768  bfloat16          -1    314.0    1.46    1.46           0    303.0    1.51    1.51           0
     1048576       65536  bfloat16          -1    444.4    2.06    2.06           0    431.0    2.13    2.13           0
     2097152      131072  bfloat16          -1    692.2    2.65    2.65           0    679.2    2.70    2.70           0
     4194304      262144  bfloat16          -1   1099.5    3.34    3.34           0   1085.2    3.38    3.38           0
     8388608      524288  bfloat16          -1   2056.4    3.57    3.57           0   2057.9    3.57    3.57           0
    16777216     1048576  bfloat16          -1   3969.4    3.70    3.70           0   3944.9    3.72    3.72           0
    33554432     2097152  bfloat16          -1   7766.5    3.78    3.78           0   7732.5    3.80    3.80           0
    67108864     4194304  bfloat16          -1    15350    3.83    3.83           0    15329    3.83    3.83           0
   134217728     8388608  bfloat16          -1    30589    3.84    3.84           0    30504    3.85    3.85           0
   268435456    16777216  bfloat16          -1    61093    3.84    3.84           0    60902    3.86    3.86           0
   536870912    33554432  bfloat16          -1   122032    3.85    3.85           0   121729    3.86    3.86           0
  1073741824    67108864  bfloat16          -1   244137    3.85    3.85           0   243680    3.86    3.86           0
  2147483648   134217728  bfloat16          -1   488686    3.85    3.85           0   487916    3.85    3.85           0
  4294967296   268435456  bfloat16          -1   977665    3.84    3.84           0   975742    3.85    3.85           0
  8589934592   536870912  bfloat16          -1  1955272    3.84    3.84           0  1953018    3.85    3.85           0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.73529
#
```

sjeaugey commented 7 months ago

I believe this is a known issue with that test. There could be others (for example, the test not working correctly when the number of ranks is not a power of 2). I wouldn't use this test to validate NCCL operation.
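
To illustrate the power-of-2 caveat only, here is a minimal Python sketch of the textbook hypercube exchange, where each rank pairs with `rank XOR mask` at every step. It is an assumption about how a hypercube pattern is usually written, not code taken from `hypercube_perf`, and the helper names `hypercube_peers` and `invalid_pairs` are made up for this example. It does not explain the single-process failures reported above, since those runs use 4 and 8 ranks, both powers of 2.

```python
def hypercube_peers(rank: int, n_ranks: int) -> list[int]:
    """Peers a rank would exchange with, one per step, under XOR pairing.

    Assumption: step k pairs `rank` with `rank ^ (1 << k)` while
    (1 << k) < n_ranks, as in a textbook hypercube exchange.
    """
    peers = []
    mask = 1
    while mask < n_ranks:
        peers.append(rank ^ mask)
        mask <<= 1
    return peers


def invalid_pairs(n_ranks: int) -> list[tuple[int, int]]:
    """(rank, peer) pairs whose peer index falls outside 0..n_ranks-1."""
    return [
        (rank, peer)
        for rank in range(n_ranks)
        for peer in hypercube_peers(rank, n_ranks)
        if peer >= n_ranks
    ]


if __name__ == "__main__":
    # 4 and 8 match the rank counts in the runs above; 6 is a
    # hypothetical non-power-of-2 case.
    for n in (4, 6, 8):
        print(n, invalid_pairs(n))
```

With 4 or 8 ranks every computed peer exists, while with 6 ranks some ranks are paired with peers 6 and 7, which do not exist. That is one way a hypercube-style test could misbehave on a rank count that is not a power of 2.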