what does error in nccl-test output represent?

blackgold commented 10 months ago

I ran the NCCL all-reduce test with and without SHARP, using --check=1, to see whether SHARP introduces any errors. I was expecting a count of the number of times the expected value of the reduction op doesn't match the actual value. However, for the largest sizes I see 1e-06 (SHARP) vs. 2e-06 (no SHARP). How should I interpret the error column?

128-GPU all-reduce nccl-test without SHARP

#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        4096          1024     float     sum    93.73    0.04    0.09  1e-06    93.15    0.04    0.09  1e-06
        8192          2048     float     sum    96.51    0.08    0.17  1e-06    95.46    0.09    0.17  1e-06
       16384          4096     float     sum    96.70    0.17    0.34  1e-06    94.98    0.17    0.34  1e-06
       32768          8192     float     sum    97.72    0.34    0.67  1e-06    96.29    0.34    0.68  1e-06
       65536         16384     float     sum    99.81    0.66    1.30  1e-06    95.90    0.68    1.36  1e-06
      131072         32768     float     sum    120.9    1.08    2.15  1e-06    117.4    1.12    2.22  1e-06
      262144         65536     float     sum    136.0    1.93    3.82  1e-06    136.6    1.92    3.81  1e-06
      524288        131072     float     sum    152.3    3.44    6.83  1e-06    150.0    3.50    6.94  1e-06
     1048576        262144     float     sum    170.6    6.15   12.20  1e-06    170.9    6.14   12.18  1e-06
     2097152        524288     float     sum    211.5    9.91   19.67  1e-06    211.1    9.93   19.71  1e-06
     4194304       1048576     float     sum    284.8   14.73   29.22  1e-06    284.9   14.72   29.22  1e-06
     8388608       2097152     float     sum    337.3   24.87   49.35  1e-06    334.6   25.07   49.75  1e-06
    16777216       4194304     float     sum    451.2   37.18   73.78  1e-06    538.0   31.19   61.89  1e-06
    33554432       8388608     float     sum    969.1   34.62   68.71  1e-06    947.2   35.43   70.30  1e-06
    67108864      16777216     float     sum   1274.9   52.64  104.46  1e-06   1265.1   53.04  105.26  1e-06
   134217728      33554432     float     sum   2184.6   61.44  121.92  1e-06   2169.9   61.85  122.74  1e-06
   268435456      67108864     float     sum   3942.1   68.09  135.12  2e-06   4079.6   65.80  130.57  2e-06
   536870912     134217728     float     sum   7421.8   72.34  143.54  2e-06   7162.8   74.95  148.73  2e-06
  1073741824     268435456     float     sum    11704   91.74  182.06  2e-06    11721   91.61  181.79  2e-06
  2147483648     536870912     float     sum    22447   95.67  189.85  2e-06    22571   95.15  188.80  2e-06
  4294967296    1073741824     float     sum    44992   95.46  189.43  2e-06    44323   96.90  192.29  2e-06
  8589934592    2147483648     float     sum    88642   96.91  192.30  2e-06    89056   96.46  191.40  2e-06

128-GPU all-reduce nccl-test with SHARP

#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        4096          1024     float     sum    72.01    0.06    0.11  1e-06    73.24    0.06    0.11  1e-06
        8192          2048     float     sum    74.97    0.11    0.22  1e-06    75.58    0.11    0.22  1e-06
       16384          4096     float     sum    146.7    0.11    0.22  1e-06    81.65    0.20    0.40  1e-06
       32768          8192     float     sum    89.25    0.37    0.73  1e-06    88.99    0.37    0.73  1e-06
       65536         16384     float     sum    98.76    0.66    1.32  1e-06    162.5    0.40    0.80  1e-06
      131072         32768     float     sum    119.3    1.10    2.18  1e-06    119.1    1.10    2.18  1e-06
      262144         65536     float     sum    146.9    1.78    3.54  1e-06    148.3    1.77    3.51  1e-06
      524288        131072     float     sum    149.1    3.52    6.98  1e-06    148.5    3.53    7.00  1e-06
     1048576        262144     float     sum    157.6    6.65   13.21  1e-06    156.4    6.71   13.31  1e-06
     2097152        524288     float     sum    198.0   10.59   21.02  1e-06    199.6   10.51   20.85  1e-06
     4194304       1048576     float     sum    215.8   19.44   38.57  1e-06    216.3   19.39   38.48  1e-06
     8388608       2097152     float     sum    268.0   31.30   62.11  1e-06    303.1   27.68   54.92  1e-06
    16777216       4194304     float     sum    323.0   51.95  103.08  1e-06    324.9   51.64  102.47  1e-06
    33554432       8388608     float     sum    431.8   77.71  154.20  1e-06    424.8   79.00  156.76  1e-06
    67108864      16777216     float     sum    729.2   92.03  182.62  1e-06    726.6   92.36  183.27  1e-06
   134217728      33554432     float     sum   1348.0   99.57  197.58  1e-06   1347.5   99.60  197.65  1e-06
   268435456      67108864     float     sum   2608.1  102.93  204.24  1e-06   2609.7  102.86  204.11  1e-06
   536870912     134217728     float     sum   5643.6   95.13  188.77  1e-06   4940.9  108.66  215.62  1e-06
  1073741824     268435456     float     sum   9648.3  111.29  220.84  1e-06   9637.9  111.41  221.07  1e-06
  2147483648     536870912     float     sum    19199  111.86  221.97  1e-06    19128  112.27  222.79  1e-06
  4294967296    1073741824     float     sum    37442  114.71  227.63  1e-06    37673  114.01  226.23  1e-06
  8589934592    2147483648     float     sum    73726  116.51  231.20  1e-06    73628  116.67  231.51  1e-06

sjeaugey commented 10 months ago

The default algorithm for large sizes is probably the ring algorithm, which can result in significant rounding errors when we add the last elements to the running total.

SHARP, in comparison, is more like a tree algorithm, which sums values of (more) equal weights, causing less deviation due to floating-point rounding errors.
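To make the rounding argument concrete, here is a minimal toy sketch in Python. It is not NCCL's actual reduction order: the helper names (`f32`, `sequential_sum`, `pairwise_sum`) and the input values are assumptions chosen so the effect is easy to verify by hand, and the effect is deliberately exaggerated compared with the 1e-06 vs. 2e-06 seen above. It emulates float32 addition and compares adding every value into one running total (ring-like) with summing halves first and then combining partial sums of similar size (tree-like).

```python
import struct
from math import fsum


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sequential_sum(values):
    """Add values one at a time into a single float32 running total."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total


def pairwise_sum(values):
    """Sum each half separately, then add the two partial sums (float32)."""
    if len(values) == 1:
        return f32(values[0])
    mid = len(values) // 2
    return f32(pairwise_sum(values[:mid]) + pairwise_sum(values[mid:]))


# Illustrative input: one large value followed by 127 small ones.
# At 2**24 the float32 spacing is 2, so adding 1.0 to it is rounded away.
values = [2.0 ** 24] + [1.0] * 127

exact = fsum(values)            # 16777343.0, exact in float64
seq = sequential_sum(values)    # 16777216.0: every 1.0 is absorbed
pair = pairwise_sum(values)     # 16777342.0: only the first 1.0 is lost

seq_err = abs(seq - exact)      # 127.0
pair_err = abs(pair - exact)    # 1.0
print(f"sequential: {seq} (abs error {seq_err})")
print(f"pairwise:   {pair} (abs error {pair_err})")
```

In the sequential case the running total is always much larger than the next addend, so each small contribution is rounded away. In the pairwise case the small values are first added to each other, where the additions are exact, and only then combined with the large one.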

blackgold commented 10 months ago

What is the minimum value that would indicate a H/W or S/W error is involved?

sjeaugey commented 10 months ago

The value depends on many things, and basically you should rely on whether the test says "Errors : 0 OK" at the end or not.

In recent NCCL perf tests, the error margin has been replaced by an error counter, so this is no longer relevant.
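The difference between an error margin and an error counter can be sketched as follows. This is a hypothetical illustration only, not the nccl-tests implementation: the function names, the relative-deviation formula, and the tolerance value are all assumptions.

```python
def max_relative_deviation(expected, actual):
    """Error margin: the largest relative deviation over all elements.

    Assumes every expected value is non-zero.
    """
    return max(abs(a - e) / abs(e) for e, a in zip(expected, actual))


def count_mismatches(expected, actual, tolerance):
    """Error counter: how many elements deviate by more than the tolerance."""
    return sum(
        1 for e, a in zip(expected, actual)
        if abs(a - e) > tolerance * abs(e)
    )


# Illustrative data: the last element is wrong.
expected = [1.0, 2.0, 4.0]
actual = [1.0, 2.0, 4.5]

margin = max_relative_deviation(expected, actual)            # 0.125
wrong = count_mismatches(expected, actual, tolerance=1e-5)   # 1
print(f"margin={margin}, mismatches={wrong}")
```

A margin has to be judged against what rounding alone can produce, which depends on the data type, the operation, and the number of ranks. A counter already has that judgment built in through its tolerance, so any non-zero count signals a problem.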

NVIDIA / nccl-tests, issue #176