While testing at the upper boundary of the memory limit, we noticed what I suppose is a bug in nccl-tests.
Memory allocations
From what I see, the test initializes in two steps:
allocate buffers (in advance) using the "max bytes" size
call ncclCommInitAll(), and start working on the test
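As a rough sketch of that ordering (paraphrased, not the actual common.cu code; the names and lack of error handling are mine), the sequence looks roughly like this:

#include <cuda_runtime.h>
#include <nccl.h>

// Rough sketch, not the real common.cu: per-rank test buffers sized from
// maxBytes are allocated first, so ncclCommInitAll() later has to fit NCCL's
// own internal buffers into whatever device memory is left.
void setupSketch(size_t maxBytes, int nGpus, const int* devs, ncclComm_t* comms) {
  for (int i = 0; i < nGpus; i++) {
    cudaSetDevice(devs[i]);
    void *sendbuff, *recvbuff, *expected;
    cudaMalloc(&sendbuff, maxBytes);   // step 1: send buffer
    cudaMalloc(&recvbuff, maxBytes);   //         recv buffer
    cudaMalloc(&expected, maxBytes);   //         datacheck copy (default -c 1)
  }
  ncclCommInitAll(comms, nGpus, devs); // step 2: NCCL allocates its own buffers
}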
This "max bytes" is determined with a couple of factors:
collective operation we're using (each operation has its own way to calculate memory footprint: AllReduceGetCollByteCount(), ReduceScatterGetCollByteCount(), ...)
total device memory (cudaDeviceProp::totalGlobalMem)
CLI argument
Specifically, the test caps the max bytes at (total device memory - 1 GiB) / 3
(assuming default settings where datacheck is enabled, -c 1):
https://github.com/NVIDIA/nccl-tests/blob/c6afef0b6f76ffc55d4172d971be6cf5a08a73a4/src/common.cu#L915
This results in an exact limit of 27971332778 bytes (about 26.05 GiB) on my machine.
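Put as code, my reading of that cap (paraphrasing the linked common.cu logic; the variable names are mine, and only the datacheck path is shown):

#include <cuda_runtime.h>

// Paraphrased cap logic, assuming datacheck is enabled (-c 1).
size_t capMaxBytes(size_t requestedMaxBytes, int device) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);
  size_t maxMem = prop.totalGlobalMem;
  // Leave 1 GiB of headroom, then split the rest across the three
  // maxBytes-sized buffers (send, recv, expected).
  size_t cap = (maxMem - (1ULL << 30)) / 3;
  return requestedMaxBytes < cap ? requestedMaxBytes : cap;
}
// On my A100-80GB this reproduces the 27971332778-byte limit quoted above.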
Problem
However, we're seeing errors with high max_bytes (-e) values.
For example, on this A100 with 80 GiB of memory, passing -e 27766464617 (about 25.86 GiB) crashes the all_reduce_perf test during the call to ncclCommInitAll().
The breaking point is somewhere around there; -e 27066464617 (about 25.21 GiB) doesn't crash.
(Note that all_reduce_perf has one of the highest GPU memory footprints (link: all_reduce.cu). Some other tests divide the payload across ranks, so the same parameter works fine for ./reduce_scatter_perf, for example.)
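To make that footprint difference concrete, here is my simplified reading of the per-operation byte-count helpers (a paraphrase; the real *GetCollByteCount() functions have different signatures and compute more counts):

#include <cstddef>

// Simplified paraphrase, not the actual nccl-tests code: all_reduce keeps the
// full payload on every rank, while reduce_scatter only receives (and checks)
// a 1/nranks slice per rank, so its per-rank footprint is much smaller.
void allReduceCounts(size_t count, int /*nranks*/, size_t* sendcount, size_t* recvcount) {
  *sendcount = count;
  *recvcount = count;
}
void reduceScatterCounts(size_t count, int nranks, size_t* sendcount, size_t* recvcount) {
  *recvcount = count / nranks;            // the result is split across ranks
  *sendcount = (count / nranks) * nranks; // send side still holds the full payload
}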
NCCL trace shows it's an OOM:
$ NCCL_DEBUG=WARN NCCL_DEBUG_SUBSYS=ALL ./all_reduce_perf -b 27766464617 -e 27766464617 -w 0 -n 1 -t 4
# nThread 4 nGpus 1 minBytes 27766464617 maxBytes 27766464617 step: 1048576(bytes) warmup iters: 0 iters: 1 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1206071 on <node name> device 0 [0x00] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 1206071 on <node name> device 1 [0x00] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 1206071 on <node name> device 2 [0x00] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 1206071 on <node name> device 3 [0x00] NVIDIA A100-SXM4-80GB
NCCL version 2.21.5+cuda12.5
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
27766464616 6941616154 float sum -1
<node name>:1206071:1206148 [1] enqueue.cc:1402 NCCL WARN Cuda failure 'out of memory'
<node name>: Test NCCL failure all_reduce.cu:44 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. <node name> pid 1206071: Test failure common.cu:377
.. <node name> pid 1206071: Test failure common.cu:413
.. <node name> pid 1206071: Test failure common.cu:603
.. <node name> pid 1206071: Test failure all_reduce.cu:90
.. <node name> pid 1206071: Test failure common.cu:615
<node name>:1206071:1206147 [2] enqueue.cc:1402 NCCL WARN Cuda failure 'out of memory'
<node name>: Test NCCL failure all_reduce.cu:44 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. <node name> pid 1206071: Test failure common.cu:377
.. <node name> pid 1206071: Test failure common.cu:413
.. <node name> pid 1206071: Test failure common.cu:603
.. <node name> pid 1206071: Test failure all_reduce.cu:90
.. <node name> pid 1206071: Test failure common.cu:615
With a smaller NCCL_BUFFSIZE such as 65536 (64 KiB, instead of the default 4194304 = 4 MiB), the test actually passes.
I am not sure whether NCCL is indeed loading as many as 256 buffers into memory,
but with alignment, fragmentation, and whatnot, I wouldn't be surprised.
Maybe NVSwitch is in play as well: more links, more buffers.
\ 256 = 1 GiB headroom / 4 MiB default buffsize
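The back-of-the-envelope arithmetic behind that 256 figure (my own reasoning, not anything from the NCCL or nccl-tests sources):

#include <cstdio>

int main() {
  const unsigned long long headroom    = 1ULL << 30;  // the 1 GiB the cap leaves free
  const unsigned long long defaultBuff = 4ULL << 20;  // default NCCL_BUFFSIZE, 4 MiB
  const unsigned long long smallBuff   = 64ULL << 10; // the 65536-byte override
  std::printf("4 MiB buffers that fit in the headroom:  %llu\n", headroom / defaultBuff); // 256
  std::printf("64 KiB buffers that fit in the headroom: %llu\n", headroom / smallBuff);   // 16384
  return 0;
}

The workaround itself was nothing more than setting NCCL_BUFFSIZE=65536 in the environment before running the same all_reduce_perf command shown above.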
Changing that line, (maxMem - (1<<30)), from 1<<30 to 1ull<<31 immediately solved the problem,
and it works in every case, however absurd max_bytes (-e) gets.
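Concretely, the patched computation would look something like this (again paraphrasing the linked line; the surrounding variable names may differ in the actual source):

// before: 1 GiB headroom, and 1<<30 is a plain 32-bit int literal
//   size_t memMaxBytes = (maxMem - (1<<30)) / 3;
// after: 2 GiB headroom, written as an unsigned 64-bit literal
size_t memMaxBytes = (maxMem - (1ull << 31)) / 3;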
I think we either need a larger default headroom (possibly scaling with per-GPU memory size, host RAM, or GPU count), need to make it parameterizable, or need to warn the user about this.
Environment
This is on a node with 8 A100-SXM4-80GB GPUs, connected to 6 NVSwitches. I'm not familiar with its topology in detail, but I can tell:
nvidia-smi nvlink -s
/var/log/fabricmanager.log
NV12 in nvidia-smi topo -m
NCCL version: 2.21.5
nccl-tests version: v2.13.9 (latest)