While testing at the upper boundary of the memory limit, we noticed what I suppose is a bug in nccl-tests.
Memory allocations
From what I see, the test initializes in two steps:
allocate buffers (in advance) using the "max bytes" size
call ncclCommInitAll(), and start working on the test
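As a rough sketch of that ordering (paraphrased, not the actual common.cu code; the names and lack of error handling are mine), the sequence looks roughly like this:

#include <cuda_runtime.h>
#include <nccl.h>

// Rough sketch, not the real common.cu: per-rank test buffers sized from
// maxBytes are allocated first, so ncclCommInitAll() later has to fit NCCL's
// own internal buffers into whatever device memory is left.
void setupSketch(size_t maxBytes, int nGpus, const int* devs, ncclComm_t* comms) {
  for (int i = 0; i < nGpus; i++) {
    cudaSetDevice(devs[i]);
    void *sendbuff, *recvbuff, *expected;
    cudaMalloc(&sendbuff, maxBytes);   // step 1: send buffer
    cudaMalloc(&recvbuff, maxBytes);   //         recv buffer
    cudaMalloc(&expected, maxBytes);   //         datacheck copy (default -c 1)
  }
  ncclCommInitAll(comms, nGpus, devs); // step 2: NCCL allocates its own buffers
}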
This "max bytes" is determined with a couple of factors:
collective operation we're using (each operation has its own way to calculate memory footprint: AllReduceGetCollByteCount(), ReduceScatterGetCollByteCount(), ...)
total device memory (cudaDeviceProp::totalGlobalMem)
CLI argument
Specifically, the test caps the max bytes at (total device memory - 1 GiB) / 3
(assuming default settings where datacheck is enabled, -c 1):
https://github.com/NVIDIA/nccl-tests/blob/c6afef0b6f76ffc55d4172d971be6cf5a08a73a4/src/common.cu#L915
This results in an exact limit of 27971332778 bytes (about 26.05 GiB) on my machine.
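Put as code, my reading of that cap (paraphrasing the linked common.cu logic; the variable names are mine, and only the datacheck path is shown):

#include <cuda_runtime.h>

// Paraphrased cap logic, assuming datacheck is enabled (-c 1).
size_t capMaxBytes(size_t requestedMaxBytes, int device) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);
  size_t maxMem = prop.totalGlobalMem;
  // Leave 1 GiB of headroom, then split the rest across the three
  // maxBytes-sized buffers (send, recv, expected).
  size_t cap = (maxMem - (1ULL << 30)) / 3;
  return requestedMaxBytes < cap ? requestedMaxBytes : cap;
}
// On my A100-80GB this reproduces the 27971332778-byte limit quoted above.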
Problem
However, we're seeing errors with high max_bytes (-e) values.
For example, on this A100 with 80 GiB of memory, passing -e 27766464617 (about 25.86 GiB) crashes the all_reduce_perf test during the call to ncclCommInitAll().
The breaking point is somewhere around there; -e 27066464617 (about 25.21 GiB) doesn't crash.
(Note that all_reduce_perf has one of the highest GPU memory footprints (link: all_reduce.cu). Some other tests divide the payload across ranks, so the same parameter works fine for ./reduce_scatter_perf, for example.)
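To make that footprint difference concrete, here is my simplified reading of the per-operation byte-count helpers (a paraphrase; the real *GetCollByteCount() functions have different signatures and compute more counts):

#include <cstddef>

// Simplified paraphrase, not the actual nccl-tests code: all_reduce keeps the
// full payload on every rank, while reduce_scatter only receives (and checks)
// a 1/nranks slice per rank, so its per-rank footprint is much smaller.
void allReduceCounts(size_t count, int /*nranks*/, size_t* sendcount, size_t* recvcount) {
  *sendcount = count;
  *recvcount = count;
}
void reduceScatterCounts(size_t count, int nranks, size_t* sendcount, size_t* recvcount) {
  *recvcount = count / nranks;            // the result is split across ranks
  *sendcount = (count / nranks) * nranks; // send side still holds the full payload
}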
NCCL trace shows it's an OOM:
$ NCCL_DEBUG=WARN NCCL_DEBUG_SUBSYS=ALL ./all_reduce_perf -b 27766464617 -e 27766464617 -w 0 -n 1 -t 4
# nThread 4 nGpus 1 minBytes 27766464617 maxBytes 27766464617 step: 1048576(bytes) warmup iters: 0 iters: 1 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1206071 on <node name> device 0 [0x00] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 1206071 on <node name> device 1 [0x00] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 1206071 on <node name> device 2 [0x00] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 1206071 on <node name> device 3 [0x00] NVIDIA A100-SXM4-80GB
NCCL version 2.21.5+cuda12.5
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
27766464616 6941616154 float sum -1
<node name>:1206071:1206148 [1] enqueue.cc:1402 NCCL WARN Cuda failure 'out of memory'
<node name>: Test NCCL failure all_reduce.cu:44 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. <node name> pid 1206071: Test failure common.cu:377
.. <node name> pid 1206071: Test failure common.cu:413
.. <node name> pid 1206071: Test failure common.cu:603
.. <node name> pid 1206071: Test failure all_reduce.cu:90
.. <node name> pid 1206071: Test failure common.cu:615
<node name>:1206071:1206147 [2] enqueue.cc:1402 NCCL WARN Cuda failure 'out of memory'
<node name>: Test NCCL failure all_reduce.cu:44 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. <node name> pid 1206071: Test failure common.cu:377
.. <node name> pid 1206071: Test failure common.cu:413
.. <node name> pid 1206071: Test failure common.cu:603
.. <node name> pid 1206071: Test failure all_reduce.cu:90
.. <node name> pid 1206071: Test failure common.cu:615
With a smaller NCCL_BUFFSIZE such as 65536 (64 KiB, instead of the default 4194304 = 4 MiB), the test actually passes.
I am not sure whether NCCL is indeed loading as many as 256 buffers into memory,
but with alignment, fragmentation, and whatnot, I wouldn't be surprised.
Maybe NVSwitch is in play as well: more links, more buffers.
\ 256 = 1 GiB headroom / 4 MiB default buffsize
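The back-of-the-envelope arithmetic behind that 256 figure (my own reasoning, not anything from the NCCL or nccl-tests sources):

#include <cstdio>

int main() {
  const unsigned long long headroom    = 1ULL << 30;  // the 1 GiB the cap leaves free
  const unsigned long long defaultBuff = 4ULL << 20;  // default NCCL_BUFFSIZE, 4 MiB
  const unsigned long long smallBuff   = 64ULL << 10; // the 65536-byte override
  std::printf("4 MiB buffers that fit in the headroom:  %llu\n", headroom / defaultBuff); // 256
  std::printf("64 KiB buffers that fit in the headroom: %llu\n", headroom / smallBuff);   // 16384
  return 0;
}

The workaround itself was nothing more than setting NCCL_BUFFSIZE=65536 in the environment before running the same all_reduce_perf command shown above.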
Changing that line, (maxMem - (1<<30)), from 1<<30 to 1ull<<31 immediately solved the problem,
and it works in every case, however absurd max_bytes (-e) gets.
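Concretely, the patched computation would look something like this (again paraphrasing the linked line; the surrounding variable names may differ in the actual source):

// before: 1 GiB headroom, and 1<<30 is a plain 32-bit int literal
//   size_t memMaxBytes = (maxMem - (1<<30)) / 3;
// after: 2 GiB headroom, written as an unsigned 64-bit literal
size_t memMaxBytes = (maxMem - (1ull << 31)) / 3;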
I think we either need a larger default headroom (possibly scaling with per-GPU memory size, host RAM, or GPU count), need to make it parameterizable, or need to warn the user about this.
Environment
This is on a node with 8 A100-SXM4-80GB GPUs, connected to 6 NVSwitches. I'm not familiar with its topology in detail, but I can tell:
nvidia-smi nvlink -s
/var/log/fabricmanager.log
NV12 in nvidia-smi topo -m
NCCL version: 2.21.5
nccl-tests version: v2.13.9 (latest)