NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

How is the maximum number of bytes for all_reduce operation calculated? #198

Closed jxh314 closed 3 months ago

jxh314 commented 5 months ago

When testing in a single-machine dual-GPU environment, I received the following message: "# Reducing maxBytes to 12156305408 due to memory limitation." Approximately 11.32GB.

The GPUs used here are RTX 4090, and the output from nvidia-smi is as follows:

[screenshot: nvidia-smi output]

Each GPU has 23.988 GB of memory, so how is the maximum byte count for the reduction operation calculated? If it were simply the total memory divided by 2, it would be 11.99 GB, yet the reported maximum is slightly smaller at 11.32 GB. Moreover, in a dual-machine, four-GPU environment the reported maximum is also 11.32 GB. Both machines have identical configurations, so I am quite confused about this. I would be very grateful for any help.

AddyLaddy commented 5 months ago

https://github.com/NVIDIA/nccl-tests/blob/c6afef0b6f76ffc55d4172d971be6cf5a08a73a4/src/common.cu#L914

You can set -c 0 to use larger buffers for the BW testing.

jxh314 commented 4 months ago

https://github.com/NVIDIA/nccl-tests/blob/c6afef0b6f76ffc55d4172d971be6cf5a08a73a4/src/common.cu#L914

You can set -c 0 to use larger buffers for the BW testing.

Thanks a lot! May I ask why 1GB is deducted, or does this 1GB serve some other purpose? @AddyLaddy

AddyLaddy commented 4 months ago

I think I just looked at empirical CUDA memory usage when running the NCCL tests on HGX/DGX systems, and probably rounded up to 1GiB to try and avoid OOM errors if we attempt to allocate too much. Maybe on systems with fewer GPUs and no NVLink it will use less? Also, compiling arch-specific code will reduce the NCCL kernel sizes. You could also run nvidia-smi to watch typical memory usage during nccl-tests with small buffer sizes to get a better estimate for your system.