NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

AlltoAllGetBw is incorrect when used with multiple nodes #181

Open sukoncon opened 9 months ago

sukoncon commented 9 months ago

The AlltoAllGetBw function in nccl-tests/src/alltoall.cu is not correct:

    void AlltoAllGetBw(size_t count, int typesize, double sec, double* algBw, double* busBw, int nranks) {
      double baseBw = (double)(count * nranks * typesize) / 1.0E9 / sec;

      *algBw = baseBw;
      double factor = ((double)(nranks-1))/((double)(nranks));
      *busBw = baseBw * factor;
    }

When I measure busBw with 2 nodes, each with 8 GPUs, the calculated busBw is only 3.92 GB/s, while the actual busBw is around 12.5 GB/s.
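To make the arithmetic in the quoted function easier to check by hand, here is a minimal standalone sketch of the same calculation. The struct `AlltoAllBw` and the function name `alltoallBw` are illustrative and not part of nccl-tests; the sketch returns the two values instead of writing through pointers, and the input checks are an addition.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative container; not a type from nccl-tests.
struct AlltoAllBw {
  double algBw;  // GB/s
  double busBw;  // GB/s
};

// Same arithmetic as the AlltoAllGetBw function quoted in the issue,
// returning the two values instead of writing through pointers.
AlltoAllBw alltoallBw(std::size_t count, int typesize, double sec, int nranks) {
  assert(nranks > 0 && sec > 0.0);  // sanity checks added for this sketch
  // count elements for each of nranks ranks, converted to GB/s.
  double baseBw = static_cast<double>(count) * nranks * typesize / 1.0e9 / sec;
  // Same (nranks - 1) / nranks factor as in the quoted code.
  double factor = static_cast<double>(nranks - 1) / static_cast<double>(nranks);
  return AlltoAllBw{baseBw, baseBw * factor};
}
```

With 16 ranks the factor is 15/16 = 0.9375, so a reported busBw of 3.92 GB/s corresponds to an algBw of about 4.18 GB/s. The factor depends only on the total rank count, not on how the ranks are spread across nodes.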

I believe the correct formula for calculating busBw for multiple nodes is [(data in a node) * (number of nodes - 1)] / (number of nodes) / time.
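The proposed formula can be written out as a small function. This is only a sketch of the proposal as stated: `proposedMultiNodeBusBw` is a hypothetical name, and because the issue does not say how "data in a node" is counted, it is taken as an input (`nodeBytes`) rather than derived from `count`, `typesize`, and `nranks`.

```cpp
#include <cassert>

// Hypothetical helper sketching the formula proposed in the issue.
// nodeBytes stands for "data in a node"; how it is counted is not defined
// in the issue, so the caller supplies it.
double proposedMultiNodeBusBw(double nodeBytes, int nnodes, double sec) {
  assert(nnodes > 0 && sec > 0.0);  // sanity checks added for this sketch
  // Fraction (number of nodes - 1) / (number of nodes) from the proposal.
  double factor = static_cast<double>(nnodes - 1) / static_cast<double>(nnodes);
  // Convert bytes per second to GB/s, matching the units of busBw.
  return nodeBytes * factor / sec / 1.0e9;
}
```

With 2 nodes the factor is 1/2, so under this proposal half of whatever is counted as a node's data is assumed to cross the inter-node link.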

sjeaugey commented 9 months ago

The NCCL tests do not pretend to detect the topology and compute the real BusBw of each component (NVLink, PCI, network, etc.). The notion of BusBw is only there to reflect that collective operations, when implemented with point-to-point send/recv, do not necessarily need to transfer exactly the size but may need less or more. On heterogeneous architectures the BusBw may need some interpretation, or may not even make sense. It does make sense in homogeneous cases though.