The file src/graph/tuning.cc contains a model that estimates how long GPU collectives run through NCCL should take. To compute this, "ncclTopoTuneModel" requires hardware and base latencies (the hwLat and baseLat tables), which are given in units of microseconds earlier in the same file.
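To be explicit about which tables I mean, this is the shape I understand them to have. The constant names and the zero entries below are placeholders of mine, not the actual declarations or values from tuning.cc (which have also changed across releases):

```cpp
// Shape of the tables as I understand them -- a sketch only. The constant
// names and the zero entries are my placeholders; the real declarations
// (and values) are in src/graph/tuning.cc.
constexpr int kNumAlgorithms = 6;  // Tree, Ring, CollNetDirect, CollNetChain, NVLS, NVLSTree (recent versions)
constexpr int kNumProtocols  = 3;  // LL, LL128, Simple
constexpr int kNumHwLinks    = 3;  // NVLink, PCIe, Network

// baseLat[algorithm][protocol]: fixed per-collective overhead, in microseconds.
float baseLat[kNumAlgorithms][kNumProtocols] = {};

// hwLat[link][algorithm][protocol]: per-hop latency over one link of the given
// type, in microseconds. Note that the first index is the link type, not the
// GPU architecture.
float hwLat[kNumHwLinks][kNumAlgorithms][kNumProtocols] = {};
```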
I had a few questions about these numbers.

There's no architecture dependence for the hwLat figures, in contrast to the bandwidth numbers that follow them in the file. Is this because the latencies do not depend on the GPU architecture (i.e. Volta/Ampere/Hopper)?
If the latency numbers do not depend on architecture, what are they based on instead? My guess is that the network latencies, at least, are tied to what the InfiniBand hardware in use can support, but I'm not sure that this is actually the bottleneck. If possible, I'd like a better understanding of the lower-level steps involved in executing a GPU collective like all-reduce with NCCL, and which individual steps add up to produce the baseLat and hwLat numbers above.
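For context, here is the rough decomposition I currently have in mind. It is my own simplification written for this question, not the actual logic of ncclTopoTuneModel, and the step count and the way I apply the per-hop latency are assumptions on my part:

```cpp
// My own simplification of how the tables might combine for a small ring
// all-reduce -- not the actual ncclTopoTuneModel logic.
//
// A ring all-reduce over nRanks GPUs takes 2*(nRanks-1) steps: (nRanks-1) for
// the reduce-scatter phase and (nRanks-1) for the all-gather phase. My working
// assumption is that the predicted time is roughly
//   baseLat + (number of steps) * (per-hop hwLat for the link type used)
//           + bytes / bandwidth,
// where the bandwidth term is negligible for an 8-byte message.
double estimateRingAllReduceUs(double bytes, int nRanks,
                               double baseLatUs,   // fixed per-collective overhead (us)
                               double hopLatUs,    // per-hop latency for the link type (us)
                               double busBwGBs) {  // steady-state bus bandwidth (GB/s)
  const int nSteps = 2 * (nRanks - 1);
  const double latencyTermUs   = baseLatUs + nSteps * hopLatUs;
  const double bandwidthTermUs = (busBwGBs > 0.0) ? bytes / (busBwGBs * 1e3) : 0.0;
  return latencyTermUs + bandwidthTermUs;
}
```

Under that assumption, an 8-byte all-reduce on a single 8-GPU node would be latency-bound at roughly baseLat + 14 × (NVLink hwLat), and it's that kind of breakdown I'd like to sanity-check.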
How have these latency numbers changed over time? For example, how much faster is it to all-reduce a trivial amount of data (e.g. 8 bytes) across a node of 8 H100s today than it was across a node of 8 A100s in 2020 or 2021?
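To make that comparison concrete, the measurement I have in mind is roughly the following single-process sketch (the all_reduce_perf benchmark from nccl-tests does this more carefully; the iteration counts and the choice to use every visible GPU are arbitrary choices of mine):

```cpp
#include <cstdio>
#include <chrono>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

#define CHECK_CUDA(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
  printf("CUDA error %s at %s:%d\n", cudaGetErrorString(e), __FILE__, __LINE__); return 1; } } while (0)
#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
  printf("NCCL error %s at %s:%d\n", ncclGetErrorString(r), __FILE__, __LINE__); return 1; } } while (0)

int main() {
  int nDev = 0;
  CHECK_CUDA(cudaGetDeviceCount(&nDev));   // e.g. 8 on an 8-GPU node
  const size_t count = 2;                  // 2 floats = 8 bytes
  const int warmup = 20, iters = 200;      // arbitrary iteration counts

  std::vector<ncclComm_t> comms(nDev);
  std::vector<cudaStream_t> streams(nDev);
  std::vector<float*> sendbuf(nDev), recvbuf(nDev);

  CHECK_NCCL(ncclCommInitAll(comms.data(), nDev, nullptr));  // one rank per visible GPU
  for (int i = 0; i < nDev; ++i) {
    CHECK_CUDA(cudaSetDevice(i));
    CHECK_CUDA(cudaStreamCreate(&streams[i]));
    CHECK_CUDA(cudaMalloc(&sendbuf[i], count * sizeof(float)));
    CHECK_CUDA(cudaMalloc(&recvbuf[i], count * sizeof(float)));
  }

  // One "timed call": launch an all-reduce on every GPU, then wait for all of them.
  auto runOnce = [&]() {
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
      ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();
    for (int i = 0; i < nDev; ++i) { cudaSetDevice(i); cudaStreamSynchronize(streams[i]); }
  };

  for (int i = 0; i < warmup; ++i) runOnce();  // warm up channels and buffers

  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) runOnce();
  auto t1 = std::chrono::steady_clock::now();
  double avgUs = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
  printf("8-byte all-reduce across %d GPUs: %.2f us per call\n", nDev, avgUs);

  for (int i = 0; i < nDev; ++i) {
    cudaFree(sendbuf[i]); cudaFree(recvbuf[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```

Running the same measurement on an 8×A100 node and an 8×H100 node (each with the NCCL version contemporary to it) is the comparison I'm curious about.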
If there's a reference which explains this in detail, I would be happy if someone could direct me to it.