NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Questions about algorithm tuning #1011

Open annaa-ka opened 9 months ago

annaa-ka commented 9 months ago

Hi! Can you please answer some questions?

We are trying to optimize latency for Tree on big messages (4 GB) by changing NCCL_BUFFSIZE and the chunk sizes.

  1. I looked at the latency here. Is it calculated for a 256 kB chunk size and a 4 MB buffer size?

  2. What does baseLat refer to here?

  3. Also, here we can see how time is calculated: https://github.com/NVIDIA/nccl/blob/master/src/graph/tuning.cc#L396. The bandwidth is busBw * ratio, but the nccl-tests PERFORMANCE.md claims that busBW is similar for all algorithms. Here, however, we see a different bandwidth for each algorithm; what is the idea behind that? (A simplified sketch of this time formula follows after this list.)
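For context, my reading is that the estimate at that line boils down to a fixed per-algorithm latency term plus the transfer time at an algorithm-adjusted bandwidth. The sketch below is a simplification under my own assumptions: the real code keeps per-(algorithm, protocol) tables for baseLat and the bandwidth ratio, and the constants here are made up purely for illustration.

```cpp
#include <cstdio>
#include <cstddef>

// Hedged sketch of the time estimate in src/graph/tuning.cc:
//   time = baseLat + nBytes / (busBw * ratio)
// The names (baseLat, busBw, ratio) mirror the tables in that file, but the
// numbers passed in below are illustrative, not NCCL's actual values.
static float estimateTimeUs(size_t nBytes, float baseLatUs,
                            float busBwGBps, float ratio) {
  float effectiveBwGBps = busBwGBps * ratio;                  // algorithm-adjusted bandwidth
  return baseLatUs + (float)nBytes / (effectiveBwGBps * 1e3f); // GB/s -> bytes/us
}

int main() {
  size_t nBytes = 4ULL << 30; // 4 GiB message
  // Hypothetical numbers: same busBw for both algorithms, different
  // latency and bandwidth ratio.
  printf("ring: %.1f us\n", estimateTimeUs(nBytes, 6.8f, 100.0f, 1.0f));
  printf("tree: %.1f us\n", estimateTimeUs(nBytes, 4.4f, 100.0f, 0.9f));
  return 0;
}
```

With an identical busBw, it is the per-algorithm ratio and baseLat that make the predicted times diverge, which seems to be what the tuning tables encode.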

taekyounghan commented 3 months ago

Hello @annaa-ka

I'm also facing an issue with NCCL_BUFFSIZE when transmitting large messages.

Is there any shareable progress or insight?
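In case it helps while experimenting: as far as I understand, NCCL_BUFFSIZE is read from the environment when the communicator is created, so it has to be set before communicator initialization. A minimal sketch, where the 16 MiB value is only an example and not a recommendation:

```cpp
#include <cstdlib>

int main() {
  // NCCL_BUFFSIZE is specified in bytes; the documented default is 4194304 (4 MiB).
  // 16 MiB here is an arbitrary value to experiment with for large messages.
  setenv("NCCL_BUFFSIZE", "16777216", /*overwrite=*/1);

  // ... initialize CUDA and the NCCL communicator here, then run the
  // collective being measured ...
  return 0;
}
```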