NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Questions about algorithm tuning #1011

Open annaa-ka opened 9 months ago

annaa-ka commented 9 months ago

Hi! Can you please answer some questions?

We are trying to optimize latency for Tree on big messages (4 GB) by changing NCCL_BUFFSIZE and the chunk sizes.

  1. I looked at the latency here. Is it calculated for a 256 kB chunk size and a 4 MB buffer size?

  2. What does baseLat refer to here?

  3. Also, here we can see how time is calculated: https://github.com/NVIDIA/nccl/blob/master/src/graph/tuning.cc#L396. The bandwidth is busBw * ratio, but the nccl-tests PERFORMANCE.md claims that busBW is similar for all algorithms. Here, however, we see a different bandwidth for each algorithm; what is the idea behind that? (A simplified sketch of this time formula follows after this list.)
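For context, my reading is that the estimate at that line boils down to a fixed per-algorithm latency term plus the transfer time at an algorithm-adjusted bandwidth. The sketch below is a simplification under my own assumptions: the real code keeps per-(algorithm, protocol) tables for baseLat and the bandwidth ratio, and the constants here are made up purely for illustration.

```cpp
#include <cstdio>
#include <cstddef>

// Hedged sketch of the time estimate in src/graph/tuning.cc:
//   time = baseLat + nBytes / (busBw * ratio)
// The names (baseLat, busBw, ratio) mirror the tables in that file, but the
// numbers passed in below are illustrative, not NCCL's actual values.
static float estimateTimeUs(size_t nBytes, float baseLatUs,
                            float busBwGBps, float ratio) {
  float effectiveBwGBps = busBwGBps * ratio;                  // algorithm-adjusted bandwidth
  return baseLatUs + (float)nBytes / (effectiveBwGBps * 1e3f); // GB/s -> bytes/us
}

int main() {
  size_t nBytes = 4ULL << 30; // 4 GiB message
  // Hypothetical numbers: same busBw for both algorithms, different
  // latency and bandwidth ratio.
  printf("ring: %.1f us\n", estimateTimeUs(nBytes, 6.8f, 100.0f, 1.0f));
  printf("tree: %.1f us\n", estimateTimeUs(nBytes, 4.4f, 100.0f, 0.9f));
  return 0;
}
```

With an identical busBw, it is the per-algorithm ratio and baseLat that make the predicted times diverge, which seems to be what the tuning tables encode.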

taekyounghan commented 3 months ago

Hello @annaa-ka

I'm also facing an issue with NCCL_BUFFSIZE when transmitting large messages.

Is there any shareable progress or insight?
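In case it helps while experimenting: as far as I understand, NCCL_BUFFSIZE is read from the environment when the communicator is created, so it has to be set before communicator initialization. A minimal sketch, where the 16 MiB value is only an example and not a recommendation:

```cpp
#include <cstdlib>

int main() {
  // NCCL_BUFFSIZE is specified in bytes; the documented default is 4194304 (4 MiB).
  // 16 MiB here is an arbitrary value to experiment with for large messages.
  setenv("NCCL_BUFFSIZE", "16777216", /*overwrite=*/1);

  // ... initialize CUDA and the NCCL communicator here, then run the
  // collective being measured ...
  return 0;
}
```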