Inquiry about NCCL's Tree Algorithm Performance in Single and Dual Machine Scenarios

Dear NCCL Development Team,

I am exploring the performance of NCCL's tree algorithm for collective communication in single-machine and dual-machine setups. In my testing environment, the measured bandwidth is 72/125 on a single machine and 83/160 across two machines. My GPU server environment consists of 8 H800 GPUs and 8 CX7 network cards, with the GPUs connected to each other by NVLink.
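To make the figures easier to interpret, here is a minimal sketch that relates the two numbers in each pair. It assumes (this is not stated in the measurements themselves) that each pair is the algbw/busbw output of an nccl-tests all-reduce run in GB/s, with 8 ranks on one machine and 16 ranks on two. Under that assumption, nccl-tests derives bus bandwidth from algorithm bandwidth with the factor 2*(n-1)/n. The function name `allreduce_busbw` is made up for illustration.

```python
def allreduce_busbw(algbw: float, n_ranks: int) -> float:
    """Estimate all-reduce bus bandwidth from algorithm bandwidth.

    Uses the all-reduce correction factor 2*(n-1)/n documented by nccl-tests.
    Units follow the input (assumed GB/s here).
    """
    if n_ranks < 2:
        raise ValueError("all-reduce needs at least 2 ranks")
    return algbw * 2 * (n_ranks - 1) / n_ranks


# (label, first figure, second figure, assumed rank count)
# The rank counts are an assumption: 8 GPUs per machine, one rank per GPU.
measurements = [
    ("single machine", 72.0, 125.0, 8),
    ("dual machine", 83.0, 160.0, 16),
]

for label, algbw, reported, n_ranks in measurements:
    estimate = allreduce_busbw(algbw, n_ranks)
    print(f"{label}: algbw={algbw} reported={reported} "
          f"estimated busbw={estimate:.3f} (n={n_ranks})")
```

If the assumption holds, part of the gap between the second figures comes from the rank-dependent factor alone (1.75 for 8 ranks versus 1.875 for 16), so the first figures (72 versus 83) are the more direct comparison.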

Why does the dual-machine setup achieve higher bandwidth than the single-machine setup? Could you explain which factors contribute to this difference, given the bandwidth measurements and hardware configuration above and the characteristics of NCCL's tree algorithm?

Your expertise and guidance on this matter would be greatly appreciated. Thank you for your time and assistance.

Sincerely, fizzlove
