NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Is it safe or recommended to use multiple communicators for real distributed training #1520

Open ZhiyiHu1999 opened 3 days ago

ZhiyiHu1999 commented 3 days ago

Hello! In my parallelism strategy, for example with 4 nodes and 2 GPUs per node, I would like to create comm_0 spanning all 8 GPUs, comm_1 for the 4 GPUs on node_0 and node_1, and comm_2 for the other 4 GPUs on node_2 and node_3.
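
Concretely, this is roughly how I plan to set the communicators up (a minimal sketch, assuming one process per GPU and that the three ncclUniqueIds have already been exchanged out of band, e.g. via MPI; the variable names are just for illustration):

```c
#include <nccl.h>
#include <cuda_runtime.h>

ncclComm_t comm0, commSub;       // comm_0 (8 ranks) plus comm_1 or comm_2 (4 ranks)
int globalRank;                  // 0..7, one process per GPU
ncclUniqueId id0, id1, id2;      // one unique ID per communicator, broadcast beforehand

void setupComms(void) {
  cudaSetDevice(globalRank % 2); // 2 GPUs per node in this example

  // comm_0: all 8 GPUs, ranks 0..7
  ncclCommInitRank(&comm0, 8, id0, globalRank);

  // comm_1: global ranks 0..3 (node_0, node_1), local ranks 0..3
  // comm_2: global ranks 4..7 (node_2, node_3), local ranks 0..3
  if (globalRank < 4)
    ncclCommInitRank(&commSub, 4, id1, globalRank);      // rank 0..3 in comm_1
  else
    ncclCommInitRank(&commSub, 4, id2, globalRank - 4);  // rank 0..3 in comm_2
}
```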

If, by design, no collectives are ever concurrent on communicators that contain the same GPU (here, collectives on comm_0 and comm_1 are never concurrent, while collectives on comm_1 and comm_2 may run concurrently since they cover disjoint GPUs), is this a safe use of NCCL?
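
For example, on the first 4 GPUs the intended ordering would look roughly like this (a sketch only; buffers, counts, and the stream are set up elsewhere and the names are illustrative):

```c
// Issuing the comm_0 and comm_1 collectives on the same CUDA stream serializes
// them on each GPU, so the two overlapping communicators are never in flight
// at the same time on that GPU.
ncclAllReduce(sendA, recvA, countA, ncclFloat, ncclSum, comm0,   stream);
ncclAllReduce(sendB, recvB, countB, ncclFloat, ncclSum, commSub, stream);
// Meanwhile the other 4 GPUs may be running a collective on comm_2, which
// shares no GPU with comm_1.
```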

Also, does each communicator have its own Ring/Tree channels, and does each need its own rank identifiers from 0 to (nRanks_in_the_communicator - 1)? Thanks a lot!