Closed ProHuper closed 2 years ago
`nranks` means the number of ranks in the NCCL communicator. The decentralized algorithm enables hierarchical reduce by default, which means only inter-node decentralized communication is performed, with an intra-node allreduce before it and an intra-node broadcast after it. On a single node with 8 GPUs, that inter-node communicator therefore contains only one rank, which is why an odd number of ranks is reported. To try it out on 8 GPUs, set `hierarchical=False`. See the API documentation for details.
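To make the rank counting concrete, here is a minimal sketch in plain Python. It does not call Bagua; the function names `decentralized_comm_size` and `shift_one_is_valid` are hypothetical. It assumes that with hierarchical reduce one rank per node takes part in the decentralized step, and that `shift_one` needs an even number of ranks, which is inferred from the error message in this thread.

```python
def decentralized_comm_size(num_nodes: int, gpus_per_node: int, hierarchical: bool = True) -> int:
    """Return how many ranks take part in the decentralized exchange.

    Assumption: with hierarchical=True only one rank per node joins the
    inter-node step; with hierarchical=False every GPU is a rank.
    """
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return num_nodes if hierarchical else num_nodes * gpus_per_node


def shift_one_is_valid(nranks: int) -> bool:
    """Assumption: shift_one pairs ranks up, so it needs an even rank count."""
    return nranks % 2 == 0


# Single node with 8 GPUs, as in the question.
default_size = decentralized_comm_size(1, 8, hierarchical=True)
flat_size = decentralized_comm_size(1, 8, hierarchical=False)
print(default_size, shift_one_is_valid(default_size))
print(flat_size, shift_one_is_valid(flat_size))
```

Under these assumptions the default setting leaves a single rank in the communicator on one node, while `hierarchical=False` puts all 8 GPUs in it.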
Got it!
I used DecentralizedAlgorithm with peer_selection_mode set to shift_one on 8 GPUs. The Bagua backend says I have an odd number of ranks (only one), but the NCCL log shows that this job does have 8 GPUs. Does n_ranks here mean the number of nodes or the number of GPUs?