BaguaSys / bagua

Bagua Speeds up PyTorch
https://tutorials-8ro.pages.dev/
MIT License

errors with DecentralizedAlgorithm in shift_one mode #380

Closed by ProHuper 2 years ago

ProHuper commented 2 years ago

I used DecentralizedAlgorithm with peer_selection_mode set to shift_one on 8 GPUs. The Bagua backend says I have an odd number of ranks (only one), but you can see from the NCCL log that this job does have 8 GPUs. Does n_ranks here mean the number of nodes or the number of GPUs?

image

wangraying commented 2 years ago

nranks means the number of ranks in the NCCL communicator.

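The decentralized algorithm enables hierarchical reduce by default, which means only inter-node decentralized communication is performed, with an intra-node allreduce before it and an intra-node broadcast after it. On a single node with 8 GPUs, the inter-node communicator therefore contains only one rank, which is why the backend sees an odd number of ranks. To try it out on 8 GPUs, set hierarchical=False. See API for details.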
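To make the rank counting concrete, here is a minimal sketch in plain Python. It does not call Bagua; the helper names `decentralized_nranks` and `shift_one_ok` are hypothetical and only model the behaviour described above. It assumes that hierarchical mode leaves one participating rank per node, and that shift_one needs an even number of ranks, as the error in this issue suggests.

```python
def decentralized_nranks(num_nodes: int, gpus_per_node: int, hierarchical: bool) -> int:
    """Number of ranks taking part in the decentralized exchange (simplified model).

    Assumption: with hierarchical=True only one rank per node joins the
    inter-node communicator; with hierarchical=False every GPU joins.
    """
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return num_nodes if hierarchical else num_nodes * gpus_per_node


def shift_one_ok(nranks: int) -> bool:
    """Assumption: shift_one pairs ranks up, so it needs an even rank count."""
    return nranks % 2 == 0


if __name__ == "__main__":
    # The setup from this issue: one node with 8 GPUs.
    for hierarchical in (True, False):
        n = decentralized_nranks(num_nodes=1, gpus_per_node=8, hierarchical=hierarchical)
        print(f"hierarchical={hierarchical}: nranks={n}, shift_one ok={shift_one_ok(n)}")
```

In an actual training script the corresponding change would be to pass hierarchical=False together with peer_selection_mode="shift_one" when constructing DecentralizedAlgorithm. Check the API documentation for the exact constructor arguments in your Bagua version.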

ProHuper commented 2 years ago

got it!