@zhaone
You need to set node_rank when using multiple nodes. If you only use one node, you can ignore the node_rank, master_addr, and master_port parameters.
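To illustrate why node_rank matters, here is a minimal sketch. It assumes the launcher derives each worker's global rank as node_rank * nproc_per_node + local_rank, which is the usual convention in torch-style launchers; I have not checked this against Bagua's source. The helper names and the 2 nodes x 8 processes layout are hypothetical, picked to match "rank 8, nranks 16, device_id 0" in the reported log.

```python
from typing import Iterable, List


def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank under the assumed torch-style convention (not verified for Bagua)."""
    return node_rank * nproc_per_node + local_rank


def assigned_ranks(node_ranks: Iterable[int], nproc_per_node: int) -> List[int]:
    """All global ranks produced when every node starts nproc_per_node workers."""
    return sorted(
        global_rank(node, nproc_per_node, local)
        for node in node_ranks
        for local in range(nproc_per_node)
    )


NPROC_PER_NODE = 8   # hypothetical: 8 GPUs per node
WORLD_SIZE = 16      # matches "nranks 16" in the log

# Each node gets its own node_rank: ranks 0..15 are each claimed exactly once.
correct = assigned_ranks([0, 1], NPROC_PER_NODE)

# Both nodes share node_rank 0: ranks 0..7 are claimed twice, 8..15 never.
colliding = assigned_ranks([0, 0], NPROC_PER_NODE)
missing = sorted(set(range(WORLD_SIZE)) - set(colliding))

print("correct:", correct)
print("missing when node_rank is not distinct:", missing)
```

Under that assumption, a communicator that waits for all 16 ranks cannot finish initializing when some ranks are never assigned, which would be consistent with the hang described in the report.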
Describe the bug
A clear and concise description of what the bug is.
Programs get blocked when using multiple nodes. By setting export LOG_LEVEL=DEBUG, I can see that it gets stuck at BaguaSingleCommunicator, since it prints
2022-11-21T12:40:23.673510Z DEBUG bagua_core_internal::communicators: creating communicator, nccl_unique_id AgCwgcCQEwkAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=, rank 8, nranks 16, device_id 0, stream_ptr 94639511762624
but fails to print "al communicator initialized at XXX".
When I set --node_rank=0, the program runs smoothly.
Environment
(python3 -m pip install --pre bagua)?: yes
Reproducing
Please provide a minimal working example. This means the runnable code.
Please also write what exact commands are required to reproduce your results.
Additional context
Add any other context about the problem here.