Oneflow-Inc / DLPerf

DeepLearning Framework Performance Profiling Toolkit
Apache License 2.0
275 stars 27 forks

Wrong node_rank for multi-node training in the distributed bash code #113

Closed slyviacassell closed 3 years ago

slyviacassell commented 3 years ago

Hi, I think your distributed bash scripts may specify a wrong value for node_rank. The code below may be incorrect for launching the PyTorch distributed training program: node_rank should be different for each node, but your multi-node bash script never changes it.

CMD="python3 -m torch.distributed.launch --nproc_per_node=$num_gpus --nnodes $num_nodes --node_rank=0  --master_addr=$master_node  --master_port=$master_port $CMD"
nlqq commented 3 years ago

Yep, if you use multi-node training, node_rank is different for each node, and I show how to change it in the README.md (see the screenshot there). It has to be modified manually because, when using NCCL, you have to launch the script on each node by hand in PyTorch, and the script itself cannot know which node rank it should take in the cluster. For that reason, it's better not to bake this argument into the scripts. If you have a better suggestion, you're welcome to share it. A sketch of one possible approach follows below.
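If one did want to keep the argument inside the script, a common pattern is to read the node rank from the command line, so the same script can be launched on every node with a different rank. A minimal sketch, assuming a hypothetical wrapper (not one of the repo's scripts) and example values for the cluster settings:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: launch with the node rank as the first argument, e.g.
#   bash run_multinode.sh 0   # on the master node
#   bash run_multinode.sh 1   # on the second node
node_rank=${1:-0}
num_gpus=8                  # assumed GPUs per node
num_nodes=2                 # assumed number of nodes
master_node=10.0.0.1        # assumed master address
master_port=29500           # assumed master port
CMD="train.py"              # placeholder for the actual training entry point

CMD="python3 -m torch.distributed.launch --nproc_per_node=$num_gpus --nnodes=$num_nodes --node_rank=$node_rank --master_addr=$master_node --master_port=$master_port $CMD"
echo "$CMD"
eval "$CMD"
```

This keeps a single script per experiment while still letting each node supply its own rank at launch time.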

slyviacassell commented 3 years ago

Oww, my fault. I will close this issue. Thank you for your prompt reply!

nlqq commented 3 years ago

Thank you for using my scripts~ Any questions are welcome~