Closed light201212 closed 3 months ago
Hi, would you open an issue to discuss this problem? Since we have no plan to train OpenGait in a multi-machine manner, it may be hard to justify your updates on the whole. Thanks for your insights.
This is just a suggestion; if training runs on a single machine, there is no problem. But suppose you have two hosts, A and B, each with 4 GPUs, and train across both with DDP: `torch.distributed.get_rank()` returns the *global* rank, so it yields 0, 1, 2, 3 on A but 4, 5, 6, 7 on B. Using that value directly as a GPU ID raises an error, because the processes on B also need the local GPU IDs 0, 1, 2, 3.
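A minimal sketch of the idea (not the exact patch in this PR), assuming processes are started with `torchrun`, which sets the `LOCAL_RANK` environment variable; the modulo fallback additionally assumes every node has the same number of GPUs:

```python
import os

import torch
import torch.distributed as dist

# With 2 machines x 4 GPUs, dist.get_rank() returns 0-7 across the job,
# but the CUDA devices visible on each machine are only 0-3.
dist.init_process_group(backend="nccl")
global_rank = dist.get_rank()

# Prefer the per-node LOCAL_RANK set by torchrun; fall back to deriving it
# from the global rank (assumes a uniform GPU count per node).
local_rank = int(os.environ.get("LOCAL_RANK", global_rank % torch.cuda.device_count()))

# Always in 0-3 on every machine, so this is safe on both hosts A and B.
torch.cuda.set_device(local_rank)
```

The key point is that device selection must use the local rank, while collective operations keep using the global rank.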
Fix bug: prevent GPU ID calculation errors when multiple machines each have multiple GPUs.