Closed light201212 closed 3 months ago
Hi, would you open an issue to discuss this problem? Since we have no plan to train OpenGait in a multi-machine manner, it may be hard to justify your updates on the whole. Thanks for your insights.
This is just a suggestion; if training runs on a single machine, there is no problem. But suppose you have two hosts, A and B, each with 4 GPUs, and train across both with DDP: `torch.distributed.get_rank()` returns the *global* rank, so it yields 0, 1, 2, 3 on A but 4, 5, 6, 7 on B. Using that value directly as a GPU ID raises an error, because the processes on B also need the local GPU IDs 0, 1, 2, 3.
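A minimal sketch of the idea (not the exact patch in this PR), assuming processes are started with `torchrun`, which sets the `LOCAL_RANK` environment variable; the modulo fallback additionally assumes every node has the same number of GPUs:

```python
import os

import torch
import torch.distributed as dist

# With 2 machines x 4 GPUs, dist.get_rank() returns 0-7 across the job,
# but the CUDA devices visible on each machine are only 0-3.
dist.init_process_group(backend="nccl")
global_rank = dist.get_rank()

# Prefer the per-node LOCAL_RANK set by torchrun; fall back to deriving it
# from the global rank (assumes a uniform GPU count per node).
local_rank = int(os.environ.get("LOCAL_RANK", global_rank % torch.cuda.device_count()))

# Always in 0-3 on every machine, so this is safe on both hosts A and B.
torch.cuda.set_device(local_rank)
```

The key point is that device selection must use the local rank, while collective operations keep using the global rank.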
Fix bug: prevent GPU ID calculation errors when multiple machines each have multiple GPUs.