ShiqiYu / OpenGait

A flexible and extensible framework for gait recognition. With OpenGait, you can focus on designing your own models and easily compare them with state-of-the-art methods.

Update base_model.py #194

Closed by light201212 3 months ago

light201212 commented 3 months ago

Fix bug: prevent GPU ID calculation errors when training on multiple machines that each have multiple GPUs.

ChaoFan996 commented 3 months ago

Hi, would you open an issue to discuss this problem? Since we have no plan to train OpenGait in a multi-machine manner, it may be hard to justify your updates on the whole. Thanks for your insights.

light201212 commented 3 months ago

> Hi, would you open an issue to discuss this problem? Since we have no plan to train OpenGait in a multi-machine manner, it may be hard to justify your updates on the whole. Thanks for your insights.

This is just a suggestion. If training only runs on one machine, there is no problem. But suppose you have two hosts, A and B, each with 4 GPUs, and you train across both hosts using DDP. torch.distributed.get_rank() returns the global rank, so it gives 0, 1, 2, 3 on A but 4, 5, 6, 7 on B. Using that value as the GPU ID causes an error, because the processes on B also need GPU IDs 0, 1, 2, 3.
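
To make the mismatch concrete, here is a minimal sketch of how a per-host GPU ID could be derived instead of using the global rank directly. It is not the change from this pull request: `local_device_index` is a hypothetical helper, and the modulo fallback assumes that every host has the same number of GPUs and that ranks are assigned contiguously per host. When training is launched with torchrun, the `LOCAL_RANK` environment variable already carries the per-host index, so the sketch prefers it when present.

```python
import os


def local_device_index(global_rank, gpus_per_node, env=None):
    """Return the per-host GPU index for one DDP process (hypothetical helper)."""
    env = os.environ if env is None else env
    # torchrun exports LOCAL_RANK for every process it starts; prefer it.
    if "LOCAL_RANK" in env:
        return int(env["LOCAL_RANK"])
    # Fallback (assumption): ranks are contiguous per host and all hosts
    # have the same number of GPUs.
    return global_rank % gpus_per_node


# Two hosts with 4 GPUs each: global ranks 0-7 map to GPU IDs 0-3 on both.
for rank in range(8):
    print(rank, "->", local_device_index(rank, 4, env={}))
```

In a real training script the result would presumably be passed to `torch.cuda.set_device`, with `torch.distributed.get_rank()` as the global rank and `torch.cuda.device_count()` as the per-host GPU count.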