Denys88 / rl_games

RL implementations
MIT License

fixed bug with multi-gpu training for continuous a2c in a2c_common.py #284

Closed tttonyalpha closed 1 month ago

tttonyalpha commented 2 months ago

Fixed a bug with the missing torch.cuda.set_device(self.local_rank) call, which caused two different parallel processes to try to use the same GPU in multi-GPU training with the continuous a2c implementation.
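A rough sketch of where the added call sits in the multi-GPU setup path, not the exact rl_games code: the class name `TrainerBase` is a hypothetical stand-in for the trainer class in a2c_common.py, and the rank attributes are assumed to come from the torchrun environment variables.

```python
import os

import torch
import torch.distributed as dist


class TrainerBase:  # hypothetical stand-in for the trainer in a2c_common.py
    def __init__(self, multi_gpu=True):
        if multi_gpu:
            # torchrun sets these environment variables for each process.
            self.local_rank = int(os.getenv("LOCAL_RANK", "0"))
            self.global_rank = int(os.getenv("RANK", "0"))
            self.world_size = int(os.getenv("WORLD_SIZE", "1"))

            # The added line: pin this process to its own GPU so later
            # parameter broadcasts do not all target cuda:0.
            torch.cuda.set_device(self.local_rank)

            dist.init_process_group(
                "nccl", rank=self.global_rank, world_size=self.world_size
            )
```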

ViktorM commented 1 month ago

@tttonyalpha thank you for the PR. Could you briefly describe what the issue was with running parallel processes on the same GPU?

ViktorM commented 1 month ago

I just ran tests locally - training worked perfectly fine on a single GPU with 2 different parallel processes running.

tttonyalpha commented 1 month ago

Sorry for the late reply: this problem only occurs when using multiple GPUs. When I tried to run continuous a2c training on 2 GPUs using torchrun --standalone --nnodes=1 --nproc_per_node=2 with multi_gpu=True in the config, I got an error: NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device. This happens because, according to the torch DDP documentation, N-GPU parallelization requires N processes and each process must run exclusively on a single GPU. When we don't call torch.cuda.set_device(self.local_rank) before the parameter broadcast, torch tries to use the same GPU for every process (rank 0 and rank 1 in my case), which causes the error.
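A minimal standalone sketch (independent of rl_games) of the behavior described above: under torchrun, each rank must pin its own GPU with torch.cuda.set_device before any NCCL collective such as the parameter broadcast, otherwise every rank defaults to cuda:0 and NCCL reports the duplicate-GPU error.

```python
# Launch with: torchrun --standalone --nnodes=1 --nproc_per_node=2 repro.py
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])

    # Without this call every rank defaults to cuda:0, so NCCL sees the same
    # device for rank 0 and rank 1 and aborts with "Duplicate GPU detected".
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl")

    # Tensors created with device="cuda" now land on this rank's own GPU.
    params = torch.randn(10, device="cuda")

    # Broadcast from rank 0; this is the collective that fails when two
    # ranks share cuda:0.
    dist.broadcast(params, src=0)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```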

denysm88 commented 1 month ago

No problem, thanks! I'll merge it later this week.