tttonyalpha closed this 1 month ago
@tttonyalpha thank you for the PR. Could you briefly describe what the issue was with running parallel processes on the same GPU?
I just ran the tests locally - training worked perfectly fine on a single GPU with 2 parallel processes running.
Sorry for the long reply: this problem only occurs when using multiple GPUs. When I tried to run continuous A2C training on 2 GPUs using `torchrun --standalone --nnodes=1 --nproc_per_node=2` with `multi_gpu=True` in the config, I got an error:

`NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device`

This happens because, according to the torch DDP documentation, for N-GPU parallelization you have to create N processes, and each process must run exclusively on a single GPU. When we don't call `torch.cuda.set_device(self.local_rank)` before broadcasting the parameters, torch tries to use the same GPU for every process (rank 0 and rank 1 in my case), which causes the error.
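To illustrate the pattern described above, here is a minimal sketch of pinning each DDP process to its own GPU before any collective communication. The `setup_ddp` helper and `device_for_rank` function are hypothetical names for illustration, not code from this repository; the key line is the `torch.cuda.set_device` call, which must run before DDP broadcasts parameters.

```python
import os


def device_for_rank(local_rank: int) -> str:
    # Each DDP process must be pinned to its own GPU. Without this mapping,
    # every rank defaults to cuda:0 and NCCL raises "Duplicate GPU detected".
    return f"cuda:{local_rank}"


def setup_ddp():
    # Hypothetical setup sketch, assuming the process was launched via
    # torchrun, which sets the LOCAL_RANK environment variable per process.
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    # The missing call this PR adds: pin this process to its own GPU
    # *before* init_process_group / parameter broadcasting.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return torch.device(device_for_rank(local_rank))
```

With `--nproc_per_node=2`, torchrun spawns two processes with `LOCAL_RANK` set to 0 and 1 respectively, so rank 0 is pinned to `cuda:0` and rank 1 to `cuda:1`, avoiding the duplicate-GPU error.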
No problem, thanks! I'll merge it later this week.
Fixed a bug with a missing `torch.cuda.set_device(self.local_rank)` call, which caused two parallel processes to try to use the same GPU in multi-GPU training with the continuous A2C implementation.