## 🐛 Bug Report

Since the `average_metrics` function is called from `backward`, which must run on every GPU, the device created here should correspond to the current rank; otherwise `torch.distributed.all_reduce` will be stuck forever.
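For illustration, here is a minimal sketch of what the fix looks like. The signature (a dict of float metrics plus a world size) and the body are assumptions for the example, not the repository's actual code:

```python
import torch
import torch.distributed as dist

def average_metrics(metrics: dict, world_size: int) -> dict:
    # Hypothetical sketch: derive the device from the current rank instead of
    # hard-coding 'cuda:0'. With one process per GPU, get_rank() modulo the
    # local device count picks this process's own GPU, so every rank joins
    # all_reduce on a distinct device and the collective cannot deadlock.
    device = torch.device(f"cuda:{dist.get_rank() % torch.cuda.device_count()}")
    keys = sorted(metrics)
    packed = torch.tensor([metrics[k] for k in keys],
                          device=device, dtype=torch.float32)
    dist.all_reduce(packed, op=dist.ReduceOp.SUM)  # every rank must reach this
    return {k: v / world_size for k, v in zip(keys, packed.tolist())}
```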
We use Dora for all of our experiments, which typically calls `torch.cuda.set_device` at the beginning of training with the proper device. That allows using `'cuda'` everywhere afterwards without worrying about the rank of the GPU.
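A rough sketch of that setup pattern, for context (the `run_worker` helper is made up here; Dora performs the equivalent internally at startup):

```python
import torch
import torch.distributed as dist

def run_worker(rank: int, world_size: int) -> None:
    # Pin this process to its own GPU before anything else touches CUDA.
    torch.cuda.set_device(rank)
    # Assumes MASTER_ADDR / MASTER_PORT are set in the environment.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # From here on, the bare 'cuda' device resolves to GPU `rank`,
    # so downstream code never has to thread the rank through explicitly.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # safe: each rank contributes from its own GPU
```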