Hi @vaseline555,
1) Is your dataset size divisible by the number of GPUs?
If so, there should be no difference in the behavior of DistributedSampler and DistributedEvalSampler.
2) Are you using any kind of communication between processes that requires synchronization, e.g., back-propagation?
DistributedEvalSampler does not require any communication between processes, and I don't think it is the source of the hanging.
If you are using other synchronization-based operations, they may expect the same dataset length per process.
For example, if your total dataset size is 5 and you are using 3 processes, GPU 0 and 1 will be processing the 2nd item while GPU 2 is done after the 1st iteration.
If you are using a synchronization-based operation, GPU 0 and 1 will wait for a response from GPU 2 that will never arrive.
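As a quick illustration of that split, here is a minimal sketch assuming the evaluation sampler assigns each rank the indices rank, rank + num_replicas, ... without padding (illustrative only, not the exact implementation):

```python
# 5 samples split across 3 ranks without padding
dataset_size = 5
num_replicas = 3

indices = list(range(dataset_size))
for rank in range(num_replicas):
    shard = indices[rank::num_replicas]
    print(f"rank {rank}: {shard} -> {len(shard)} iterations")

# rank 0: [0, 3] -> 2 iterations
# rank 1: [1, 4] -> 2 iterations
# rank 2: [2]    -> 1 iteration  (finishes one iteration earlier)
```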
When I need to do backpropagation at test time for each item, I turn off synchronization.
self.model.model.G.require_backward_grad_sync = False # compute without DDP sync
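For reference, a minimal sketch of the same idea using DDP's documented no_sync() context manager (the model/loss/target names below are placeholders, not this repo's API):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder names; no_sync() disables the gradient all-reduce for the
# backward passes executed inside it, same effect as toggling
# require_backward_grad_sync above.
def test_time_step(ddp_model: DDP, inputs, target, loss_fn):
    with ddp_model.no_sync():
        output = ddp_model(inputs)
        loss = loss_fn(output, target)
        loss.backward()  # gradients stay local to this process
```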
Best, Seungjun
Dear @SeungjunNah,
Thank you for your detailed answers. As you presumed, it is exactly case 2 that I faced: uneven inputs are provided across different ranks.
Though there is no typical synchronization operation like backward(), except .item() or .detach().cpu(),
the main problem is the position where I called torch.distributed.barrier()...
I called it at the end of every iteration, not at the end of every epoch. Thus, when the inputs of the rank with fewer samples are depleted (it surely runs fewer iterations than the others), it escapes the evaluation loop earlier than the others, and the remaining ranks hang at the barrier...
I fixed it by moving the barrier to another position (i.e., the end of the epoch), and now things are going well.
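For anyone hitting the same issue, a minimal sketch of the fix (the loader and evaluation names are placeholders):

```python
import torch.distributed as dist

# eval_loader / evaluate stand in for the per-rank dataloader and per-batch step.
def evaluation_epoch(eval_loader, evaluate):
    for batch in eval_loader:
        evaluate(batch)
        # dist.barrier()  # WRONG here with uneven inputs: ranks with more
        # batches call it more often than the rank that finished early,
        # so the extra calls wait forever.

    # Safe: synchronize once, after every rank has drained its own loader.
    dist.barrier()
```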
While Googling, I found that many people have trouble handling uneven inputs when using DDP
(FYI: https://github.com/pytorch/pytorch/issues/38174; https://github.com/PyTorchLightning/pytorch-lightning/issues/3325; https://github.com/pytorch/pytorch/pull/72423). Even though I tried the DDP join() context manager, yours finally worked as a solution. 👍
I would like to thank you again for sharing your implementation of DistributedEvalSampler.
Have a nice day! Thank you.
Sincerely, Adam
How can I use DistributedEvalSampler when I have to use dist.all_gather() to collect results? Many thx!
@DaoD
I don't know where you want to call all_gather, but I do all_reduce outside the loop.
In my case, all processes are independent and the communications are done after the loop to collect loss/metric statistics.
In train.py, I compute loss/metrics from the outputs here.
self.criterion(output, target)
Outside the loop, here, I call self.criterion.normalize(), which is defined here with dist.all_reduce inside.
If you want to call all_gather during the for loop, I think it will hang.
But then, that will be the case you need all processes to work together and that's not an expected use case of DistributedEvalSampler.
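For illustration, a minimal sketch of that pattern, accumulating locally during the loop and reducing once afterwards (the model/criterion/loader names are placeholders, not this repository's exact API):

```python
import torch
import torch.distributed as dist

def evaluate_and_reduce(model, criterion, eval_loader, device):
    loss_sum = torch.zeros(1, device=device)
    count = torch.zeros(1, device=device)

    for inputs, target in eval_loader:  # length may differ per rank
        output = model(inputs.to(device))
        loss_sum += criterion(output, target.to(device)).detach()
        count += 1

    # Communicate only after every rank has finished its (possibly shorter) loop.
    dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM)
    dist.all_reduce(count, op=dist.ReduceOp.SUM)
    return (loss_sum / count).item()
```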
@SeungjunNah Thanks for your reply! I will try to use all_gather out of the data loop.
Dear author, Thank you first of all for your great work!
I am trying to use your implementation of DistributedEvalSampler for evaluation purposes, jointly with DDP (with shuffle=False and no call to set_epoch(); after calling DistributedEvalSampler to yield test samples for evaluating a model, my program should finish).
At the end of the script, my program hangs at 100% GPU utilization on 2 of the 3 GPUs (the last device terminates alone with no errors). When replaced with DistributedSampler, this does not occur.
I suspected it was because the logging (e.g., WandB) happens on the rank 0 device, but that is not the root cause, as the hang still occurs when I turn off the logging tool.
I wonder if you could point out any conditions that I missed? Thank you in advance.
Best, Adam