NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Issues with disabled reduction during gradient accumulation #708

Closed krishansubudhi closed 4 years ago

krishansubudhi commented 4 years ago

I want to disable all-reduce during gradient accumulation. If my gradient accumulation step count is 2, I want to enable all-reduce only on every other step. This will speed up my training.

Using this technique with Apex results in out-of-sync master gradients, and the model does not converge well.

Detailed blog: https://krishansubudhi.github.io/deeplearning/2020/02/06/apex-gradient-accumulation.html

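For concreteness, here is a minimal sketch of the pattern described above (not the blog's exact code; `train_epoch`, `dataloader`, and `accumulation_steps` are assumed names). It skips the gradient all-reduce on pure accumulation steps via `torch.nn.parallel.DistributedDataParallel`'s `no_sync()` context manager, with the loss scaled through `apex.amp`. Under these assumptions, this naive combination is the setup that exhibits the out-of-sync master gradients reported here.

```python
import contextlib

import torch.nn.functional as F
from apex import amp


def train_epoch(model, optimizer, dataloader, accumulation_steps=2):
    """Accumulate gradients locally; all-reduce only on update steps.

    Assumes `model` is wrapped in torch.nn.parallel.DistributedDataParallel
    and was initialized with amp.initialize(model, optimizer, ...).
    """
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        is_update_step = (step + 1) % accumulation_steps == 0

        # Suppress DDP's gradient all-reduce on pure accumulation steps.
        sync_context = contextlib.nullcontext() if is_update_step else model.no_sync()
        with sync_context:
            # Divide the loss so the accumulated gradient matches one large batch.
            loss = F.cross_entropy(model(inputs), targets) / accumulation_steps
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()

        if is_update_step:
            optimizer.step()
            optimizer.zero_grad()
```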

krishansubudhi commented 4 years ago

Found a solution to this issue. Details are in this blog post: https://krishansubudhi.github.io/deeplearning/2020/02/06/apex-gradient-accumulation.html
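For reference, one possible shape of such a fix is sketched below. It reuses the loop from the first comment and changes only the `amp.scale_loss` call, relying on apex's documented `delay_unscale` option; treating this as the mechanism of the fix is an assumption here, so see the blog post for the author's actual solution.

```python
import contextlib

import torch.nn.functional as F
from apex import amp


def train_epoch(model, optimizer, dataloader, accumulation_steps=2):
    """As above, but keep amp's FP32 master gradients in sync by delaying
    the unscale/copy until the step that actually all-reduces and updates."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(dataloader):
        is_update_step = (step + 1) % accumulation_steps == 0

        sync_context = contextlib.nullcontext() if is_update_step else model.no_sync()
        with sync_context:
            loss = F.cross_entropy(model(inputs), targets) / accumulation_steps
            # delay_unscale=True skips the unscale + copy into master grads on
            # accumulation steps, so the masters are populated only on the
            # update step, after the gradients have been all-reduced.
            with amp.scale_loss(loss, optimizer,
                                delay_unscale=not is_update_step) as scaled_loss:
                scaled_loss.backward()

        if is_update_step:
            optimizer.step()
            optimizer.zero_grad()
```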