Summary:
We were unnecessarily syncing gradients on gradient accumulation steps where no optimizer step was performed. With this change, gradient synchronization happens only at the end of each optimizer_period instead of on every step, removing the per-step communication bottleneck.

This should yield a significant speedup, especially when optimizer_period and the number of nodes are large.
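For illustration, a minimal sketch of the pattern this change implements, using PyTorch DDP's no_sync() context manager. The names train_epoch and optimizer_period, and the loop structure, are assumptions for the example, not the actual code in this diff:

```
import contextlib

import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def train_epoch(model: DDP, optimizer, loader, optimizer_period: int):
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        is_update_step = (step + 1) % optimizer_period == 0
        # Suppress the gradient all-reduce on accumulation steps; the
        # sync only happens on the backward pass of the update step.
        sync_ctx = contextlib.nullcontext() if is_update_step else model.no_sync()
        with sync_ctx:
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            (loss / optimizer_period).backward()
        if is_update_step:
            optimizer.step()
            optimizer.zero_grad()
```

Because no_sync() defers communication, the locally accumulated gradients are all-reduced once per optimizer_period rather than once per backward pass.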
Differential Revision: D24969761