Yup, the gradients are being synchronized, but I believe those are the gradients of the model parameters, which are in fp16 - https://github.com/pytorch/pytorch/blob/a77b391de723a69fb59ff6ae9d1236ca93f03a97/torch/nn/parallel/distributed.py#L310
Here's my limited understanding of what is happening: our code keeps two copies of the gradients - one in half precision (the model gradients) and one in full precision (the optimizer gradients).
But you are right, I should probably test this instead of theorizing
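To make the two-copies idea concrete, here's a minimal sketch of the pattern I'm describing, assuming plain SGD - `master_params` and `training_step` are just illustrative names, not our actual code:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().half()     # fp16 model -> fp16 .grad buffers

# fp32 "master" copy of the weights; this is what the optimizer actually updates
master_params = [p.detach().clone().float().requires_grad_(True)
                 for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=1e-5)

def training_step(loss):
    loss.backward()                                    # half-precision model gradients
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.detach().float()          # full-precision optimizer gradients
    optimizer.step()                                   # tiny updates survive in fp32
    model.zero_grad()
    for master, p in zip(master_params, model.parameters()):
        p.data.copy_(master.data)                      # cast updated weights back to fp16
```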
Side note: the optimizer/update step needs to be in fp32 to accommodate very small numbers - a gradient times a .00001 learning rate gives a number too small for fp16, so the update underflows to zero. Maybe you'll get NaNs too - https://discuss.pytorch.org/t/adam-half-precision-nans/1765/7
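For example, the same update underflows in fp16 but not in fp32 (the numbers below are made up for illustration):

```python
import torch

grad = torch.tensor(1e-4, dtype=torch.float16)   # a typical small gradient value
lr = 1e-5
print(grad * lr)           # fp16: underflows to tensor(0., dtype=torch.float16)
print(grad.float() * lr)   # fp32: ~1e-9, still representable
```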
The cool thing about this is that sometimes you don't care about those really small gradient updates. In that case, we don't keep a separate full-precision copy of the gradients, and we actually get higher accuracy: https://github.com/diux-dev/cluster/blob/master/pytorch-cifar/train_cifar10_bkj.py#L198 - it kind of acts as a regularizer.
Interesting point about it acting as a regularizer. "Deep Gradient Compression" also zeros out the smallest 99.9% of gradient entries, but it keeps track of the dropped values and eventually adds them back in, similar to what's done in this paper. Feel free to close if you think this is resolved.
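Roughly the error-feedback trick as I understand it from that paper - dropped entries go into a local residual that gets added back on later steps (sparsity handling simplified, names are illustrative):

```python
import torch

def compress_with_error_feedback(grad, residual, sparsity=0.999):
    grad = grad + residual                    # re-add previously dropped updates
    k = max(1, int(grad.numel() * (1 - sparsity)))
    # keep only the k largest-magnitude entries
    threshold = grad.abs().flatten().topk(k).values.min()
    mask = grad.abs() >= threshold
    sent = grad * mask                        # what actually gets communicated
    residual = grad * (~mask)                 # kept locally for the next step
    return sent, residual
```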
From the bandwidth numbers it seems we are doing fp16.
This line suggests that the gradients getting synchronized are fp32 instead of fp16 @bearpelican:
https://github.com/pytorch/pytorch/blob/a77b391de723a69fb59ff6ae9d1236ca93f03a97/torch/nn/parallel/distributed.py#L340
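For context, a stripped-down sketch of what that synchronization boils down to (the real code coalesces gradients into buckets, but conceptually it's an all-reduce followed by an average, and the buffers take whatever dtype the gradients have):

```python
import torch.distributed as dist

def allreduce_gradients(model, world_size):
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
            p.grad.data /= world_size
```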