Yup, the gradients are being synchronized, but I believe those are the gradients of the model parameters, which are in fp16 - https://github.com/pytorch/pytorch/blob/a77b391de723a69fb59ff6ae9d1236ca93f03a97/torch/nn/parallel/distributed.py#L310
Here's my limited understanding of what is happening: our code keeps two copies of the gradients - one in half precision (the model gradients) and one in full precision (the optimizer gradients).
But you are right, I should probably test this instead of theorizing
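To make the two-copies idea concrete, here's a minimal sketch of the pattern I'm describing, assuming plain SGD - `master_params` and `training_step` are just illustrative names, not our actual code:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda().half()     # fp16 model -> fp16 .grad buffers

# fp32 "master" copy of the weights; this is what the optimizer actually updates
master_params = [p.detach().clone().float().requires_grad_(True)
                 for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=1e-5)

def training_step(loss):
    loss.backward()                                    # half-precision model gradients
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.detach().float()          # full-precision optimizer gradients
    optimizer.step()                                   # tiny updates survive in fp32
    model.zero_grad()
    for master, p in zip(master_params, model.parameters()):
        p.data.copy_(master.data)                      # cast updated weights back to fp16
```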
Side note: the optimizer/update step needs to be in fp32 to accommodate very small numbers - a gradient times a .00001 learning rate gives a number too small for fp16, so the update underflows to zero. Maybe you'll get NaNs too - https://discuss.pytorch.org/t/adam-half-precision-nans/1765/7
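For example, the same update underflows in fp16 but not in fp32 (the numbers below are made up for illustration):

```python
import torch

grad = torch.tensor(1e-4, dtype=torch.float16)   # a typical small gradient value
lr = 1e-5
print(grad * lr)           # fp16: underflows to tensor(0., dtype=torch.float16)
print(grad.float() * lr)   # fp32: ~1e-9, still representable
```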
The cool thing about this is that sometimes you don't care about those really small gradient updates. In that case, we don't keep a separate full-precision copy of the gradients, and we actually get higher accuracy: https://github.com/diux-dev/cluster/blob/master/pytorch-cifar/train_cifar10_bkj.py#L198 - it kind of acts as a regularizer.
Interesting point about it acting as a regularizer. "Deep Gradient Compression" also zeros out the smallest 99.9% of gradient entries, but it keeps track of the dropped values and eventually adds them back in, similar to what's done in this paper. Feel free to close if you think this is resolved.
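Roughly the error-feedback trick as I understand it from that paper - dropped entries go into a local residual that gets added back on later steps (sparsity handling simplified, names are illustrative):

```python
import torch

def compress_with_error_feedback(grad, residual, sparsity=0.999):
    grad = grad + residual                    # re-add previously dropped updates
    k = max(1, int(grad.numel() * (1 - sparsity)))
    # keep only the k largest-magnitude entries
    threshold = grad.abs().flatten().topk(k).values.min()
    mask = grad.abs() >= threshold
    sent = grad * mask                        # what actually gets communicated
    residual = grad * (~mask)                 # kept locally for the next step
    return sent, residual
```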
From the bandwidth numbers it seems we are doing fp16.
This line suggests that the gradients getting synchronized are fp32 instead of fp16 @bearpelican:
https://github.com/pytorch/pytorch/blob/a77b391de723a69fb59ff6ae9d1236ca93f03a97/torch/nn/parallel/distributed.py#L340
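For context, a stripped-down sketch of what that synchronization boils down to (the real code coalesces gradients into buckets, but conceptually it's an all-reduce followed by an average, and the buffers take whatever dtype the gradients have):

```python
import torch.distributed as dist

def allreduce_gradients(model, world_size):
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
            p.grad.data /= world_size
```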