cybertronai / pytorch-lamb

Implementation of https://arxiv.org/abs/1904.00962
MIT License

Implementation question #5

Open dhpollack opened 5 years ago

dhpollack commented 5 years ago

I noticed that in your implementation you clamp the weight_norm to a min of 0 and a max of 10. I have seen this 10 in other implementations and traced it back to the first version of the LAMB paper. However, in the paper that number refers to the trust_ratio, not the weight_norm. Have you done any further experiments with this, or did you adopt 10 because other implementations use it? I implemented LAMB against both v1 and the latest version of the paper and didn't notice a difference. I just wanted to know whether you did additional testing or were aware of this discrepancy.
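For concreteness, here is a minimal sketch of the two choices, clamping the weight norm before forming the ratio versus clamping the trust ratio itself, both to [0, 10]. This is not the repo's actual code; the function name and the `clamp_weight_norm` flag are made up for illustration:

```python
import torch

def lamb_trust_ratio(param: torch.Tensor, adam_step: torch.Tensor,
                     clamp_weight_norm: bool = True) -> float:
    """Illustrative sketch of the two clamping choices discussed above."""
    weight_norm = param.norm(p=2)
    step_norm = adam_step.norm(p=2)

    # Fall back to a plain Adam step when either norm is zero.
    if weight_norm == 0 or step_norm == 0:
        return 1.0

    if clamp_weight_norm:
        # What this issue describes the repo doing: clamp ||w|| to [0, 10],
        # then divide by the norm of the Adam update.
        return (weight_norm.clamp(0, 10) / step_norm).item()
    else:
        # What LAMB v1 describes: clamp the ratio ||w|| / ||update|| to [0, 10].
        return (weight_norm / step_norm).clamp(0, 10).item()
```

The two differ in that the first caps only the numerator, so with a small weight norm and a very small update norm it can still produce trust ratios far above 10, while the second caps the final ratio itself.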

8enmann commented 5 years ago

I haven't tried different values, but yes, I took 10 from v1 of the paper and am aware of the discrepancy. I have tested the algorithm on a large-scale language model and it seems to scale well. I've also tracked the weight norms of different layers and didn't see a clear reason to use a number other than 10. Let me know if you experiment and find a better value!

Tony-Y commented 5 years ago

https://gist.github.com/redknightlois/c4023d393eb8f92bb44b2ab582d7ec20#gistcomment-3010232

This comment on Ralamb may be helpful.

8enmann commented 5 years ago

@Tony-Y's link shows a comment from the original author saying they use the identity function instead of clipping. Thanks Tony!
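Based on that comment, a hedged sketch of the no-clipping variant (the identity function applied to the ratio); the function name is hypothetical:

```python
import torch

def trust_ratio_identity(param: torch.Tensor, adam_step: torch.Tensor) -> float:
    """No clipping: use the raw ratio ||w|| / ||update||,
    falling back to 1 when either norm is zero."""
    weight_norm = param.norm(p=2)
    step_norm = adam_step.norm(p=2)
    if weight_norm == 0 or step_norm == 0:
        return 1.0
    return (weight_norm / step_norm).item()
```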

hitvoice commented 4 years ago

DeepSpeed trains BERT with LAMB and clips to [0.08, 0.5]: https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/bert-pretraining.md#reproducing-bert-training-results-with-deepspeed

It's quite interesting and confusing that such different values are used in different implementations.
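If anyone wants to try that range here, a small sketch of clamping the trust ratio itself to a configurable window; the bounds below are just the values from the DeepSpeed tutorial, not this repo's defaults, and the function name is made up:

```python
import torch

def trust_ratio_clamped(param: torch.Tensor, adam_step: torch.Tensor,
                        min_ratio: float = 0.08, max_ratio: float = 0.5) -> float:
    """Clamp the trust ratio to a [min, max] window."""
    weight_norm = param.norm(p=2)
    step_norm = adam_step.norm(p=2)
    if weight_norm == 0 or step_norm == 0:
        return 1.0
    return (weight_norm / step_norm).clamp(min_ratio, max_ratio).item()
```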

8enmann commented 4 years ago

The author open-sourced their implementation: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py