Open dhpollack opened 5 years ago

I noticed that in your implementation you've clamped the values of the `weight_norm` to a min of 0 and a max of 10. I have seen this 10 in other implementations and noticed that it comes from the first version of the LAMB paper. However, that number refers to the `trust_ratio`, not the `weight_norm`. Have you done any further experiments with this, or were you looking at other implementations of the paper and decided to use 10 for that reason? I also implemented LAMB with both v1 and the latest version of the paper and didn't notice a difference. Just wanted to know whether you did additional testing or were aware of this issue.
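To make the distinction concrete, here's a minimal sketch of the two variants (the function names and the eps handling are mine, not code from this repo or the paper):

```python
import torch

def trust_ratio_clamped_weight_norm(param, update, eps=1e-6):
    # Variant in this repo: the *weight norm* is clamped to [0, 10]
    # before forming the ratio.
    weight_norm = param.norm().clamp(0, 10)
    update_norm = update.norm()
    return weight_norm / (update_norm + eps)

def trust_ratio_clamped_ratio(param, update, eps=1e-6):
    # Variant from v1 of the paper: the *trust ratio* itself is clamped to [0, 10].
    weight_norm = param.norm()
    update_norm = update.norm()
    return (weight_norm / (update_norm + eps)).clamp(0, 10)
```

Note that these are not equivalent: clamping the weight norm caps the ratio at roughly 10 / ||update||, while clamping the ratio caps it at 10 regardless of the update's size.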
I haven't tried different values, but yes, I took 10 from v1 of the paper and am aware of the discrepancy. I have tested the algorithm on a large-scale language model and it seems to scale well. I've also tracked the weight norms of different layers and didn't see a clear reason to use a number other than 10. Let me know if you experiment and find a better value!
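For anyone who wants to repeat that check, here's a minimal sketch of the kind of per-layer tracking I mean (the model here is just a placeholder):

```python
import torch

def log_weight_norms(model):
    # L2 norm of every parameter tensor; watch whether any layer's norm
    # ever comes close to the clamp ceiling of 10.
    return {name: p.detach().norm().item()
            for name, p in model.named_parameters()}

model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.Linear(64, 10))
print(log_weight_norms(model))
```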
This comment on Ralamb may be helpful: https://gist.github.com/redknightlois/c4023d393eb8f92bb44b2ab582d7ec20#gistcomment-3010232
@Tony-Y 's link shows a comment from the original author saying they use the identity function instead of clipping. Thanks Tony!
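So with the identity scaling function, the raw ratio is used directly. A sketch of that variant (my naming; the zero-norm fallback to 1.0 is a common convention, not necessarily the author's):

```python
def trust_ratio_identity(weight_norm, update_norm, eps=1e-6):
    # phi(x) = x: neither the weight norm nor the ratio is clipped.
    if weight_norm == 0 or update_norm == 0:
        return 1.0
    return weight_norm / (update_norm + eps)
```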
DeepSpeed trains BERT with LAMB and clips to [0.08, 0.5]: https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/bert-pretraining.md#reproducing-bert-training-results-with-deepspeed
It's quite interesting and confusing that such different values are used in different implementations.
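If I read that config right (I'm assuming the clip is applied to the trust ratio itself), the effect is a narrow band rather than just an upper bound, roughly:

```python
def trust_ratio_banded(weight_norm, update_norm, eps=1e-6,
                       min_coeff=0.08, max_coeff=0.5):
    # Clamp the ratio into [min_coeff, max_coeff]; since max_coeff < 1,
    # this scaling always shrinks the step relative to a ratio of 1.
    ratio = weight_norm / (update_norm + eps)
    return min(max(ratio, min_coeff), max_coeff)

print(trust_ratio_banded(3.0, 12.0))  # 3/12 = 0.25, inside the band
```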
The author open-sourced theirs! https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py