Closed: xiaoda99 closed this issue 5 years ago
The original paper used (0.9, 0.98): https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
We never tuned this hyperparameter, so I'm not sure how much of a difference it makes compared to the PyTorch default of 0.999.
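For reference, a minimal PyTorch sketch of how these betas are passed to the optimizer (the model, learning rate, and eps here are placeholders, not fairseq's actual training setup):

```python
import torch

# Placeholder model; in practice this would be the Transformer being trained.
model = torch.nn.Linear(512, 512)

# betas=(0.9, 0.98) matches the Vaswani et al. 2017 setting used in the
# transformer-big examples; PyTorch's own default is betas=(0.9, 0.999).
optimizer = torch.optim.Adam(
    model.parameters(), lr=5e-4, betas=(0.9, 0.98), eps=1e-9
)
```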
Hi guys,
In the en-de and en-fr transformer-big examples, adam_betas is set to (0.9, 0.98), which is the setting used in the Vaswani et al. 2017 paper. However, in the latest tensor2tensor repo, the Adam betas are set to (0.9, 0.997), which is closer to Adam's default settings of (0.9, 0.999). https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py#L1489
Have you experimented with how different adam_betas settings affect the final result, especially with large-batch training (e.g. update_freq=16)?
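As a rough back-of-envelope intuition (my own reasoning, not anything from the fairseq or tensor2tensor docs): beta2 sets the averaging horizon of Adam's second-moment estimate, roughly 1 / (1 - beta2) optimizer steps, so with large effective batches (update_freq=16 means far fewer optimizer steps overall) a smaller beta2 like 0.98 forgets old gradient statistics much faster than 0.997 or 0.999:

```python
# Approximate averaging window of Adam's second-moment EMA: ~1 / (1 - beta2).
for beta2 in (0.98, 0.997, 0.999):
    window = 1.0 / (1.0 - beta2)
    print(f"beta2={beta2}: ~{window:.0f} optimizer steps")
# beta2=0.98: ~50 optimizer steps
# beta2=0.997: ~333 optimizer steps
# beta2=0.999: ~1000 optimizer steps
```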