facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Difference in Adam betas setting from tensor2tensor #343

Closed · xiaoda99 closed this issue 5 years ago

xiaoda99 commented 5 years ago

Hi guys,

In the en-de and en-fr transformer-big examples, adam_betas is set to (0.9, 0.98), which is the setting used in the Vaswani et al. (2017) paper. However, in the latest tensor2tensor repo, the Adam betas are set to (0.9, 0.997), which is also closer to Adam's default of (0.9, 0.999): https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py#L1489
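For intuition, beta2 sets the decay rate of Adam's second-moment exponential moving average, and 1 / (1 - beta2) gives a rough effective averaging horizon in updates, so the three settings differ quite a bit. A quick back-of-envelope sketch, just for illustration:

```python
# Rough "effective horizon" of Adam's second-moment EMA: a decay of
# beta2 weights history over roughly the last 1 / (1 - beta2) updates.
for name, beta2 in [("fairseq / Vaswani et al. 2017", 0.98),
                    ("tensor2tensor", 0.997),
                    ("Adam default", 0.999)]:
    print(f"{name}: beta2={beta2} -> ~{1 / (1 - beta2):.0f} updates")
# fairseq / Vaswani et al. 2017: beta2=0.98 -> ~50 updates
# tensor2tensor: beta2=0.997 -> ~333 updates
# Adam default: beta2=0.999 -> ~1000 updates
```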

Have you experimented with the effect of different Adam betas settings on the final result, especially with large-batch training (e.g., update_freq=16)?
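In case it helps frame the question, here is a minimal illustrative sketch (a plain Linear standing in for the transformer) of how the two settings look once they reach torch.optim.Adam, which is where fairseq's --adam-betas string ultimately ends up:

```python
import torch

model = torch.nn.Linear(512, 512)  # toy stand-in for a transformer

# fairseq's transformer-big recipes: --adam-betas '(0.9, 0.98)'
opt_fairseq = torch.optim.Adam(model.parameters(), lr=5e-4,
                               betas=(0.9, 0.98))

# tensor2tensor's current setting, from the link above
opt_t2t = torch.optim.Adam(model.parameters(), lr=5e-4,
                           betas=(0.9, 0.997))
```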

myleott commented 5 years ago

The original paper used (0.9, 0.98): https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

We never tuned this hparam, so I'm not sure how much of a difference it makes compared to the PyTorch default of 0.999.
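A quick way to double-check what PyTorch ships with:

```python
import inspect
import torch

# Read the default betas straight off torch.optim.Adam's signature.
print(inspect.signature(torch.optim.Adam.__init__).parameters["betas"].default)
# (0.9, 0.999)
```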