shawnkx closed this issue 4 years ago
In the original implementation, the maximum learning rate for Transformer Base is `hidden_size ** -0.5 * (warmup_steps ** -0.5)`. Given `hidden_size = 512` and `warmup_steps = 4000`, the learning rate is 0.000698, which is close to 7e-4.
I noticed that in the PyTorch version of the Transformer you use 7e-4 as the learning rate. Could you tell me why you chose this instead of deriving it as `512 ** -0.5 * 4000 ** -0.5`? Thanks!
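For reference, here is a minimal sketch of the inverse-square-root schedule from "Attention Is All You Need", whose peak (reached at `step == warmup_steps`) works out to the ~7e-4 value quoted above; the function name is just for illustration:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Noam schedule: linear warmup, then inverse-sqrt decay.

    The peak is d_model ** -0.5 * warmup_steps ** -0.5,
    reached exactly at step == warmup_steps.
    """
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Peak learning rate at the end of warmup:
peak = transformer_lr(4000)   # ≈ 0.000699, i.e. close to 7e-4
```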