shawnkx closed this issue 4 years ago
In the original implementation, the maximum learning rate for Transformer Base is `hidden_size ** -0.5 * (warmup_steps ** -0.5)`. Given `hidden_size = 512` and `warmup_steps = 4000`, the learning rate is 0.000698, which is close to 7e-4.
I noticed that in the PyTorch version of the Transformer you use 7e-4 as the learning rate. Could you tell me why you chose this instead of deriving it as `512 ** -0.5 * 4000 ** -0.5`? Thanks!
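For reference, here is a minimal sketch of the inverse-square-root schedule from "Attention Is All You Need", whose peak (reached at `step == warmup_steps`) works out to the ~7e-4 value quoted above; the function name is just for illustration:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Noam schedule: linear warmup, then inverse-sqrt decay.

    The peak is d_model ** -0.5 * warmup_steps ** -0.5,
    reached exactly at step == warmup_steps.
    """
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Peak learning rate at the end of warmup:
peak = transformer_lr(4000)   # ≈ 0.000699, i.e. close to 7e-4
```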