In your paper, you use the learning rate above where β = 5000, which meaning the learning rate is less than 10−8. However, in your code, there is
Thus, the learning rate in your code is about 2e-5 with the parameter "scale_lr". Is there anything wrong?
In your paper, you use the learning rate above where β = 5000, which meaning the learning rate is less than 10−8. However, in your code, there is Thus, the learning rate in your code is about 2e-5 with the parameter "scale_lr". Is there anything wrong?