Dear Author,
I really appreciate the impressive code and paper. I noticed that in your paper the AdamW learning rate is set to lr=0.000025 (i.e., 2.5e-5). Since this is my first time working with this optimizer, I would like to ask how one should choose the learning rate for a model with a large number of parameters. How did you arrive at lr=0.000025 in your experiments, and could you share any tuning advice?
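To make my question more concrete: did you use lr=0.000025 as a constant value, or as the peak of a schedule? My current (possibly wrong) understanding is that large models are often trained with a small peak learning rate combined with warmup and decay, something like the sketch below. The warmup_steps and total_steps values here are placeholders I invented, not numbers from your paper:

```python
import math

def lr_at_step(step, peak_lr=2.5e-5, warmup_steps=1000, total_steps=100_000):
    """Linear warmup to peak_lr, then cosine decay toward zero.

    All hyperparameter values here are illustrative placeholders,
    not the schedule actually used in the paper.
    """
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(500))      # halfway through warmup
print(lr_at_step(1000))     # peak learning rate
print(lr_at_step(100_000))  # end of training
```

Is your setup something along these lines, or did you simply keep the learning rate fixed throughout training?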
Thank you!