GanjinZero closed this issue 7 months ago.
We appreciate your interest in the learning rate scheduler choices for DeepSeek LLM.
The rationale behind our decision to use a multi-stage learning rate scheduler will be detailed in our forthcoming technical report, but I can share some preliminary insights based on our current understanding. We suspect the cosine scheduler may perform better in the initial stage, primarily because its learning rate decreases continuously. That said, based on our observations so far, the overall difference between the two schedulers does not appear to be large. Please stay tuned for the detailed report for more definitive conclusions.
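For readers unfamiliar with the two schedules being compared, here is a minimal sketch in plain Python. The boundary fractions and decay factors in `multi_stage_lr` are illustrative assumptions for this example, not DeepSeek's actual training configuration:

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    # Cosine decay: learning rate falls continuously from peak_lr to min_lr.
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

def multi_stage_lr(step, total_steps, peak_lr,
                   boundaries=(0.8, 0.9), factors=(0.316, 0.1)):
    # Multi-stage (piecewise-constant) schedule: hold peak_lr, then drop
    # by a fixed factor at each boundary fraction of training.
    # boundaries/factors here are hypothetical placeholder values.
    lr = peak_lr
    for b, f in zip(boundaries, factors):
        if step / total_steps >= b:
            lr = peak_lr * f
    return lr
```

The key difference: the cosine schedule decays smoothly from the start, while the multi-stage schedule keeps the learning rate flat and then drops it abruptly, which is what produces the sudden loss improvements at the decay points.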
There appears to be a significant performance jump at the points where the learning rate decays.