deepseek-ai / DeepSeek-LLM

DeepSeek LLM: Let there be answers
https://chat.deepseek.com/
MIT License

Learning rate schedule seems very helpful. #1

Closed: GanjinZero closed this issue 7 months ago

GanjinZero commented 7 months ago

There appears to be a significant performance jump at the points where the learning rate decays.

zdaxie commented 7 months ago

We appreciate your interest in the learning rate scheduler choices for DeepSeek LLM.

The rationale behind our decision to use a multi-stage learning rate scheduler will be detailed in our forthcoming technical report, but I would like to share some preliminary insights based on our current understanding. My guess is that a cosine learning rate scheduler may perform better in the early stage of training, primarily because its learning rate decreases continuously, while a multi-stage scheduler holds the rate constant within each stage. Also, based on our observations so far, the overall difference between the two schedulers does not appear to be large. Please stay tuned for our detailed report for more definitive conclusions.
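To make the comparison concrete, here is a minimal sketch of the two schedule shapes in plain Python. It is not the DeepSeek LLM implementation: the function names, the linear warmup, and the stage boundaries and decay factors (drops at 80% and 90% of training) are illustrative assumptions, not values stated in this thread.

```python
import math


def cosine_lr(step, total_steps, max_lr, min_lr, warmup_steps):
    """Cosine schedule: linear warmup, then a smooth decay from max_lr to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup training that has elapsed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def multi_stage_lr(step, total_steps, max_lr, warmup_steps,
                   stages=((0.8, 0.316), (0.9, 0.1))):
    """Multi-stage schedule: linear warmup, then constant within each stage.

    `stages` holds (fraction_of_training, lr_factor) pairs in increasing order.
    The default values are hypothetical and chosen only for illustration.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    frac = step / total_steps
    scale = 1.0
    for boundary, factor in stages:
        if frac >= boundary:
            scale = factor  # the most recent boundary passed sets the factor
    return max_lr * scale


if __name__ == "__main__":
    total, warmup, peak = 10_000, 100, 1.0
    for s in (100, 5_000, 8_500, 9_500):
        print(s,
              round(cosine_lr(s, total, peak, 0.1 * peak, warmup), 4),
              round(multi_stage_lr(s, total, peak, warmup), 4))
```

Under these assumed settings the cosine schedule is already well below the peak rate by mid-training, while the multi-stage schedule stays at the peak until its first boundary and then drops abruptly. That abrupt drop is where a sudden change in the loss curve, like the one reported above, would be expected to show up.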