deepseek-ai / DeepSeek-LLM

DeepSeek LLM: Let there be answers

About LR schedule #3

Closed: futuristx closed this issue 7 months ago

futuristx commented 7 months ago

Why can the initial learning rate be so much higher than LLaMA2-70B's? And would such an LR decay schedule be remarkably better than a routine cosine decay schedule?
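
For concreteness, here is a minimal sketch of the two kinds of schedule I'm asking about, assuming the DeepSeek schedule is a step-wise (multi-step) decay. Every constant below (peak LR, step counts, stage boundaries, decay factors) is a placeholder for illustration, not a value taken from this repo.

```python
import math

# All numbers below are placeholders for illustration, not DeepSeek's settings.
PEAK_LR = 3e-4          # hypothetical peak learning rate
TOTAL_STEPS = 100_000   # hypothetical total training steps
WARMUP_STEPS = 2_000    # hypothetical linear warmup length

def multistep_lr(step: int) -> float:
    """Step-wise decay: hold the peak LR, then drop it at fixed points in training."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = step / TOTAL_STEPS
    if progress < 0.80:          # first stage: full LR
        return PEAK_LR
    if progress < 0.90:          # second stage: reduced LR
        return PEAK_LR * 0.316
    return PEAK_LR * 0.10        # final stage

def cosine_lr(step: int, min_ratio: float = 0.10) -> float:
    """Routine cosine decay from the peak LR down to min_ratio * peak."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * (min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress)))

# Compare the two schedules at a few points in training.
for frac in (0.10, 0.50, 0.85, 0.95):
    s = int(frac * TOTAL_STEPS)
    print(f"{frac:.0%} of training: multistep={multistep_lr(s):.2e}  cosine={cosine_lr(s):.2e}")
```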

zdaxie commented 7 months ago

Thank you for your query regarding the learning rate used in DeepSeek LLM!

We opted for a larger learning rate compared to LLaMA2, partly due to our use of an increased batch size, scaling from 4M tokens (1024*4096) to 18.8M tokens (4608*4096) per batch. This decision was also informed by the results of our smaller-scale experiments. We're looking forward to sharing more comprehensive details in our upcoming technical report, which will be released soon.
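
For a rough sense of scale only (this heuristic is not how the final value was chosen; the smaller-scale experiments mentioned above were), the common square-root scaling rule between batch size and learning rate would suggest roughly a 2x higher LR for this batch-size increase:

```python
import math

# Batch sizes in tokens, as quoted above.
llama2_batch_tokens = 1024 * 4096     # ~4.2M tokens per batch
deepseek_batch_tokens = 4608 * 4096   # ~18.9M tokens per batch

ratio = deepseek_batch_tokens / llama2_batch_tokens           # 4.5x larger batches
print(f"batch-size ratio:         {ratio:.2f}x")
print(f"sqrt-scaling LR factor:   {math.sqrt(ratio):.2f}x")   # ~2.1x under sqrt scaling
print(f"linear-scaling LR factor: {ratio:.2f}x")              # ~4.5x under linear scaling
```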

Stay tuned for more updates!