Thank you for your query regarding the learning rate used in DeepSeek LLM!
We opted for a larger learning rate than LLaMA2, partly because of our larger batch size, which scales from roughly 4M tokens per step (1024 * 4096) to 18.8M tokens per step (4608 * 4096). The choice was also informed by results from our smaller-scale experiments. We look forward to sharing more comprehensive details in our upcoming technical report, which will be released soon.
Stay tuned for more updates!
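
For concreteness, here is a minimal sketch of the batch-size arithmetic quoted above (the 4M and 18.8M figures are tokens per optimizer step), together with a routine cosine decay as mentioned in the question and a hypothetical multi-step decay for comparison. The step boundaries and decay factors in `multistep_lr` are illustrative assumptions, not a confirmed description of DeepSeek LLM's schedule.

```python
import math

def tokens_per_step(batch_size: int, seq_len: int) -> int:
    """Tokens processed per optimizer step = batch size * sequence length."""
    return batch_size * seq_len

print(tokens_per_step(1024, 4096))   # 4,194,304  -> the ~4M figure (LLaMA2-style setup)
print(tokens_per_step(4608, 4096))   # 18,874,368 -> the ~18.8M figure (DeepSeek LLM setup)

def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Routine cosine decay from peak_lr down to min_lr, as mentioned in the question."""
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def multistep_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Hypothetical multi-step decay: hold the peak LR, then drop it in discrete steps.
    Boundaries (80%/90%) and factors are assumptions for illustration only."""
    if step < 0.8 * total_steps:
        return peak_lr
    if step < 0.9 * total_steps:
        return peak_lr * 0.316
    return peak_lr * 0.1
```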
Why can the initial learning rate be so much higher than LLaMA2-70B's? And would such an LR decay schedule be noticeably better than a routine cosine decay schedule?