Paper Section 2 (pre-training), Section 2.2: why is every dataset, regardless of size, trained for as many as 150B tokens?

deepseek-ai / DeepSeek-Math

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

MIT License


Paper Section 2 (pre-training), Section 2.2: why is every dataset, regardless of size, trained for as many as 150B tokens? #24

Open yucc-leon opened 5 months ago

yucc-leon commented 5 months ago

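The Math model uses a dataset of 120B tokens. The datasets it is compared against are, respectively:

8.9B tokens
13.6B tokens
13.6B×4 + 10.3B×1 + 28.0B×2 ≈ 120B tokens

(Please correct me if any of the figures above are wrong.) This means the smallest dataset may need to be trained for close to 20 epochs, which is fairly likely to cause overfitting and therefore degrade performance. Generally speaking, wouldn't a fairer comparison use a smaller token budget, for example the size of the smallest dataset or less, and downsample any dataset that exceeds that threshold?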
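For reference, here is a rough back-of-the-envelope check of the epoch counts implied by the figures above. It is only a sketch: it assumes each corpus is simply repeated until the 150B-token budget from the title is used up, and the labels `corpus_a`, `corpus_b`, `corpus_c_mixture`, and `math_corpus` are placeholders of mine rather than names from the paper.

```python
# Rough epoch counts implied by a fixed 150B-token training budget.
# Sizes are the figures quoted in this issue, in billions of tokens.
# The corpus labels are hypothetical placeholders, not names from the paper.
BUDGET_B = 150.0

corpus_sizes_b = {
    "corpus_a": 8.9,
    "corpus_b": 13.6,
    # Weighted mixture as written above: 13.6B x4 + 10.3B x1 + 28.0B x2
    "corpus_c_mixture": 13.6 * 4 + 10.3 * 1 + 28.0 * 2,
    "math_corpus": 120.0,
}


def epochs(budget_b: float, size_b: float) -> float:
    """Passes over a corpus of size_b needed to consume budget_b tokens."""
    return budget_b / size_b


epoch_counts = {
    name: epochs(BUDGET_B, size) for name, size in corpus_sizes_b.items()
}

for name, n in epoch_counts.items():
    print(f"{name}: {corpus_sizes_b[name]:.1f}B tokens -> {n:.1f} epochs")
```

Under these assumptions the 8.9B-token corpus comes out at about 16.9 passes and the 13.6B-token corpus at about 11, while the two roughly 120B-token sets are seen only a little more than once. That gap is what the question is about.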

I'd like to ask what considerations this experimental setup was based on.