Question about data concatenation in the SFT stage

deepseek-ai / DeepSeek-Math

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

MIT License


Open SymbolZH opened 3 months ago

SymbolZH commented 3 months ago

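Hello, Section 3.2 of the paper mentions that the training data is randomly concatenated up to a length of 4K tokens. Does this mean the SFT data is concatenated in the form (q0, a0, q1, a1, ...) and the loss is computed only on the answer parts? Many thanks to the team for your work~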

qianxianyang commented 2 months ago

This should be similar to pretraining: each sample in the batch is concatenated up to the context length, which improves training efficiency.
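To make the discussion concrete, here is a minimal sketch of one way such packing could be done. It is not taken from the DeepSeekMath codebase. The function name `pack_sft_examples`, the greedy packing strategy, the truncation of over-long pairs, and the absence of separator tokens are all assumptions made for illustration. In particular, masking the question tokens so that only answer tokens contribute to the loss is the interpretation raised in the question; neither the paper excerpt nor the reply above confirms it. The value -100 is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
import random
from typing import List, Sequence, Tuple

# Label value that marks a position as excluded from the loss.
# -100 matches the default ignore_index of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100

Pair = Tuple[Sequence[int], Sequence[int]]  # (question token ids, answer token ids)


def pack_sft_examples(
    examples: Sequence[Pair], max_len: int = 4096, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Hypothetical helper: shuffle (q, a) pairs and greedily pack them
    into sequences of at most max_len tokens.

    Returns a list of (input_ids, labels). ASSUMPTION: labels are
    IGNORE_INDEX on question tokens, so only answer tokens are trained on.
    """
    rng = random.Random(seed)
    order = list(examples)
    rng.shuffle(order)  # "randomly concatenated"

    packed: List[Tuple[List[int], List[int]]] = []
    cur_ids: List[int] = []
    cur_labels: List[int] = []

    for question, answer in order:
        ids = list(question) + list(answer)
        labels = [IGNORE_INDEX] * len(question) + list(answer)

        # ASSUMPTION: a single pair longer than max_len is truncated.
        ids, labels = ids[:max_len], labels[:max_len]

        # Start a new packed sequence when this pair would not fit.
        if cur_ids and len(cur_ids) + len(ids) > max_len:
            packed.append((cur_ids, cur_labels))
            cur_ids, cur_labels = [], []

        cur_ids.extend(ids)
        cur_labels.extend(labels)

    if cur_ids:
        packed.append((cur_ids, cur_labels))
    return packed


if __name__ == "__main__":
    toy = [([1, 2], [3]), ([4], [5, 6]), ([7, 8, 9], [10, 11])]
    for input_ids, labels in pack_sft_examples(toy, max_len=8):
        print(input_ids, labels)
```

If the loss were instead computed on every token, as in pretraining, `labels` would simply be a copy of `input_ids`. A real pipeline would also need to decide whether to insert an end-of-sequence token between pairs and whether attention may cross pair boundaries; this sketch leaves both out.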