Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
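The abstract describes a data-augmentation pipeline: a synthesizer generates instruction-response pairs from each raw document, and the document plus its pairs are packed into a single pre-training example. Below is a minimal sketch of that idea; the `synthesize_pairs` function and the exact packing format are assumptions for illustration, not the paper's released implementation.

```python
from typing import List, Tuple

def synthesize_pairs(raw_text: str) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for the instruction synthesizer.
    In the paper this role is played by a model built on open-source LMs;
    here we return a canned pair purely to show the data flow."""
    return [
        ("What is the main topic of the passage?",
         "A placeholder answer grounded in the raw document."),
    ]

def augment_document(raw_text: str) -> str:
    """Concatenate a raw document with its synthesized instruction-response
    pairs to form one instruction-augmented pre-training example."""
    pairs = synthesize_pairs(raw_text)
    qa_block = "\n\n".join(f"Instruction: {q}\nResponse: {a}" for q, a in pairs)
    return f"{raw_text}\n\n{qa_block}"

# The augmented texts are then tokenized and trained on with the ordinary
# next-token-prediction objective, just like plain pre-training data.
corpus = ["<raw web or domain-specific document>"]
augmented_corpus = [augment_document(doc) for doc in corpus]
```

Under these assumptions, the only change relative to standard pre-training is the corpus itself: the model still sees a stream of text, but each example now carries supervised task signal alongside the raw document.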