Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $\textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of $18\%$ EM score on the QA task of SQuAD, $8\%$ on CommonSenseQA and $1\%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.
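The following is a minimal sketch, not the authors' implementation, of the pause-token inference scheme described above: a sequence of copies of a single learnable $\textit{pause}$ token is appended to the input prefix, and the next-token distribution is read off at the position of the last pause token rather than the last prompt token. The toy model, vocabulary size, reserved `PAUSE_ID`, and `NUM_PAUSES` are all hypothetical choices for illustration.

```python
# Sketch of delayed next-token prediction with appended <pause> tokens.
# Assumptions: a toy decoder-only LM, a reserved id for <pause>, and 10 pauses
# (matching the abstract's "K + 10 hidden vectors" example).

import torch
import torch.nn as nn

VOCAB_SIZE = 1000          # hypothetical vocabulary size
PAUSE_ID = VOCAB_SIZE - 1  # reserve one id for the learnable <pause> token
NUM_PAUSES = 10            # number of pause tokens appended to the prefix


class TinyCausalLM(nn.Module):
    """Stand-in decoder-only LM: embedding -> causal self-attention -> LM head."""

    def __init__(self, vocab_size: int, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # The <pause> embedding is an ordinary row of this table, learned like any other token.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        seq_len = ids.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.blocks(self.embed(ids), mask=causal_mask)
        return self.lm_head(h)  # (batch, seq_len, vocab)


def next_token_with_pauses(model: nn.Module, prompt_ids: torch.Tensor,
                           num_pauses: int = NUM_PAUSES) -> torch.Tensor:
    """Append <pause> tokens and read the prediction at the last pause position."""
    pauses = torch.full((prompt_ids.size(0), num_pauses), PAUSE_ID, dtype=torch.long)
    ids = torch.cat([prompt_ids, pauses], dim=1)   # K prompt tokens + num_pauses pauses
    logits = model(ids)
    # Output extraction is delayed until the last <pause> token, so the model
    # manipulates K + num_pauses hidden vectors per layer before committing.
    return logits[:, -1, :].argmax(dim=-1)


if __name__ == "__main__":
    model = TinyCausalLM(VOCAB_SIZE)
    prompt = torch.randint(0, VOCAB_SIZE - 1, (1, 12))  # K = 12 prompt tokens
    print(next_token_with_pauses(model, prompt))
```

During pause-training, the same appended-pause inputs are used at pretraining and finetuning time, with the loss taken only on the real target tokens, which is what lets the model learn to exploit the extra computation.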