Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $\textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of $18\%$ EM score on the QA task of SQuAD, $8\%$ on CommonSenseQA and $1\%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.
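The following is a minimal sketch, not the authors' implementation, of the pause-token inference scheme described above: a sequence of copies of a single learnable $\textit{pause}$ token is appended to the input prefix, and the next-token distribution is read off at the position of the last pause token rather than the last prompt token. The toy model, vocabulary size, reserved `PAUSE_ID`, and `NUM_PAUSES` are all hypothetical choices for illustration.

```python
# Sketch of delayed next-token prediction with appended <pause> tokens.
# Assumptions: a toy decoder-only LM, a reserved id for <pause>, and 10 pauses
# (matching the abstract's "K + 10 hidden vectors" example).

import torch
import torch.nn as nn

VOCAB_SIZE = 1000          # hypothetical vocabulary size
PAUSE_ID = VOCAB_SIZE - 1  # reserve one id for the learnable <pause> token
NUM_PAUSES = 10            # number of pause tokens appended to the prefix


class TinyCausalLM(nn.Module):
    """Stand-in decoder-only LM: embedding -> causal self-attention -> LM head."""

    def __init__(self, vocab_size: int, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # The <pause> embedding is an ordinary row of this table, learned like any other token.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        seq_len = ids.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.blocks(self.embed(ids), mask=causal_mask)
        return self.lm_head(h)  # (batch, seq_len, vocab)


def next_token_with_pauses(model: nn.Module, prompt_ids: torch.Tensor,
                           num_pauses: int = NUM_PAUSES) -> torch.Tensor:
    """Append <pause> tokens and read the prediction at the last pause position."""
    pauses = torch.full((prompt_ids.size(0), num_pauses), PAUSE_ID, dtype=torch.long)
    ids = torch.cat([prompt_ids, pauses], dim=1)   # K prompt tokens + num_pauses pauses
    logits = model(ids)
    # Output extraction is delayed until the last <pause> token, so the model
    # manipulates K + num_pauses hidden vectors per layer before committing.
    return logits[:, -1, :].argmax(dim=-1)


if __name__ == "__main__":
    model = TinyCausalLM(VOCAB_SIZE)
    prompt = torch.randint(0, VOCAB_SIZE - 1, (1, 12))  # K = 12 prompt tokens
    print(next_token_with_pauses(model, prompt))
```

During pause-training, the same appended-pause inputs are used at pretraining and finetuning time, with the loss taken only on the real target tokens, which is what lets the model learn to exploit the extra computation.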