Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction objective for rapidly eliciting arithmetic capabilities.
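To make the abstract's distinction between "conventional" and "chain-of-thought style" training data concrete, here is a minimal illustrative sketch in Python. It contrasts a plain question-answer sample with a sample that exposes per-digit intermediate results (digit sums and carries). The exact formats used in the paper may differ; the function names and the particular scratchpad layout here are assumptions for illustration only.

```python
import random


def plain_sample(a: int, b: int) -> str:
    # Conventional formatting: question and answer only.
    return f"{a}+{b}={a + b}"


def chain_of_thought_sample(a: int, b: int) -> str:
    # Hypothetical chain-of-thought formatting: spell out per-digit
    # intermediate results (digit-wise sums with carries) before the answer,
    # processing digits from least to most significant.
    steps = []
    carry = 0
    for da, db in zip(reversed(str(a).zfill(3)), reversed(str(b).zfill(3))):
        s = int(da) + int(db) + carry
        steps.append(f"{da}+{db}+{carry}={s}")
        carry = s // 10
    return f"{a}+{b}: " + " , ".join(steps) + f" => {a + b}"


if __name__ == "__main__":
    random.seed(0)
    a, b = random.randint(0, 999), random.randint(0, 999)
    print(plain_sample(a, b))             # e.g. "864+394=1258"
    print(chain_of_thought_sample(a, b))  # same problem with intermediate steps
```

The second format gives the next-token predictor short, locally computable targets at every step, which is the kind of "instructive" data the abstract argues improves accuracy, sample complexity, and convergence speed.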