Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction objective for rapidly eliciting arithmetic capabilities.
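To make the abstract's distinction between "conventional" and "chain-of-thought style" training data concrete, here is a minimal illustrative sketch in Python. It contrasts a plain question-answer sample with a sample that exposes per-digit intermediate results (digit sums and carries). The exact formats used in the paper may differ; the function names and the particular scratchpad layout here are assumptions for illustration only.

```python
import random


def plain_sample(a: int, b: int) -> str:
    # Conventional formatting: question and answer only.
    return f"{a}+{b}={a + b}"


def chain_of_thought_sample(a: int, b: int) -> str:
    # Hypothetical chain-of-thought formatting: spell out per-digit
    # intermediate results (digit-wise sums with carries) before the answer,
    # processing digits from least to most significant.
    steps = []
    carry = 0
    for da, db in zip(reversed(str(a).zfill(3)), reversed(str(b).zfill(3))):
        s = int(da) + int(db) + carry
        steps.append(f"{da}+{db}+{carry}={s}")
        carry = s // 10
    return f"{a}+{b}: " + " , ".join(steps) + f" => {a + b}"


if __name__ == "__main__":
    random.seed(0)
    a, b = random.randint(0, 999), random.randint(0, 999)
    print(plain_sample(a, b))             # e.g. "864+394=1258"
    print(chain_of_thought_sample(a, b))  # same problem with intermediate steps
```

The second format gives the next-token predictor short, locally computable targets at every step, which is the kind of "instructive" data the abstract argues improves accuracy, sample complexity, and convergence speed.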