Weight decay is a broadly used technique for training state-of-the-art deep networks, including large language models. Despite its widespread usage, its role remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For overparameterized deep networks, we show how weight decay modifies the optimization dynamics, enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for underparameterized large language models trained with nearly online SGD, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization, leading to lower training loss. Moreover, we show that weight decay also prevents sudden loss divergences for bfloat16 mixed-precision training, which is a crucial tool for LLM training. Overall, we present a unifying perspective from ResNets on vision tasks to LLMs: weight decay is never useful as an explicit regularizer but instead changes the training dynamics in a desirable way. Our code is available at https://github.com/tml-epfl/why-weight-decay.
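For readers unfamiliar with the mechanism the abstract refers to, the sketch below shows the standard (coupled) form of weight decay in a plain SGD step, where a term proportional to the weights is added to the gradient. This is only an illustrative example, not the authors' implementation; the function name and the `lr`/`wd` values are placeholders.

```python
import torch

def sgd_step_with_weight_decay(params, lr=0.1, wd=5e-4):
    """One SGD step with coupled weight decay:
    w <- w - lr * (grad + wd * w), i.e. the gradient of loss + (wd/2)*||w||^2."""
    with torch.no_grad():
        for w in params:
            if w.grad is None:
                continue
            w -= lr * (w.grad + wd * w)

# Minimal usage: forward pass, backward pass, then one update step.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
sgd_step_with_weight_decay(model.parameters(), lr=0.1, wd=5e-4)
```

The same coupled form is what PyTorch applies when the `weight_decay` argument of `torch.optim.SGD` is set; the paper's point is that this term acts on the training dynamics rather than as a classical explicit regularizer.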