We present a theoretical explanation of the "grokking" phenomenon, where a model generalizes long after overfitting, for the originally studied problem of modular addition. First, we show that early in gradient descent, when the "kernel regime" approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that two-layer quadratic networks that achieve zero training loss with bounded $\ell_{\infty}$ norm generalize well with substantially fewer training points, and further show that such networks exist and can be found by gradient descent with small $\ell_{\infty}$ regularization. We further provide empirical evidence that these networks, as well as simple Transformers, leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to the limiting behavior of gradient descent on deep networks.
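The setting described above is concrete enough that a small experiment conveys the idea. Below is a minimal sketch, not taken from the paper: a two-layer network with quadratic activation, roughly $f(x) = ((xW)^{\odot 2})\,U$ applied to concatenated one-hot encodings of $a$ and $b$, trained to predict $(a+b) \bmod p$ while tracking train and test accuracy. The modulus, width, train fraction, and optimizer (AdamW with weight decay, a common empirical recipe for reproducing grokking, rather than the paper's gradient descent with a small $\ell_{\infty}$ penalty) are all illustrative assumptions, and whether and when delayed generalization appears depends on these choices.

```python
# Minimal grokking-style sketch on modular addition (illustrative, not the paper's setup).
import torch

torch.manual_seed(0)
p, width = 23, 128  # modulus and hidden width: illustrative choices

# All pairs (a, b) with label (a + b) mod p; inputs are concatenated one-hot vectors.
a = torch.arange(p).repeat_interleave(p)
b = torch.arange(p).repeat(p)
X = torch.cat([torch.eye(p)[a], torch.eye(p)[b]], dim=1)
y = (a + b) % p

# Hold out most pairs so generalization must come from a small training fraction.
perm = torch.randperm(p * p)
n_train = int(0.4 * p * p)
tr, te = perm[:n_train], perm[n_train:]

# Two-layer quadratic network: logits = ((x W)^2) U (elementwise square activation).
W = (0.1 * torch.randn(2 * p, width)).requires_grad_()
U = (0.1 * torch.randn(width, p)).requires_grad_()

def model(x):
    return ((x @ W) ** 2) @ U

# Weight decay here stands in for the small norm regularization discussed in the abstract.
opt = torch.optim.AdamW([W, U], lr=1e-3, weight_decay=0.5)

for step in range(20001):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(X[tr]), y[tr])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            tr_acc = (model(X[tr]).argmax(1) == y[tr]).float().mean().item()
            te_acc = (model(X[te]).argmax(1) == y[te]).float().mean().item()
        print(f"step {step:6d}  train acc {tr_acc:.2f}  test acc {te_acc:.2f}")
```

The signature to look for is a training accuracy that reaches 1.0 early while test accuracy stays low, followed (possibly much later, depending on the hyperparameters above) by a jump in test accuracy once the network moves away from its early, kernel-like solution.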