AkihikoWatanabe commented 4 weeks ago

URL

https://arxiv.org/abs/2012.13255
Affiliations
- Armen Aghajanyan, N/A
- Luke Zettlemoyer, N/A
- Sonal Gupta, N/A
Abstract
- Although pretrained language models can be fine-tuned to produce state-of-the-art results for a very wide range of language understanding tasks, the dynamics of this process are not well understood, especially in the low data regime. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples? In this paper, we argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions to explain this remarkable phenomenon. We empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space. For example, by optimizing only 200 trainable parameters randomly projected back into the full space, we can tune a RoBERTa model to achieve 90% of the full parameter performance levels on MRPC. Furthermore, we empirically show that pre-training implicitly minimizes intrinsic dimension and, perhaps surprisingly, larger models tend to have lower intrinsic dimension after a fixed number of pre-training updates, at least in part explaining their extreme effectiveness. Lastly, we connect intrinsic dimensionality with low dimensional task representations and compression based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count.
Summary (by gpt-4o-mini)
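Analyzes the dynamics of fine-tuning pretrained language models through the lens of intrinsic dimension and explains why they can be tuned effectively even with little data. Shows that common models have a low intrinsic dimension, i.e., that a low-dimensional reparameterization exists that is as effective as the full parameter space. In particular, demonstrates with RoBERTa that optimizing only a small number of parameters achieves high performance. Also shows that pre-training minimizes intrinsic dimension and that larger models tend to have lower intrinsic dimension, and proposes generalization bounds based on intrinsic dimension.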
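
To make the "low-dimensional reparameterization" in the abstract concrete, here is a minimal sketch of the idea: keep the pretrained weights `theta0` fixed, draw a fixed random projection `P`, and train only a `d`-dimensional vector `z` so that `theta = theta0 + P @ z`. Everything here is an assumption for illustration: the function name `finetune_in_subspace`, the dense Gaussian `P` (not necessarily the projection used in the paper), and the toy quadratic loss that constrains only `k` directions of a `D`-dimensional parameter vector.

```python
import numpy as np

def finetune_in_subspace(theta0, grad_fn, d, steps=1000, lr=0.01, seed=0):
    """Optimize only d parameters z, with theta = theta0 + P @ z (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    D = theta0.size
    # Fixed random projection from the d-dim subspace into the full D-dim space.
    # Dense Gaussian is an assumption chosen for simplicity.
    P = rng.standard_normal((D, d)) / np.sqrt(D)
    z = np.zeros(d)  # the only trainable parameters
    for _ in range(steps):
        theta = theta0 + P @ z
        z -= lr * (P.T @ grad_fn(theta))  # chain rule: dL/dz = P^T dL/dtheta
    return theta0 + P @ z

# Toy "task": the loss depends on only k directions of the D-dimensional weights.
rng = np.random.default_rng(42)
D, k = 1000, 5
A = rng.standard_normal((k, D))
b = rng.standard_normal(k)
theta0 = rng.standard_normal(D)  # stand-in for pretrained weights

def loss(theta):
    r = A @ theta - b
    return 0.5 * float(r @ r)

def grad(theta):
    return A.T @ (A @ theta - b)

initial_loss = loss(theta0)
loss_d20 = loss(finetune_in_subspace(theta0, grad, d=20))  # d >= k
loss_d2 = loss(finetune_in_subspace(theta0, grad, d=2))    # d < k
print(f"initial={initial_loss:.3g}  d=20: {loss_d20:.3g}  d=2: {loss_d2:.3g}")
```

In this toy setting, 20 trainable parameters out of 1000 are enough to drive the loss close to zero because the task only constrains 5 directions, while `d=2` cannot. The paper's intrinsic dimension is measured analogously, as the smallest `d` that reaches 90% of full fine-tuning performance.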

AkihikoWatanabe commented 4 weeks ago

ACL version: https://aclanthology.org/2021.acl-long.568.pdf

AkihikoWatanabe commented 4 weeks ago

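Skimmed the paper after reading the original post linked below. The larger the model, the fewer parameters are needed to reach a given performance level through fine-tuning (in the paper, 90% sentence prediction performance on two datasets).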

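The original post also mentions the relationship to LoRA; I will check this later against the paper itself. My guess is that, because LLMs have far more parameters than models like BERT, the number of parameters needed for fine-tuning becomes even smaller, which would explain why simply adding a small number of trainable parameters, as LoRA does, works well. Interesting.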
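
For reference, a minimal sketch of the LoRA-style reparameterization mentioned above: the pretrained weight `W` stays frozen and a trainable low-rank product `B @ A` is added to it. The layer size (768 x 768), the rank `r = 8`, and the helper name `lora_forward` are assumptions for illustration, and the scaling factor is simplified to a plain `scale` argument.

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Linear layer with a frozen weight W plus a low-rank update B @ A (hypothetical helper)."""
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_out, d_in, r = 768, 768, 8             # assumed sizes, for illustration only

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init so the update starts at 0

full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params, lora_params / full_params)

x = rng.standard_normal((4, d_in))
y = lora_forward(x, W, A, B)
# With B initialized to zero, the output matches the frozen layer exactly.
print(np.allclose(y, x @ W.T))
```

With these assumed sizes, the low-rank update trains 12,288 parameters instead of 589,824 for the full matrix, which is in the same spirit as the small intrinsic dimensions reported in the paper.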

Original post: https://x.com/bilzrd/status/1840445027438456838?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q

AkihikoWatanabe / paper_notes

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning, Armen Aghajanyan+, N/A, ACL'21 #1439
