AkihikoWatanabe commented 2 weeks ago

URL

https://arxiv.org/abs/2410.21228v1
Authors
- Reece Shuttleworth
- Jacob Andreas
- Antonio Torralba
- Pratyusha Sharma
Abstract
- Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to match the performance of fully fine-tuned models on various tasks with an extreme reduction in the number of trainable parameters. Even in settings where both methods learn similarly accurate models, are their learned solutions really equivalent? We study how different fine-tuning methods change pre-trained models by analyzing the model's weight matrices through the lens of their spectral properties. We find that full fine-tuning and LoRA yield weight matrices whose singular value decompositions exhibit very different structure; moreover, the fine-tuned models themselves show distinct generalization behaviors when tested outside the adaptation task's distribution. More specifically, we first show that the weight matrices trained with LoRA have new, high-ranking singular vectors, which we call intruder dimensions. Intruder dimensions do not appear during full fine-tuning. Second, we show that LoRA models with intruder dimensions, despite achieving similar performance to full fine-tuning on the target task, become worse models of the pre-training distribution and adapt less robustly to multiple tasks sequentially. Higher-rank, rank-stabilized LoRA models closely mirror full fine-tuning, even when performing on par with lower-rank LoRA models on the same tasks. These results suggest that models updated with LoRA and full fine-tuning access different parts of parameter space, even when they perform equally on the fine-tuned distribution. We conclude by examining why intruder dimensions appear in LoRA fine-tuned models, why they are undesirable, and how their effects can be minimized.
Summary (by gpt-4o-mini)
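Analyzes how the choice of fine-tuning method affects a pre-trained model, through the spectral properties of its weight matrices. LoRA and full fine-tuning produce weight matrices with different structure, and LoRA models turn out to have new, high-ranking singular vectors (intruder dimensions). Intruder dimensions degrade generalization, even though the model can still reach equivalent performance on the target task. This suggests that different fine-tuning methods access different parts of parameter space.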
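
To make the idea of an intruder dimension concrete, here is a minimal NumPy sketch. It assumes one plausible reading of the abstract: a top singular vector of the fine-tuned matrix counts as an intruder if its cosine similarity with every singular vector of the pre-trained matrix stays below a threshold. The function name `count_intruder_dimensions`, the values of `top_k` and `eps`, and the toy matrices are all illustrative assumptions, not the paper's code or settings.

```python
import numpy as np


def count_intruder_dimensions(w_pre, w_tuned, top_k=3, eps=0.6):
    """Count how many of the top_k left singular vectors of w_tuned have
    cosine similarity below eps with every left singular vector of w_pre.

    top_k and eps are illustrative choices, not values taken from the paper.
    """
    u_pre, _, _ = np.linalg.svd(w_pre, full_matrices=False)
    u_tuned, _, _ = np.linalg.svd(w_tuned, full_matrices=False)
    # Singular vectors are unit-norm, so dot products are cosine similarities.
    # abs() ignores the arbitrary sign of each singular vector.
    sims = np.abs(u_pre.T @ u_tuned[:, :top_k])
    best_match = sims.max(axis=0)  # best pre-trained match per tuned vector
    return int(np.sum(best_match < eps))


# Toy "pre-trained" matrix with a known, well-separated spectrum.
rng = np.random.default_rng(0)
n = 8
u, _ = np.linalg.qr(rng.standard_normal((n, n)))
v, _ = np.linalg.qr(rng.standard_normal((n, n)))
w_pre = u @ np.diag(np.linspace(10.0, 1.0, n)) @ v.T

# Rank-1 update with a large norm along a direction that mixes all
# pre-trained singular vectors equally (a stand-in for a low-rank adapter).
a = u @ np.ones(n) / np.sqrt(n)
b = v @ np.ones(n) / np.sqrt(n)
w_low_rank = w_pre + 20.0 * np.outer(a, b)

# Small dense perturbation (a stand-in for a gentle full-rank update).
w_dense = w_pre + 0.01 * rng.standard_normal((n, n))

print(count_intruder_dimensions(w_pre, w_low_rank))
print(count_intruder_dimensions(w_pre, w_dense))
```

In this toy setup the large rank-1 update creates a new dominant direction that matches no single pre-trained singular vector, while the small dense perturbation leaves the top singular vectors almost unchanged. Real checkpoints would need this run per weight matrix, and the result depends heavily on the chosen threshold.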

AkihikoWatanabe commented 2 weeks ago

Original post: https://x.com/aratako_lm/status/1854838012909166973?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q

AkihikoWatanabe commented 2 weeks ago

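I feel that LoRA's behavior needs to be considered in light of the findings from both #1423 and #1475. Each compares LoRA and full fine-tuning (FFT) on different datasets and models. I don't have time right now, but I'd like to do this later.

Also, experimental setups these days have so many variables, and therefore so many possible configurations, that I think it is better not to take the findings of any single paper at face value and generalize from them.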

AkihikoWatanabe / paper_notes

LoRA vs Full Fine-tuning: An Illusion of Equivalence, Reece Shuttleworth+, arXiv'24 #1492

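Differences in experimental settings

- Model architecture
  - #1423: transformer-decoder
  - #1475: transformer-decoder (LLaMA)
- Parameter size
  - #1423: 1B, 2B, 4B, 8B, 16B
  - #1475: 7B
- Number of tasks in the fine-tuning dataset
- Amount of data per task
- Number of trainable parameters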
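
As a rough aid for the last axis, the sketch below compares trainable parameter counts for a single weight matrix. It uses the standard LoRA parameterization, in which a `d_out x d_in` weight gets two factors of shapes `d_out x rank` and `rank x d_in`. The layer size and rank are hypothetical and do not come from #1423, #1475, or this paper.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_trainable_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole d_out x d_in matrix is fine-tuned."""
    return d_out * d_in


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8
lora = lora_trainable_params(d_out, d_in, rank)
full = full_trainable_params(d_out, d_in)
print(lora, full, lora / full)
```

Summing these counts over the adapted matrices gives a per-model figure, which is one way to put papers that use different ranks and model sizes on the same axis.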