When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).
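To make the two-step LP-FT strategy concrete, here is a minimal PyTorch sketch, not the paper's actual code: it assumes a torchvision ResNet-50 backbone, a placeholder number of classes, and a user-supplied `train(...)` loop; hyperparameters are illustrative only.

```python
import torch
import torch.nn as nn
import torchvision

# Hypothetical setup: ImageNet-pretrained backbone with a fresh linear head.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
num_classes = 17  # placeholder, e.g. Breeds-Living17
model.fc = nn.Linear(model.fc.in_features, num_classes)

# --- Step 1: linear probing -- freeze the backbone, train only the head ---
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
head_optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
# train(model, head_optimizer, epochs=...)  # user-supplied training loop

# --- Step 2: full fine-tuning, initialized from the probed head ---
for p in model.parameters():
    p.requires_grad = True
ft_optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
# train(model, ft_optimizer, epochs=...)    # typically with a smaller learning rate
```

The key point of the sketch is that step 2 starts from a head already aligned with the pretrained features, which the paper argues reduces feature distortion during full fine-tuning.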