When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).
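To make the two-step LP-FT strategy concrete, here is a minimal PyTorch sketch, not the paper's actual code: it assumes a torchvision ResNet-50 backbone, a placeholder number of classes, and a user-supplied `train(...)` loop; hyperparameters are illustrative only.

```python
import torch
import torch.nn as nn
import torchvision

# Hypothetical setup: ImageNet-pretrained backbone with a fresh linear head.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
num_classes = 17  # placeholder, e.g. Breeds-Living17
model.fc = nn.Linear(model.fc.in_features, num_classes)

# --- Step 1: linear probing -- freeze the backbone, train only the head ---
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
head_optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
# train(model, head_optimizer, epochs=...)  # user-supplied training loop

# --- Step 2: full fine-tuning, initialized from the probed head ---
for p in model.parameters():
    p.requires_grad = True
ft_optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
# train(model, ft_optimizer, epochs=...)    # typically with a smaller learning rate
```

The key point of the sketch is that step 2 starts from a head already aligned with the pretrained features, which the paper argues reduces feature distortion during full fine-tuning.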