Transfer learning has become a cornerstone of computer vision with the advent
of ImageNet features, yet little work has been done to evaluate the performance
of ImageNet architectures across different datasets. An implicit hypothesis in
modern computer vision research is that models that perform better on ImageNet
necessarily perform better on other vision tasks. However, this hypothesis has
never been systematically tested. Here, we compare the performance of 13
classification models on 12 image classification tasks in three settings: as
fixed feature extractors, fine-tuned, and trained from random initialization.
We find that, when networks are used as fixed feature extractors, ImageNet
accuracy is only weakly predictive of accuracy on other tasks ($r^2 = 0.24$). In
this setting, ResNets consistently outperform networks that achieve higher
accuracy on ImageNet. When networks are fine-tuned, we observe a substantially
stronger correlation ($r^2 = 0.86$). We achieve state-of-the-art performance on
eight image classification tasks simply by fine-tuning state-of-the-art
ImageNet architectures, outperforming previous results based on specialized
methods for transfer learning. Finally, we observe that, on three small
fine-grained image classification datasets, networks trained from random
initialization perform similarly to ImageNet-pretrained networks. Together, our
results show that ImageNet architectures generalize well across datasets, with
small improvements in ImageNet accuracy producing improvements on other tasks,
but that ImageNet features are less general than previously suggested.
Simon Kornblith, Jonathon Shlens, Quoc V. Le
https://arxiv.org/abs/1805.08974
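To make the three evaluation settings concrete, the sketch below shows how they differ in PyTorch. This is an illustrative reconstruction, not the authors' code: ResNet-50, the 100-class target, and the torchvision weights API are assumptions for the example, and the paper's fixed-feature setting fits a classifier on penultimate-layer features, which freezing the backbone and training a new linear head approximates.

```python
import torch.nn as nn
import torchvision.models as models

NUM_TARGET_CLASSES = 100  # hypothetical target-dataset class count

def fixed_feature_extractor():
    """Setting 1: ImageNet weights frozen; only a new linear head is trained."""
    net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    for p in net.parameters():
        p.requires_grad = False  # freeze every pretrained parameter
    net.fc = nn.Linear(net.fc.in_features, NUM_TARGET_CLASSES)  # trainable head
    return net, net.fc.parameters()  # optimize only the new head

def fine_tuned():
    """Setting 2: initialize from ImageNet weights, then train all parameters."""
    net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    net.fc = nn.Linear(net.fc.in_features, NUM_TARGET_CLASSES)
    return net, net.parameters()  # every parameter is updated

def from_scratch():
    """Setting 3: random initialization, no ImageNet pretraining."""
    net = models.resnet50(weights=None)
    net.fc = nn.Linear(net.fc.in_features, NUM_TARGET_CLASSES)
    return net, net.parameters()
```

The quoted $r^2$ values measure how well per-model ImageNet accuracy predicts per-model transfer accuracy. A minimal version of that computation follows, with placeholder numbers rather than the paper's data; the paper's exact statistical treatment of accuracies may differ from this plain Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
imagenet_acc = rng.uniform(0.70, 0.82, size=13)  # placeholder per-model values
transfer_acc = rng.uniform(0.60, 0.90, size=13)  # placeholder per-model values

r = np.corrcoef(imagenet_acc, transfer_acc)[0, 1]  # Pearson correlation
print(f"r^2 = {r ** 2:.2f}")
```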