gatheluck / PaperReading

Notes about papers (in Japanese)

[2021] Do Vision Transformers See Like Convolutional Neural Networks? #248

Open gatheluck opened 3 years ago

gatheluck commented 3 years ago

Paper link

https://arxiv.org/abs/2108.08810

Publication date (yyyy/mm/dd)

2021/08/19, Google Research, Brain Team

Overview

TeX

% 2021/08/19
@article{
    raghu2021do,
    title={Do Vision Transformers See Like Convolutional Neural Networks?},
    author={Maithra Raghu and Thomas Unterthiner and Simon Kornblith and Chiyuan Zhang and Alexey Dosovitskiy},
    journal={arXiv preprint arXiv:2108.08810},
    year={2021}
}
gatheluck commented 3 years ago

1. What is it?

A study investigating how the feature representations acquired by ViTs differ from those acquired by CNNs, and why those differences arise.

gatheluck commented 3 years ago

2. What is the advantage over previous studies?

The sheer volume of experiments. This is a study in the style of overwhelming the question with raw force, throwing massive computational resources and the JFT-300M dataset at it.

gatheluck commented 3 years ago

3. What is the key to the techniques and methods?

Clarifying the following points through multiple experiments: compared with CNNs, the feature representations acquired by ViT differ markedly in the respects detailed in the comments below (the layer-wise similarity structure as measured by CKA, and the degree to which lower layers attend locally vs. globally, which depends on the scale of the training data).

gatheluck commented 3 years ago

4. How did they verify that it works?

To analyze representation similarity, they use centered kernel alignment (CKA). CKA takes the representations (activation matrices) of two layers as input and outputs a similarity score between them. (Figure 1 visualizes the CKA results for ViT and for ResNet.)

[Figure 1: CKA similarity between all pairs of layers, within ViT and within ResNet]
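For reference, a minimal NumPy sketch of linear CKA (function and variable names are mine; note the paper actually uses a memory-efficient minibatch estimator, while this is the plain full-batch linear form):

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two layers' activations.

    x: (n_examples, d1) activation matrix of one layer
    y: (n_examples, d2) activation matrix of another layer
    Returns a similarity score in [0, 1].
    """
    # Center each feature dimension (CKA assumes centered representations).
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Linear-kernel CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(y.T @ x, ord="fro") ** 2
    den = np.linalg.norm(x.T @ x, ord="fro") * np.linalg.norm(y.T @ y, ord="fro")
    return num / den
```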

They also compute CKA between ViT and ResNet layers. The result shows that the lower half of the ResNet layers is similar to roughly the lowest quarter of the ViT layers.

[Figure: CKA computed across models, between ViT layers and ResNet layers]
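A hedged sketch of how such a cross-model heatmap could be assembled, reusing `linear_cka` from the sketch above; `vit_acts` and `resnet_acts` are hypothetical lists of per-layer activation matrices collected over the same inputs:

```python
import numpy as np

def cka_heatmap(acts_a, acts_b):
    """Pairwise linear CKA between each layer of model A and model B.

    acts_a, acts_b: lists of (n_examples, features) activation matrices,
    one per layer, computed on the same set of inputs.
    """
    heatmap = np.zeros((len(acts_a), len(acts_b)))
    for i, xa in enumerate(acts_a):
        for j, xb in enumerate(acts_b):
            heatmap[i, j] = linear_cka(xa, xb)  # defined in the sketch above
    return heatmap

# e.g. cka_heatmap(vit_acts, resnet_acts) yields the kind of heatmap shown
# above: rows index ViT layers, columns index ResNet layers.
```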
gatheluck commented 3 years ago

Plotting the mean distance of each attention head shows that lower ViT layers attend both locally and globally when the model is trained on a huge dataset (JFT-300M).

[Figure: mean attention distance per head, by layer, for ViT trained on JFT-300M]

However, when the training dataset is small (e.g., ImageNet only), ViT does not learn to attend locally in its earlier layers.

[Figure: mean attention distance per head, by layer, for ViT trained on ImageNet]
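To make "mean attention distance" concrete: for one head, each query token's attention weights are used to average the spatial distance to every key token, and the result is then averaged over queries. A minimal sketch under assumed conventions (CLS token removed, patch grid flattened row-major; names are mine, not the paper's code):

```python
import numpy as np

def mean_attention_distance(attn, grid_h, grid_w, patch_size=16):
    """Attention-weighted mean spatial distance for one attention head.

    attn: (n_tokens, n_tokens) attention weights over the patch tokens
          (CLS token removed); each row sums to 1.
    grid_h, grid_w: patch grid dimensions, n_tokens == grid_h * grid_w.
    patch_size: edge length of a patch in pixels, so the result is in pixels.
    """
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)

    # Pairwise Euclidean distances between patch positions, in pixels.
    dists = patch_size * np.linalg.norm(
        coords[:, None, :] - coords[None, :, :], axis=-1
    )

    # For each query token, average the distance to the key tokens it
    # attends to, weighted by attention; then average over query tokens.
    return float((attn * dists).sum(axis=1).mean())
```

A small distance means the head mostly attends to nearby patches (local, convolution-like behavior); a large distance means it attends globally.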
gatheluck commented 3 years ago

The lower-layer representations of ResNet are most similar to the ViT representations that correspond to local attention heads.

[Figure: similarity of ResNet lower-layer representations to ViT representations, by attention head locality]