gatheluck opened this issue 3 years ago
A study investigating how the feature representations acquired by ViT differ from those acquired by CNNs, and why those differences arise.
The sheer amount of experiments is striking; this is a study that beats the question into submission with massive compute and JFT-300M.
Through multiple experiments, the authors clarify the points below: compared to CNNs, the feature representations acquired by ViT differ markedly in the following respects.
To analyze representation similarity, they use centered kernel alignment (CKA). CKA takes the representations (activation matrices) of two layers as input and outputs the similarity between them. (Figure 1 visualizes the CKA results for ViT and ResNet.)
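The paper computes CKA over minibatches of activations; as a rough sketch, the commonly used linear-CKA formula can be written in a few lines of NumPy. The function name and the (examples × features) layout are my own assumptions, and the paper's exact HSIC estimator may differ:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices.

    X: (n_examples, d1), Y: (n_examples, d2) -- activations of two layers
    on the same examples. Returns a similarity in [0, 1].
    """
    # Center each feature column; CKA is defined on centered activations.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2, normalized by each matrix's self-similarity.
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic / (norm_x * norm_y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 64))
    B = rng.normal(size=(100, 32))
    print(linear_cka(A, A))  # identical representations give CKA = 1
    print(linear_cka(A, B))  # unrelated random features score low
```

Because CKA only needs the two activation matrices, it can compare layers of different widths, which is what makes the cross-architecture ViT-vs-ResNet comparison possible.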
They also compute CKA between ViT and ResNet layers. The result shows that the lower half of the ResNet layers is similar to roughly the lowest quarter of the ViT layers.
Plotting the mean attention distance of each head shows that lower ViT layers attend both locally and globally when trained on a large-scale dataset (JFT-300M). However, when the dataset is small (e.g., ImageNet), ViT fails to learn to attend locally in the earlier layers.
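The mean attention distance of a head is the attention-weighted spatial distance between each query patch and the patches it attends to, averaged over queries. A minimal NumPy sketch, where the function name, argument layout, and pixel-coordinate convention are my assumptions rather than the paper's code:

```python
import numpy as np

def mean_attention_distance(attn, patch_grid, patch_size=16):
    """Mean spatial distance (in pixels) attended by each head.

    attn: (heads, n_patches, n_patches) attention weights for one image,
          rows (queries) summing to 1.
    patch_grid: (rows, cols) spatial layout of the patch tokens.
    Returns one distance per head.
    """
    rows, cols = patch_grid
    # Pixel coordinates of each patch's top-left corner (grid order).
    ys, xs = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1) * patch_size
    # Pairwise distances between patches: (n_patches, n_patches).
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Attention-weighted distance per query, then averaged over queries.
    return (attn * dists[None]).sum(axis=-1).mean(axis=-1)

if __name__ == "__main__":
    # A head that attends only to its own patch has distance 0 (purely local);
    # a uniform head averages over the whole grid (global).
    local = np.tile(np.eye(4)[None], (2, 1, 1))
    print(mean_attention_distance(local, (2, 2)))
```

Small per-head values in a layer indicate local, CNN-like receptive fields; large values indicate global attention, which is the quantity plotted in the paper's head-distance figures.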
The lower-layer representations of ResNet are most similar to the ViT representations corresponding to locally attending heads.
Paper link
Publication date (yyyy/mm/dd)
2021/08/19 Google Research Brain Team
Overview
TeX