If you can read the corresponding paper (
Hi ygtxr1997,
Thank you for your interest in our work and apologies for the late response.
In our paper, we study ViTs based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. We study whether such flexibility (in attending to image-wide context conditioned on a given patch) offers any advantages over convolutional neural network designs, and whether it can facilitate handling nuisances in natural images, e.g., severe occlusions, domain shifts, spatial permutations, and adversarial and natural perturbations.
A brief summary of our observations is as follows:
- ViTs are robust to distribution shifts such as common corruptions when trained with heavy augmentations, but this does not mean they rely on global image context or image structure.
- They perform well on shuffled images (see the patch-perturbation sketch after this list), indicating that there is a need to improve positional encoding.
- ViTs trained on Stylized ImageNet (which removes local texture) can reach human-level performance on the cue-conflict experiment designed by Robert Geirhos, but, interestingly, become significantly vulnerable to common corruptions and adversarial attacks.
- ViT features are dynamic in nature, and we show one such example by introducing a shape token for distilling shape information. ViT features can show a higher shape or texture bias depending on the token type.
- A single ViT can be viewed as an ensemble, and its tokens are more generalizable (e.g., when used as off-the-shelf features) than those of comparable convolutional networks.
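
To make the occlusion and shuffle evaluations above concrete, here is a rough, self-contained sketch of how such patch-level perturbations could be applied in pixel space before feeding images to a pretrained ViT. The helper functions (`to_patches`, `drop_patches`, `shuffle_patches`) and the zero-fill occlusion are illustrative assumptions, not necessarily the exact setup used in the paper or the released code:

```python
# Sketch: patch-drop (occlusion) and patch-shuffle perturbations in pixel space,
# evaluated with a pretrained timm ViT. Assumes 224x224 inputs and 16x16 patches.
import torch
import timm

def to_patches(x, p=16):
    # (B, C, H, W) -> (B, N, C, p, p) where N = (H // p) * (W // p)
    b, c, h, w = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)            # (B, C, H//p, W//p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).contiguous()     # (B, H//p, W//p, C, p, p)
    return x.view(b, -1, c, p, p)

def from_patches(patches, h=224, w=224, p=16):
    # Inverse of to_patches: (B, N, C, p, p) -> (B, C, H, W)
    b, n, c, _, _ = patches.shape
    patches = patches.view(b, h // p, w // p, c, p, p)
    patches = patches.permute(0, 3, 1, 4, 2, 5).contiguous()
    return patches.view(b, c, h, w)

def drop_patches(x, drop_ratio=0.5, p=16):
    # Randomly zero out a fraction of patches to simulate occlusion.
    patches = to_patches(x, p)
    n = patches.shape[1]
    keep = (torch.rand(x.shape[0], n, device=x.device) > drop_ratio).to(x.dtype)
    patches = patches * keep[:, :, None, None, None]
    return from_patches(patches, x.shape[2], x.shape[3], p)

def shuffle_patches(x, p=16):
    # Randomly permute patch positions, destroying global image structure.
    patches = to_patches(x, p)
    perm = torch.randperm(patches.shape[1], device=x.device)
    return from_patches(patches[:, perm], x.shape[2], x.shape[3], p)

# Usage: compare predictions on clean vs. perturbed inputs.
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
x = torch.randn(2, 3, 224, 224)  # stand-in for a batch of normalized images
with torch.no_grad():
    clean = model(x).argmax(-1)
    occluded = model(drop_patches(x, 0.5)).argmax(-1)
    shuffled = model(shuffle_patches(x)).argmax(-1)
print(clean, occluded, shuffled)
```

Comparing top-1 accuracy on clean versus occluded/shuffled inputs over a validation set gives the kind of robustness comparison between ViTs and CNNs discussed above.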
I hope this answers your question. Please feel free to ask any other questions.
:) Muzammal
Can you explain why Transformers show such good occlusion robustness compared to CNNs?