Muzammal-Naseer / IPViT

Official repository for "Intriguing Properties of Vision Transformers" (NeurIPS 2021, Spotlight)

The exciting work! #4

Closed ygtxr1997 closed 2 years ago

ygtxr1997 commented 3 years ago

Can you explain why Transformers show such good occlusion robustness compared to CNNs?

muzairkhattak commented 2 years ago

If you read the corresponding paper ( https://openreview.net/pdf?id=o2mbl-Hmfgd ), you will find a detailed explanation. Mainly, this is due to the dynamic receptive fields of vision transformers.

Muzammal-Naseer commented 2 years ago

Hi ygtxr1997,

Thank you for your interest in our work and apologies for the late response.

In our paper, we study ViTs based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. We study whether such flexibility (in attending to image-wide context conditioned on a given patch) offers any advantages over convolutional neural network designs, and whether it can facilitate handling nuisances in natural images, e.g., severe occlusions, domain shifts, spatial permutations, and adversarial and natural perturbations.
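As a rough illustration of the occlusion setting, random patch dropping can be sketched as follows (a minimal sketch assuming a PyTorch pipeline; the function name and defaults are illustrative, not the repo's exact protocol):

```python
import torch

def random_patch_drop(images, patch_size=16, drop_ratio=0.5, generator=None):
    """Zero out a random subset of non-overlapping patches in each image.

    A simple occlusion protocol in the spirit of the paper's patch-drop
    experiments; names and defaults here are illustrative only.
    """
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    n_patches = gh * gw
    n_drop = int(n_patches * drop_ratio)
    out = images.clone()
    for i in range(b):
        # Pick a different random subset of patches for every image.
        idx = torch.randperm(n_patches, generator=generator)[:n_drop]
        for j in idx.tolist():
            row, col = divmod(j, gw)
            out[i, :,
                row * patch_size:(row + 1) * patch_size,
                col * patch_size:(col + 1) * patch_size] = 0.0
    return out
```

The occluded batch can then be fed to any pretrained classifier to compare how accuracy degrades for a ViT versus a CNN as `drop_ratio` grows.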

A brief summary of our observations is as follows:

  1. ViTs are robust to distribution shifts such as common corruptions when trained with heavy augmentations, but this does not mean they care about the global image context or image structure.

  2. They perform well on shuffled images, indicating that there is a need to improve positional encoding.

  3. ViTs, when trained on Stylized ImageNet (without local texture), can reach human-level performance on the cue-conflict experiment designed by Robert Geirhos, but interestingly they become significantly vulnerable to common corruptions and adversarial attacks.

  4. ViT features are dynamic in nature, and we show one such example by introducing a shape token for distilling shape information. ViT features can exhibit a higher shape or texture bias depending on the token type.

  5. A single ViT can be viewed as an ensemble, and its tokens are more generalizable (e.g., when used as off-the-shelf features) than those of comparable convolutional networks.
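As a concrete illustration of point 2, randomly permuting the patch grid of an image can be sketched like this (again a minimal sketch; the function name and grid size are ours, not the repo's API):

```python
import torch

def shuffle_patches(images, grid=4, generator=None):
    """Randomly permute the spatial order of the patch grid of each image.

    Illustrative sketch of a patch-shuffle protocol: content is preserved,
    only spatial structure is destroyed.
    """
    b, c, h, w = images.shape
    ph, pw = h // grid, w // grid
    # Split into a grid of patches: (b, grid*grid, c, ph, pw).
    patches = images.reshape(b, c, grid, ph, grid, pw).permute(0, 2, 4, 1, 3, 5)
    patches = patches.reshape(b, grid * grid, c, ph, pw)
    # Apply one random permutation of patch positions to the whole batch.
    perm = torch.randperm(grid * grid, generator=generator)
    patches = patches[:, perm]
    # Reassemble the shuffled grid back into full images.
    patches = patches.reshape(b, grid, grid, c, ph, pw).permute(0, 3, 1, 4, 2, 5)
    return patches.reshape(b, c, h, w)
```

Comparing classification accuracy on such shuffled inputs shows how much a model relies on global image structure versus local patch content.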

I hope this answers your question. Please feel free to ask any other questions.

:) Muzammal