
Intriguing Properties of Vision Transformers (NeurIPS 2021, Spotlight)

Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Khan, & Ming-Hsuan Yang

Paper (arxiv), Reviews & Response, Video Presentation, Poster

Abstract: Vision transformers (ViT) have demonstrated impressive performance across various machine vision tasks. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility (in attending image-wide context conditioned on a given patch) can facilitate handling nuisances in natural images e.g., severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and provide comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViT: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., retain as high as 60% top-1 accuracy on ImageNet even after randomly occluding 80% of the image content. (b) The robust performance to occlusions is not due to a bias towards local textures, and ViTs are significantly less biased towards textures compared to CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representation leads to an interesting consequence of accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms. We show effective features of ViTs are due to flexible and dynamic receptive fields possible via self-attention mechanisms. Our code will be publicly released.


Contents

1) Shape Biased Models
2) Off the Shelf Features
3) Global Image Context
4) Varying Patch Sizes and Positional Encoding
5) References
6) Citation

Requirements

The code requires Python 3.8 because PyTorch 1.7 doesn't support Python versions above 3.8.

After verifying that you are using Python 3.8, install the dependencies with:

pip install -r requirements.txt

Shape Biased Models

Our shape-biased pretrained models can be downloaded from here. We summarise the performance of each model below.

Model | Jaccard Index | Pretrained
DeiT-T | 32.2 | baseline
DeiT-T-SIN | 29.4 | Link
DeiT-T-SIN (distilled) | 40.0 | Link
DeiT-S | 29.2 | baseline
DeiT-S-SIN | 37.5 | Link
DeiT-S-SIN (distilled) | 42.0 | Link
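
For reference, the Jaccard index reported above is the intersection-over-union between a predicted foreground mask and the ground-truth mask. A minimal NumPy sketch of this metric (the function name and mask inputs are illustrative, not the repository's API):

import numpy as np

def jaccard_index(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union between two binary foreground masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention: if both masks are empty, count the overlap as perfect.
    return 1.0 if union == 0 else intersection / union

# Toy example: two 4x4 masks sharing 3 of their foreground pixels.
pred = np.zeros((4, 4), dtype=bool); pred[0, :4] = True      # 4 foreground pixels
gt = np.zeros((4, 4), dtype=bool); gt[0, 1:4] = True; gt[1, 1:3] = True  # 5 pixels
print(jaccard_index(pred, gt))  # 3 / (4 + 5 - 3) = 0.5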

Code for evaluating shape bias via automatic segmentation on the PASCAL VOC dataset can be found under scripts. Place the VOC devkit folder under data/voc, or adjust the paths in the scripts accordingly. To set up the dataset:

  1. Go to the PASCAL VOC website.
  2. Follow the "development kit" link (the link location may change without notice).
  3. Download the "training/validation data" (a large tar file).
  4. Extract the tar file and copy its entire directory structure under the data directory.

It should look like this:

data
├── voc
│   ├── VOCdevkit
│       ├── VOC2012
│           ├── ... the VOC directories

Direct Implementation

Running segmentation evaluation on models (pre-trained models will be auto downloaded):

./scripts/eval_segmentation.sh

Visualizing segmentation for images in a given folder (pre-trained models will be auto downloaded):

./scripts/visualize_segmentation.sh

Additional Details

For shape-distilled models, we compare the shape bias of the class token and the shape (distillation) token. As explained in scripts/eval_segmentation.sh, these evaluations can be carried out as follows.

Evaluate pre-trained model using the class token (default mode):

python evaluate_segmentation.py \
  --model_name "dino_small_dist" \
  --pretrained_weights "https://github.com/Muzammal-Naseer/Intriguing-Properties-of-Vision-Transformers/releases/download/v0/deit_s_sin_dist.pth" \
  --batch_size 256 \
  --patch_size 16 \
  --threshold 0.9

Evaluate pre-trained model using the shape (distillation) token:

python evaluate_segmentation.py \
  --model_name "dino_small_dist" \
  --pretrained_weights "https://github.com/Muzammal-Naseer/Intriguing-Properties-of-Vision-Transformers/releases/download/v0/deit_s_sin_dist.pth" \
  --batch_size 256 \
  --patch_size 16 \
  --threshold 0.9 \
  --use_shape

Note that the threshold parameter (which controls the fraction of attention mass retained, similar to DINO) can be varied to obtain different segmentations. We selected 0.9 for the best segmentation performance on PASCAL VOC 2012. Segmentations for different threshold settings can be visualized with the following command (adjust the threshold as needed); add the --use_shape flag to generate visualizations from the shape token.

python evaluate_segmentation.py \
  --model_name "dino_small_dist" \
  --pretrained_weights "https://github.com/Muzammal-Naseer/Intriguing-Properties-of-Vision-Transformers/releases/download/v0/deit_s_sin_dist.pth" \
  --threshold 0.9 \
  --patch_size 16 \
  --generate_images \
  --test_dir "data/sample_images" \
  --save_path "data/sample_segmentations"
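
For intuition on the threshold, the hedged sketch below shows one way to turn [CLS]-token (or shape-token) attention into a patch mask by keeping the highest-attention patches until the chosen fraction of attention mass is covered, similar in spirit to DINO's visualizations. The tensor shapes and function name are assumptions, not the repository's exact code.

import torch

def attention_to_mask(cls_attn: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """cls_attn: (num_heads, num_patches) attention from the [CLS] (or shape) token
    to each patch. Returns a boolean mask keeping the smallest set of patches
    whose combined attention mass reaches `threshold`."""
    attn = cls_attn.mean(dim=0)                 # average over heads -> (num_patches,)
    attn = attn / attn.sum()                    # normalize to a distribution
    order = torch.argsort(attn, descending=True)
    cumulative = torch.cumsum(attn[order], dim=0)
    keep = cumulative <= threshold              # patches needed to reach the threshold
    keep[0] = True                              # always keep the top patch
    mask = torch.zeros_like(attn, dtype=torch.bool)
    mask[order[keep]] = True
    return mask                                 # reshape to (H/16, W/16) for display

# A higher threshold keeps more patches, giving larger (coarser) foreground masks.
mask = attention_to_mask(torch.rand(6, 196), threshold=0.9)
print(mask.sum().item(), "of 196 patches kept")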

Off the Shelf Features

Training code for the off-the-shelf feature experiments is in classify_metadataset.py. Seven datasets (aircraft, CUB, DTD, fungi, GTSRB, Places365, and INAT) are available by default. Set the dataset root by updating DATA_PATH in classify_md.sh. For the ResNet baselines, we use the official PyTorch models. Training on each transfer dataset is limited to updating a final linear layer with a standard training schedule.
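
Conceptually, the off-the-shelf setup freezes the ViT backbone and trains only a linear classifier on its features. A minimal sketch, assuming a timm DeiT backbone; the variable names, class count, and training step are illustrative rather than the script's actual implementation:

import torch
import torch.nn as nn
import timm

# Frozen backbone: features only, no classifier head.
backbone = timm.create_model("deit_tiny_patch16_224", pretrained=True, num_classes=0)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 100                                            # e.g., FGVC-Aircraft variants
linear_head = nn.Linear(backbone.num_features, num_classes)  # the only trainable layer
optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def training_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    with torch.no_grad():               # backbone stays fixed
        feats = backbone(images)        # (B, num_features) class-token features
    logits = linear_head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()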

Direct Implementation

Run training and evaluation for a selected dataset (aircraft by default) using a selected model (DeiT-T by default):

./scripts/classify_md.sh

Additional Details

Set the DATASET variable to one of aircraft, CUB, DTD, fungi, GTSRB, Places365, or INAT, and MODEL to one of resnet50, deit-tiny, or deit-small. EXP_NAME can be set to any name (it is used for logging). The environment variable DATA_PATH should point to the relevant dataset root directory. Note that all dataset classes are simple modifications of the standard torchvision ImageFolder class.

python classify_metadataset.py \
  --datasets "$DATASET" \
  --data-path "$DATA_PATH" \
  --model "$MODEL" \
  --batch_size 256 \
  --epochs 10 \
  --project "off_the_shelf" \
  --exp "$EXP_NAME"

Global Image Context (Occlusion & Shuffle)

We apply various occlusion and shuffle operations to images to study the robustness of ViT models. All evaluation is carried out on the ImageNet 2012 validation set.

Direct Implementation

For direct evaluation on the ImageNet validation set (update the path in each script) using our proposed occlusion techniques and the shuffle operation, run:

./scripts/evaluate_occlusion.sh
./scripts/evaluate_shuffle.sh
./scripts/evaluate_occlusion_supp.sh

Additional Occlusion Details

We present three patch-based occlusion methods: Random, Salient, and Non-Salient PatchDrop. For all scripts, the environment variable DATA_PATH should point to the ImageNet validation directory. A pretrained model can be evaluated under the Random PatchDrop technique as follows:

python evaluate.py \
  --model_name deit_small_patch16_224 \
  --test_dir "$DATA_PATH" \
  --pretrained "https://dl.fbaipublicfiles.com/deit/deit_small_patch16_224-cd65a155.pth" \
  --random_drop
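
Conceptually, Random PatchDrop occludes a randomly chosen subset of non-overlapping cells on the input grid by setting them to zero before the image is fed to the model. A minimal sketch of that idea (the function name, grid size, and drop ratio are illustrative, not the repository's exact interface):

import torch

def random_patch_drop(images: torch.Tensor, drop_ratio: float = 0.5,
                      grid: tuple = (14, 14)) -> torch.Tensor:
    """Zero out a random subset of grid cells, independently per image.
    images: (B, C, H, W) with H and W divisible by the grid dimensions."""
    b, c, h, w = images.shape
    gh, gw = grid
    ph, pw = h // gh, w // gw                      # cell size (16x16 for 224 and a 14x14 grid)
    num_drop = int(drop_ratio * gh * gw)
    out = images.clone()
    for i in range(b):
        drop = torch.randperm(gh * gw)[:num_drop]  # cells to occlude for this image
        for cell in drop.tolist():
            r, col = divmod(cell, gw)
            out[i, :, r * ph:(r + 1) * ph, col * pw:(col + 1) * pw] = 0
    return out

# Occlude 80% of a batch of 224x224 images on the default 14x14 grid.
occluded = random_patch_drop(torch.rand(2, 3, 224, 224), drop_ratio=0.8)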

Evaluation of a pretrained model under the Salient and Non-Salient PatchDrop techniques can be done as below. This method uses DINO attention to separate foreground and background pixels.

python evaluate.py \
  --model_name deit_small_patch16_224 \
  --test_dir "$DATA_PATH" \
  --pretrained "https://dl.fbaipublicfiles.com/deit/deit_small_patch16_224-cd65a155.pth" \
  --dino

We also experiment with Random PatchDrop under varied settings: different grid sizes, pixel-level dropping, grid offsets, and dropping on intermediate feature maps. The default setting evaluates on a 14x14 grid. To evaluate with a different grid size, run the following (8x8 shown; replace with the desired size):

python evaluate.py \
  --model_name deit_tiny_patch16_224 \
  --test_dir "$DATA_PATH" \
  --random_drop \
  --shuffle_size 8 8 

Evaluate at the pixel level (i.e., with the grid size equal to the image dimensions) as follows:

python evaluate.py \
  --model_name deit_tiny_patch16_224 \
  --test_dir "$DATA_PATH" \
  --random_drop \
  --shuffle_size 224 224

Using a grid with an offset:

python evaluate.py \
  --model_name deit_tiny_patch16_224 \
  --test_dir "$DATA_PATH" \
  --random_drop \
  --random_offset_drop

Intermediate feature drop on DeiT models:

python evaluate.py \
  --model_name deit_tiny_patch16_224 \
  --test_dir "$DATA_PATH" \
  --lesion \
  --block_index 0 2 4 8 10

Intermediate feature drop on ResNet50:

python evaluate.py \
  --model_name resnet_drop \
  --test_dir "$DATA_PATH" \
  --lesion \
  --block_index 1 2 3 4 5

Note that intermediate feature drop depends on the network architecture. Currently, only three DeiT variants (tiny, small, base) and ResNet50 are supported.
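
As a rough illustration (not the repository's implementation), intermediate feature drop can be emulated with PyTorch forward hooks that zero a random subset of patch tokens at selected transformer blocks; the sketch below assumes a timm DeiT whose blocks output tensors of shape (batch, 1 + num_patches, dim):

import torch
import timm

def make_token_drop_hook(drop_ratio: float = 0.5):
    def hook(module, inputs, output):
        tokens = output.clone()                      # (B, 1 + N, D): [CLS] + patch tokens
        n = tokens.shape[1] - 1                      # number of patch tokens
        num_drop = int(drop_ratio * n)
        for b in range(tokens.shape[0]):
            idx = torch.randperm(n)[:num_drop] + 1   # skip the [CLS] token at position 0
            tokens[b, idx] = 0
        return tokens                                # returned value replaces the block output
    return hook

model = timm.create_model("deit_tiny_patch16_224", pretrained=True).eval()
block_indices = [0, 2, 4, 8, 10]                     # analogous to --block_index above
handles = [model.blocks[i].register_forward_hook(make_token_drop_hook(0.5))
           for i in block_indices]
with torch.no_grad():
    logits = model(torch.rand(1, 3, 224, 224))
for h in handles:
    h.remove()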

Additional Shuffle Details

Evaluate under the shuffle operation with a range of grid sizes to measure robustness to spatial permutations of image patches, as follows:

python evaluate.py \
  --model_name deit_tiny_patch16_224 \
  --test_dir "$DATA_PATH" \
  --shuffle \
  --shuffle_h 2 2 4 4 8 14 16 \
  --shuffle_w 2 4 4 8 8 14 16
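
For intuition, the shuffle operation partitions each image into a shuffle_h x shuffle_w grid of cells and randomly permutes those cells. A minimal sketch of this operation (function and argument names are illustrative):

import torch

def shuffle_patches(image: torch.Tensor, shuffle_h: int, shuffle_w: int) -> torch.Tensor:
    """Randomly permute the grid cells of a single (C, H, W) image."""
    c, h, w = image.shape
    ph, pw = h // shuffle_h, w // shuffle_w
    # Split into cells: (C, shuffle_h, ph, shuffle_w, pw) -> (num_cells, C, ph, pw)
    cells = image.reshape(c, shuffle_h, ph, shuffle_w, pw).permute(1, 3, 0, 2, 4)
    cells = cells.reshape(shuffle_h * shuffle_w, c, ph, pw)
    cells = cells[torch.randperm(cells.shape[0])]    # random permutation of the cells
    # Reassemble the shuffled grid back into an image.
    cells = cells.reshape(shuffle_h, shuffle_w, c, ph, pw).permute(2, 0, 3, 1, 4)
    return cells.reshape(c, h, w)

shuffled = shuffle_patches(torch.rand(3, 224, 224), shuffle_h=4, shuffle_w=4)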

Varying Patch Sizes and Positional Encoding

We present pretrained DeiT-T models to study the effect of varying patch sizes and positional encoding:

DeiT-T Model | Top-1 | Top-5 | Pretrained
No Pos. Enc. | 68.3 | 89.0 | Link
Patch 22 | 68.7 | 89.0 | Link
Patch 28 | 65.2 | 86.7 | Link
Patch 32 | 63.1 | 85.3 | Link
Patch 38 | 55.2 | 78.8 | Link

Adversarial Patches

TBA

Stylized ImageNet Training

We generate Stylized ImageNet (SIN) following the SIN repository. To obtain SIN-only models, we use the DeiT codebase to perform standard training of DeiT models on the SIN dataset instead of natural ImageNet. The shape-distilled models are obtained with the DeiT distillation mechanism: training is performed on natural ImageNet, but the teacher is a CNN (or ViT) trained on SIN rather than a CNN trained on natural ImageNet.
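
For reference, DeiT-style hard-label distillation supervises the class token with the ground-truth label and the distillation (here, shape) token with the teacher's predicted hard label. A hedged sketch of such a loss, with illustrative tensor names and the equal weighting used by DeiT:

import torch
import torch.nn.functional as F

def shape_distillation_loss(cls_logits: torch.Tensor,
                            dist_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            labels: torch.Tensor) -> torch.Tensor:
    """Hard distillation: the class token learns from ground truth, while the
    distillation (shape) token learns from the SIN-trained teacher's hard labels."""
    base_loss = F.cross_entropy(cls_logits, labels)
    teacher_labels = teacher_logits.argmax(dim=1)        # hard labels from the teacher
    distill_loss = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * base_loss + 0.5 * distill_loss

# Toy example with batch size 4 and 1000 classes.
loss = shape_distillation_loss(torch.randn(4, 1000), torch.randn(4, 1000),
                               torch.randn(4, 1000), torch.randint(0, 1000, (4,)))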

References

Our code builds on the DeiT repository and the TIMM library. We thank the authors for releasing their code.

Citation

If you find our work, this repository, or the pretrained Transformers useful, please consider giving a star :star: and citing us:

@inproceedings{naseer2021intriguing,
  title={Intriguing Properties of Vision Transformers},
  author={Muzammal Naseer and Kanchana Ranasinghe and Salman Khan and Munawar Hayat and Fahad Khan and Ming-Hsuan Yang},
  booktitle={Advances in Neural Information Processing Systems},
  editor={A. Beygelzimer and Y. Dauphin and P. Liang and J. Wortman Vaughan},
  year={2021},
  url={https://openreview.net/forum?id=o2mbl-Hmfgd}
}