
This is the official code release for our work, Denoising Vision Transformers.
MIT License

Denoising Vision Transformers

[**Jiawei Yang**](https://jiawei-yang.github.io/)1*† · [**Katie Z Luo**](https://www.cs.cornell.edu/~katieluo/)2* · [**Jiefeng Li**](https://jeffli.site/)3 · [**Congyue Deng**](https://cs.stanford.edu/~congyue/)4
[**Leonidas Guibas**](https://geometry.stanford.edu/member/guibas/)4 · [**Dilip Krishnan**](https://dilipkay.wordpress.com/)5 · [**Kilian Q. Weinberger**](https://www.cs.cornell.edu/~kilian/)2
[**Yonglong Tian**](https://people.csail.mit.edu/yonglong/)5 · [**Yue Wang**](https://yuewang.xyz/)1

1University of Southern California   2Cornell University   3Shanghai Jiao Tong University   4Stanford University   5Google Research

†project lead   *equal technical contribution

Accepted to ECCV 2024 · Paper PDF · Project Page

TL;DR

This work presents Denoising Vision Transformers (DVT). DVT removes the visually distracting artifacts commonly seen in ViT feature maps and substantially improves downstream performance on dense recognition tasks.

teaser

Citation

@article{yang2024denoising,
  author = {Yang, Jiawei and Luo, Katie Z and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas J. and Krishnan, Dilip and Weinberger, Kilian Q and Tian, Yonglong and Wang, Yue},
  title = {DVT: Denoising Vision Transformers},
  journal = {arXiv preprint arXiv:2401.02957},
  year = {2024},
}

This README file and codebase are legacy. We will update them soon.

Installation

  1. Create a conda environment:

     ```shell
     conda create -n dvt python=3.10 -y
     ```

  2. Activate the environment:

     ```shell
     conda activate dvt
     ```

  3. Install dependencies from `requirements.txt`:

     ```shell
     pip install -r requirements.txt
     ```

  4. Install tiny-cuda-nn manually:

     ```shell
     pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
     ```

If you encounter the error `nvcc fatal : Unsupported gpu architecture compute_89`, try the following command:

```shell
TCNN_CUDA_ARCHITECTURES=86 pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
```

If you encounter the error `parameter packs not expanded with '...'`, refer to this solution on GitHub.
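The value to pass to `TCNN_CUDA_ARCHITECTURES` is just your GPU's compute capability written as a two-digit integer (e.g., capability 8.6 becomes 86). A minimal sketch for deriving it locally, assuming PyTorch with CUDA is installed (the helper name is ours, not part of this repo):

```python
def tcnn_arch_from_capability(capability):
    """Map a CUDA compute capability tuple (major, minor) to the integer
    expected by TCNN_CUDA_ARCHITECTURES, e.g. (8, 6) -> 86."""
    major, minor = capability
    return major * 10 + minor

if __name__ == "__main__":
    try:
        import torch  # only needed to query the local GPU
        arch = tcnn_arch_from_capability(torch.cuda.get_device_capability())
    except Exception:
        # No GPU / no PyTorch available; fall back to a known example (RTX 30xx).
        arch = tcnn_arch_from_capability((8, 6))
    print(f"TCNN_CUDA_ARCHITECTURES={arch}")
```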

Data preparation

  1. **PASCAL VOC 2007 and 2012**: Download the PASCAL VOC07 and VOC12 datasets (link) and put the data in the `data` folder, e.g.,

     ```shell
     mkdir -p data
     cd data
     wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
     tar -xf VOCtrainval_06-Nov-2007.tar
     wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
     tar -xf VOCtrainval_11-May-2012.tar
     ```

In our experiments reported in the paper, we used the first 10,000 examples from data/voc_train.txt for stage-1 denoising. This text file was generated by gathering all JPG images from data/VOC2007/JPEGImages and data/VOC2012/JPEGImages, excluding the validation images, and then randomly shuffling them.

  2. **ADE20K** [legacy, needs checking]: Download the ADE20K dataset and put the data in `data/ADEChallengeData2016`.

  3. **NYU-D**: Download the NYU-Depth dataset and put the data in `data/nyu`. Results are reported using the 2014 annotations, following previous works.

  4. **ImageNet** (optional):

Run the code

See sample_scripts for examples of running the code.

We provide some demo outputs in `demo/demo_outputs`. For example, this image shows our denoising results on a cat image. From left to right: (1) input crop; (2) raw DINOv2-base output; (3) K-means clustering of the raw output; (4) L2 feature norm of the raw output; (5) similarity between the central patch and all other patches in the raw output; (6) our denoised output; (7) K-means clustering of the denoised output; (8) L2 feature norm of the denoised output; (9) similarity between the central patch and all other patches in the denoised output; (10) the decomposed shared artifacts; (11) L2 norm of the shared artifacts; (12) the ground-truth residual error; (13) the predicted residual term; and (14) the composition of the shared artifacts and the predicted residual term.
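Two of the visualizations above, the per-patch L2 feature norm and the central-patch similarity maps, are straightforward to reproduce from any patch-feature tensor. A minimal NumPy sketch (the function is ours for illustration; it assumes an `(H, W, C)` feature map, not this repo's actual plotting code):

```python
import numpy as np

def feature_norm_and_central_similarity(feats):
    """Given an (H, W, C) patch-feature map, return the per-patch L2 norm
    map (H, W) and the cosine similarity of every patch to the central
    patch (H, W)."""
    H, W, _ = feats.shape
    norms = np.linalg.norm(feats, axis=-1)
    center = feats[H // 2, W // 2]
    # Cosine similarity: dot product normalized by both vector norms.
    denom = norms * np.linalg.norm(center) + 1e-8
    sims = feats @ center / denom
    return norms, sims
```

Applied to a raw ViT feature map, the similarity panel typically exposes the grid-like artifacts that DVT removes.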

Main Results and Checkpoints

VOC Evaluation Results

| Backbone | mIoU | aAcc | mAcc | Logfile |
|---|---|---|---|---|
| MAE | 50.24 | 88.02 | 63.15 | log |
| MAE + DVT | 50.53 | 88.06 | 63.29 | log |
| DINO | 63.00 | 91.38 | 76.35 | log |
| DINO + DVT | 66.22 | 92.41 | 78.14 | log |
| Registers | 83.64 | 96.31 | 90.67 | log |
| Registers + DVT | 84.50 | 96.56 | 91.45 | log |
| DeiT3 | 70.62 | 92.69 | 81.23 | log |
| DeiT3 + DVT | 73.36 | 93.34 | 83.74 | log |
| EVA | 71.52 | 92.76 | 82.95 | log |
| EVA + DVT | 73.15 | 93.43 | 83.55 | log |
| CLIP | 77.78 | 94.74 | 86.57 | log |
| CLIP + DVT | 79.01 | 95.13 | 87.48 | log |
| DINOv2 | 83.60 | 96.30 | 90.82 | log |
| DINOv2 + DVT | 84.84 | 96.67 | 91.70 | log |

ADE20K Evaluation Results

| Backbone | mIoU | aAcc | mAcc | Logfile |
|---|---|---|---|---|
| MAE | 23.60 | 68.54 | 31.49 | log |
| MAE + DVT | 23.62 | 68.58 | 31.25 | log |
| DINO | 31.03 | 73.56 | 40.33 | log |
| DINO + DVT | 32.40 | 74.53 | 42.01 | log |
| Registers | 48.22 | 81.11 | 60.52 | log |
| Registers + DVT | 49.34 | 81.94 | 61.70 | log |
| DeiT3 | 32.73 | 72.61 | 42.81 | log |
| DeiT3 + DVT | 36.57 | 74.44 | 49.01 | log |
| EVA | 37.45 | 72.78 | 49.74 | log |
| EVA + DVT | 37.87 | 75.02 | 49.81 | log |
| CLIP | 40.51 | 76.44 | 52.47 | log |
| CLIP + DVT | 41.10 | 77.41 | 53.07 | log |
| DINOv2 | 47.29 | 80.84 | 59.18 | log |
| DINOv2 + DVT | 48.66 | 81.89 | 60.24 | log |

NYU-D Evaluation Results

| Backbone | RMSE | Rel | Logfile |
|---|---|---|---|
| MAE | 0.6695 | 0.2334 | log |
| MAE + DVT | 0.7080 | 0.2560 | log |
| DINO | 0.5832 | 0.1701 | log |
| DINO + DVT | 0.5780 | 0.1731 | log |
| Registers | 0.3969 | 0.1190 | log |
| Registers + DVT | 0.3880 | 0.1157 | log |
| DeiT3 | 0.5880 | 0.1788 | log |
| DeiT3 + DVT | 0.5891 | 0.1802 | log |
| EVA | 0.6446 | 0.1989 | log |
| EVA + DVT | 0.6243 | 0.1964 | log |
| CLIP | 0.5598 | 0.1679 | log |
| CLIP + DVT | 0.5591 | 0.1667 | log |
| DINOv2 | 0.4034 | 0.1238 | log |
| DINOv2 + DVT | 0.3943 | 0.1200 | log |

Denoiser Checkpoints

- [ ] To be released.