
This is the official code release for our work, Denoising Vision Transformers.
MIT License

Denoising Vision Transformers

[**Jiawei Yang**](https://jiawei-yang.github.io/)1*† · [**Katie Z Luo**](https://www.cs.cornell.edu/~katieluo/)2* · [**Jiefeng Li**](https://jeffli.site/)3 · [**Congyue Deng**](https://cs.stanford.edu/~congyue/)4
[**Leonidas Guibas**](https://geometry.stanford.edu/member/guibas/)4 · [**Dilip Krishnan**](https://dilipkay.wordpress.com/)5 · [**Kilian Q. Weinberger**](https://www.cs.cornell.edu/~kilian/)2
[**Yonglong Tian**](https://people.csail.mit.edu/yonglong/)5 · [**Yue Wang**](https://yuewang.xyz/)1

1University of Southern California   2Cornell University   3Shanghai Jiao Tong University   4Stanford University   5Google Research

†project lead   *equal technical contribution

Accepted to ECCV 2024 · Paper PDF · Project Page

TL;DR

This work presents Denoising Vision Transformers (DVT). DVT removes the visually distracting artifacts commonly seen in ViT feature maps and substantially improves downstream performance on dense recognition tasks.

teaser

Citation

@article{yang2024denoising,
  author = {Yang, Jiawei and Luo, Katie Z and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas J. and Krishnan, Dilip and Weinberger, Kilian Q and Tian, Yonglong and Wang, Yue},
  title = {DVT: Denoising Vision Transformers},
  journal = {arXiv preprint arXiv:2401.02957},
  year = {2024},
}

This README file and codebase are legacy. We will update them soon.

Installation

  1. Create a conda environment:

     ```shell
     conda create -n dvt python=3.10 -y
     ```

  2. Activate the environment:

     ```shell
     conda activate dvt
     ```

  3. Install dependencies from `requirements.txt`:

     ```shell
     pip install -r requirements.txt
     ```

  4. Install tiny-cuda-nn manually:

     ```shell
     pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
     ```

If you encounter the error `nvcc fatal : Unsupported gpu architecture compute_89`, try the following command:

```shell
TCNN_CUDA_ARCHITECTURES=86 pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
```

If you encounter the error `parameter packs not expanded with '...'`, refer to this solution on GitHub.
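The value to pass to `TCNN_CUDA_ARCHITECTURES` is just your GPU's compute capability written as a two-digit integer (e.g., capability 8.6 becomes 86). A minimal sketch for deriving it locally, assuming PyTorch with CUDA is installed (the helper name is ours, not part of this repo):

```python
def tcnn_arch_from_capability(capability):
    """Map a CUDA compute capability tuple (major, minor) to the integer
    expected by TCNN_CUDA_ARCHITECTURES, e.g. (8, 6) -> 86."""
    major, minor = capability
    return major * 10 + minor

if __name__ == "__main__":
    try:
        import torch  # only needed to query the local GPU
        arch = tcnn_arch_from_capability(torch.cuda.get_device_capability())
    except Exception:
        # No GPU / no PyTorch available; fall back to a known example (RTX 30xx).
        arch = tcnn_arch_from_capability((8, 6))
    print(f"TCNN_CUDA_ARCHITECTURES={arch}")
```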

Data preparation

  1. **PASCAL VOC 2007 and 2012**: Download the PASCAL VOC07 and VOC12 datasets (link) and put the data in the `data` folder, e.g.,

     ```shell
     mkdir -p data
     cd data
     wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
     tar -xf VOCtrainval_06-Nov-2007.tar
     wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
     tar -xf VOCtrainval_11-May-2012.tar
     ```

In our experiments reported in the paper, we used the first 10,000 examples from data/voc_train.txt for stage-1 denoising. This text file was generated by gathering all JPG images from data/VOC2007/JPEGImages and data/VOC2012/JPEGImages, excluding the validation images, and then randomly shuffling them.

  2. **ADE20K** [legacy, needs checking]: Download the ADE20K dataset and put the data in `data/ADEChallengeData2016`.

  3. **NYU-D**: Download the NYU-Depth dataset and put the data in `data/nyu`. Results are reported using the 2014 annotations, following previous works.

  4. **ImageNet** (optional):

Run the code

See sample_scripts for examples of running the code.

We provide some demo outputs in `demo/demo_outputs`. For example, this image shows our denoising results on a cat image. From left to right: (1) input crop; (2) raw DINOv2-base output; (3) K-means clustering of the raw output; (4) L2 feature norm of the raw output; (5) similarity between the central patch and all other patches in the raw output; (6) our denoised output; (7) K-means clustering of the denoised output; (8) L2 feature norm of the denoised output; (9) similarity between the central patch and all other patches in the denoised output; (10) the decomposed shared artifacts; (11) L2 norm of the shared artifacts; (12) the ground-truth residual error; (13) the predicted residual term; and (14) the composition of the shared artifacts and the predicted residual term.
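Two of the visualizations above, the per-patch L2 feature norm and the central-patch similarity maps, are straightforward to reproduce from any patch-feature tensor. A minimal NumPy sketch (the function is ours for illustration; it assumes an `(H, W, C)` feature map, not this repo's actual plotting code):

```python
import numpy as np

def feature_norm_and_central_similarity(feats):
    """Given an (H, W, C) patch-feature map, return the per-patch L2 norm
    map (H, W) and the cosine similarity of every patch to the central
    patch (H, W)."""
    H, W, _ = feats.shape
    norms = np.linalg.norm(feats, axis=-1)
    center = feats[H // 2, W // 2]
    # Cosine similarity: dot product normalized by both vector norms.
    denom = norms * np.linalg.norm(center) + 1e-8
    sims = feats @ center / denom
    return norms, sims
```

Applied to a raw ViT feature map, the similarity panel typically exposes the grid-like artifacts that DVT removes.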

Main Results and Checkpoints

VOC Evaluation Results

| Backbone | mIoU | aAcc | mAcc | Logfile |
|---|---|---|---|---|
| MAE | 50.24 | 88.02 | 63.15 | log |
| MAE + DVT | 50.53 | 88.06 | 63.29 | log |
| DINO | 63.00 | 91.38 | 76.35 | log |
| DINO + DVT | 66.22 | 92.41 | 78.14 | log |
| Registers | 83.64 | 96.31 | 90.67 | log |
| Registers + DVT | 84.50 | 96.56 | 91.45 | log |
| DeiT3 | 70.62 | 92.69 | 81.23 | log |
| DeiT3 + DVT | 73.36 | 93.34 | 83.74 | log |
| EVA | 71.52 | 92.76 | 82.95 | log |
| EVA + DVT | 73.15 | 93.43 | 83.55 | log |
| CLIP | 77.78 | 94.74 | 86.57 | log |
| CLIP + DVT | 79.01 | 95.13 | 87.48 | log |
| DINOv2 | 83.60 | 96.30 | 90.82 | log |
| DINOv2 + DVT | 84.84 | 96.67 | 91.70 | log |

ADE20K Evaluation Results

| Backbone | mIoU | aAcc | mAcc | Logfile |
|---|---|---|---|---|
| MAE | 23.60 | 68.54 | 31.49 | log |
| MAE + DVT | 23.62 | 68.58 | 31.25 | log |
| DINO | 31.03 | 73.56 | 40.33 | log |
| DINO + DVT | 32.40 | 74.53 | 42.01 | log |
| Registers | 48.22 | 81.11 | 60.52 | log |
| Registers + DVT | 49.34 | 81.94 | 61.70 | log |
| DeiT3 | 32.73 | 72.61 | 42.81 | log |
| DeiT3 + DVT | 36.57 | 74.44 | 49.01 | log |
| EVA | 37.45 | 72.78 | 49.74 | log |
| EVA + DVT | 37.87 | 75.02 | 49.81 | log |
| CLIP | 40.51 | 76.44 | 52.47 | log |
| CLIP + DVT | 41.10 | 77.41 | 53.07 | log |
| DINOv2 | 47.29 | 80.84 | 59.18 | log |
| DINOv2 + DVT | 48.66 | 81.89 | 60.24 | log |

NYU-D Evaluation Results

| Backbone | RMSE | Rel | Logfile |
|---|---|---|---|
| MAE | 0.6695 | 0.2334 | log |
| MAE + DVT | 0.7080 | 0.2560 | log |
| DINO | 0.5832 | 0.1701 | log |
| DINO + DVT | 0.5780 | 0.1731 | log |
| Registers | 0.3969 | 0.1190 | log |
| Registers + DVT | 0.3880 | 0.1157 | log |
| DeiT3 | 0.5880 | 0.1788 | log |
| DeiT3 + DVT | 0.5891 | 0.1802 | log |
| EVA | 0.6446 | 0.1989 | log |
| EVA + DVT | 0.6243 | 0.1964 | log |
| CLIP | 0.5598 | 0.1679 | log |
| CLIP + DVT | 0.5591 | 0.1667 | log |
| DINOv2 | 0.4034 | 0.1238 | log |
| DINOv2 + DVT | 0.3943 | 0.1200 | log |

Denoiser Checkpoints

- [ ] To be released.