Awesome Vision Transformer Collection
Variants of Vision Transformer and Vision Transformer for Downstream Tasks
author: Runwei Guan
affiliation: University of Liverpool / JITRI-Institute of Deep Perception Technology
email: thinkerai@foxmail.com / Runwei.Guan@liverpool.ac.uk / guanrunwei@idpt.org
Image Backbone
- Vision Transformer paper code
- Swin Transformer paper code
- Swin Transformer V2: Scaling Up Capacity and Resolution paper code
- DVT paper code
- PVT paper code
- Lite Vision Transformer: LVT paper
- PiT paper code
- Twins paper code
- TNT paper code
- MobileViT paper code
- CrossViT paper code
- LeViT paper code
- ViT-Lite paper
- Refiner paper code
- DeepViT paper code
- CaiT paper code
- LV-ViT paper code
- DeiT paper code
- CeiT paper code
- BoTNet paper
- ViTAE paper
- Visformer: The Vision-Friendly Transformer paper code
- Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training paper
- AdaViT: Adaptive Tokens for Efficient Vision Transformer paper
- Improved Multiscale Vision Transformers for Classification and Detection paper
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding paper
- Towards End-to-End Image Compression and Analysis with Transformers paper
- MPViT: Multi-Path Vision Transformer for Dense Prediction paper
- Lite Vision Transformer with Enhanced Self-Attention paper
- PolyViT: Co-training Vision Transformers on Images, Videos and Audio paper
- MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation paper
- ELSA: Enhanced Local Self-Attention for Vision Transformer paper
- Vision Transformer for Small-Size Datasets paper
- SimViT: Exploring a Simple Vision Transformer with sliding windows paper
- SPViT: Enabling Faster Vision Transformers via Soft Token Pruning paper
- Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space paper
- Vision Transformer with Deformable Attention paper code
- PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture paper
- QuadTree Attention for Vision Transformers paper code
- TerViT: An Efficient Ternary Vision Transformer paper
- BViT: Broad Attention based Vision Transformer paper
- CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction paper
- EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers paper
- Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention paper
- Coarse-to-Fine Vision Transformer paper
- ViT-P: Rethinking Data-efficient Vision Transformers from Locality paper
- Event Transformer paper
- DaViT: Dual Attention Vision Transformers paper
- LightViT: Towards Light-Weight Convolution-Free Vision Transformers paper
- UniNet: Unified Architecture Search with Convolution, Transformer, and MLP paper
- Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning paper
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications paper
Multi-label Classification
- Graph Attention Transformer Network for Multi-Label Image Classification paper
Point Cloud Processing
- Point Cloud Transformer paper
- Point Transformer paper
- Fast Point Transformer paper
- Adaptive Channel Encoding Transformer for Point Cloud Analysis paper
- PTTR: Relational 3D Point Cloud Object Tracking with Transformer paper
- Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction paper
- LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling paper
- Geometric Transformer for Fast and Robust Point Cloud Registration paper
- HiTPR: Hierarchical Transformer for Place Recognition in Point Cloud paper
Video Processing
- Video Transformers: A Survey paper
- ViViT: A Video Vision Transformer paper
- Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos paper
- LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach paper
- Video Joint Modelling Based on Hierarchical Transformer for Co-summarization paper
- InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer paper
- TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers paper
- UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning paper
- Multiview Transformers for Video Recognition paper
- MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition paper
- Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval paper
- A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection paper
- Learning Trajectory-Aware Transformer for Video Super-Resolution paper
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer paper
Model Compression
- A Unified Pruning Framework for Vision Transformers paper
- Multi-Dimensional Model Compression of Vision Transformer paper
- Contextformer: A Transformer with Spatio-Channel Attention for Context Modeling in Learned Image Compression paper
Transfer Learning & Pretraining
- Pre-Trained Image Processing Transformer paper code
- UP-DETR: Unsupervised Pre-training for Object Detection with Transformers paper code
- BEVT: BERT Pretraining of Video Transformers paper
- Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text paper
- On Efficient Transformer and Image Pre-training for Low-level Vision paper
- Pre-Training Transformers for Domain Adaptation paper
- RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training paper
- Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification paper
- DiT: Self-supervised Pre-training for Document Image Transformer paper
- Underwater Image Enhancement Using Pre-trained Transformer paper
Multi-Modal
- Multi-Modal Fusion Transformer for End-to-End Autonomous Driving paper
- Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval paper
- LAVT: Language-Aware Vision Transformer for Referring Image Segmentation paper
- MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection paper
- Visual-Semantic Transformer for Scene Text Recognition paper
- Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text paper
- LaTr: Layout-Aware Transformer for Scene-Text VQA paper
- Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding paper
- Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation paper
- Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) -- Team: MMCUniAugsburg paper
- On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering paper
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers paper
- CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers paper
- VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer paper
- Knowledge Amalgamation for Object Detection with Transformers paper
- Are Multimodal Transformers Robust to Missing Modality? paper
- Self-supervised Vision Transformers for Joint SAR-optical Representation Learning paper
- Video Graph Transformer for Video Question Answering paper
Detection
- YOLOS: You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection paper code
- WB-DETR: Transformer-Based Detector without Backbone paper
- UP-DETR: Unsupervised Pre-training for Object Detection with Transformers paper
- TSP: Rethinking Transformer-based Set Prediction for Object Detection paper
- DETR paper code
- Deformable DETR paper code
- DN-DETR: Accelerate DETR Training by Introducing Query DeNoising paper code
- End-to-End Object Detection with Adaptive Clustering Transformer paper
- An End-to-End Transformer Model for 3D Object Detection paper
- End-to-End Human Object Interaction Detection with HOI Transformer paper code
- Adaptive Image Transformer for One-Shot Object Detection paper
- Improving 3D Object Detection With Channel-Wise Transformer paper
- TransPose: Keypoint Localization via Transformer paper
- Voxel Transformer for 3D Object Detection paper
- Embracing Single Stride 3D Object Detector with Sparse Transformer paper
- OW-DETR: Open-world Detection Transformer paper
- A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation paper
- Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence paper
- Short Range Correlation Transformer for Occluded Person Re-Identification paper
- TransVPR: Transformer-based place recognition with multi-level attention aggregation paper
- Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond paper
- Arbitrary Shape Text Detection using Transformers paper
- A high-precision underwater object detection based on joint self-supervised deblurring and improved spatial transformer network paper
- A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection paper
- Knowledge Amalgamation for Object Detection with Transformers paper
- SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection paper
- POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition paper
- PSTR: End-to-End One-Step Person Search With Transformers paper
- Scaling Novel Object Detection with Weakly Supervised Detection Transformers paper
- OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers paper
- Exploring Plain Vision Transformer Backbones for Object Detection paper
Segmentation
- Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation paper code
- Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention paper code
- MaX-DeepLab: End-to-End Panoptic Segmentation With Mask Transformers paper code
- Line Segment Detection Using Transformers without Edges paper
- VisTR: End-to-End Video Instance Segmentation with Transformers paper code
- SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers paper code
- Segmenter: Transformer for Semantic Segmentation paper
- Fully Transformer Networks for Semantic Image Segmentation paper
- SOTR: Segmenting Objects with Transformers paper code
- GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation paper
- Masked-attention Mask Transformer for Universal Image Segmentation paper
- A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation paper
- iSegFormer: Interactive Image Segmentation with Transformers paper
- SOIT: Segmenting Objects with Instance-Aware Transformers paper
- SeMask: Semantically Masked Transformers for Semantic Segmentation paper
- Siamese Network with Interactive Transformer for Video Object Segmentation paper
- Pyramid Fusion Transformer for Semantic Segmentation paper
- Swin transformers make strong contextual encoders for VHR image road extraction paper
- Transformers in Action: Weakly Supervised Action Segmentation paper
- Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation paper
- Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers paper
- Contextual Attention Network: Transformer Meets U-Net paper
- TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation paper
Pose Estimation
- Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation paper
- HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation paper
- End-to-End Human Pose and Mesh Reconstruction with Transformers paper code
- PE-former: Pose Estimation Transformer paper
- Pose Recognition with Cascade Transformers paper code
- Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer code
- Geometry-Contrastive Transformer for Generalized 3D Pose Transfer paper
- Temporal Transformer Networks with Self-Supervision for Action Recognition paper
- Co-training Transformer with Videos and Images Improves Action Recognition paper
- DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer paper
- Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition paper
- Motion-Aware Transformer For Occluded Person Re-identification paper
- HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders paper
- ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers paper
- Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding paper
- Spatial Transformer Network on Skeleton-based Gait Recognition paper
Tracking and Trajectory Prediction
- Transformer Tracking paper code
- Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking paper code
- MOTR: End-to-End Multiple-Object Tracking with TRansformer paper code
- SwinTrack: A Simple and Strong Baseline for Transformer Tracking paper
- Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network paper
- PTTR: Relational 3D Point Cloud Object Tracking with Transformer paper
- Efficient Visual Tracking with Exemplar Transformers paper
- TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer paper
Generative Model and Denoising
- 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds paper
- Spatial-Temporal Transformer for Dynamic Scene Graph Generation paper
- THUNDR: Transformer-Based 3D Human Reconstruction With Markers paper
- DoodleFormer: Creative Sketch Drawing with Transformers paper
- Image Transformer paper
- Taming Transformers for High-Resolution Image Synthesis paper code
- TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up code
- U2-Former: A Nested U-shaped Transformer for Image Restoration paper
- Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers paper
- SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers paper
- StyleSwin: Transformer-based GAN for High-resolution Image Generation paper
- Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction paper
- SGTR: End-to-end Scene Graph Generation with Transformer paper
- Flow-Guided Sparse Transformer for Video Deblurring paper
- Spherical Transformer paper
- MaskGIT: Masked Generative Image Transformer paper
- Entroformer: A Transformer-based Entropy Model for Learned Image Compression paper
- UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation paper
- Stripformer: Strip Transformer for Fast Image Deblurring paper
- Vision Transformers for Single Image Dehazing paper
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer paper
Self-Supervised Learning
- Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning paper code
- iGPT paper code
- An Empirical Study of Training Self-Supervised Vision Transformers paper code
- Self-supervised Video Transformer paper
- TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning paper
- TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning paper
- Transformers in Action: Weakly Supervised Action Segmentation paper
- Motion-Aware Transformer For Occluded Person Re-identification paper
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics paper
- Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut paper
- Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers paper
- Multi-class Token Transformer for Weakly Supervised Semantic Segmentation paper
- Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers paper
- DiT: Self-supervised Pre-training for Document Image Transformer paper
- Self-supervised Vision Transformers for Joint SAR-optical Representation Learning paper
- DILEMMA: Self-Supervised Shape and Texture Learning with Transformers paper
Depth and Height Estimation
- Disentangled Latent Transformer for Interpretable Monocular Height Estimation paper
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics paper
- SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification paper
Explainable
- Development and testing of an image transformer for explainable autonomous driving systems paper
- Transformer Interpretability Beyond Attention Visualization paper code
- How Do Vision Transformers Work? paper
- eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation paper
Robustness
- Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding paper
Deep Reinforcement Learning
- Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels paper
Calibration
- CTRL-C: Camera Calibration TRansformer With Line-Classification paper code
Radar
- Learning class prototypes from Synthetic InSAR with Vision Transformers paper
- Radar Transformer paper
Traffic
- SwinUNet3D -- A Hierarchical Architecture for Deep Traffic Prediction using Shifted Window Transformers paper
AI Medicine
- Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer paper
- 3D Medical Point Transformer: Introducing Convolution to Attention Networks for Medical Point Cloud Analysis paper
- Hformer: Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks paper
- MT-TransUNet: Mediating Multi-Task Tokens in Transformers for Skin Lesion Segmentation and Classification paper
- MSHT: Multi-stage Hybrid Transformer for the ROSE Image Analysis of Pancreatic Cancer paper
- Generalized Wasserstein Dice Loss, Test-time Augmentation, and Transformers for the BraTS 2021 challenge paper
- D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation paper
- RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark paper
- Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images paper
- Swin Transformer for Fast MRI paper code
- Automatic Segmentation of Head and Neck Tumor: How Powerful Transformers Are? paper
- ViTBIS: Vision Transformer for Biomedical Image Segmentation paper
- SegTransVAE: Hybrid CNN -- Transformer with Regularization for medical image segmentation paper
- Improving Across-Dataset Brain Tissue Segmentation Using Transformer paper
- Brain Cancer Survival Prediction on Treatment-naive MRI using Deep Anchor Attention Learning with Vision Transformer paper
- Indication as Prior Knowledge for Multimodal Disease Classification in Chest Radiographs with Transformers paper
- AI can evolve without labels: self-evolving vision transformer for chest X-ray diagnosis through knowledge distillation paper
- Uni4Eye: Unified 2D and 3D Self-supervised Pre-training via Masked Image Modeling Transformer for Ophthalmic Image Classification paper
- Characterizing Renal Structures with 3D Block Aggregate Transformers paper
- Multimodal Transformer for Nursing Activity Recognition paper
- RTN: Reinforced Transformer Network for Coronary CT Angiography Vessel-level Image Quality Assessment paper
- Radiomics-Guided Global-Local Transformer for Weakly Supervised Pathology Localization in Chest X-Rays paper
Hardware
- VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer paper