Official PyTorch implementation of "TransNeXt: Robust Foveal Visual Perception for Vision Transformers" [CVPR 2024] .
🤗 Don’t hesitate to give me a ⭐️, if you are interested in this project!
2024.06.08 We have created an explanatory video for our paper. You can watch it on YouTube or BiliBili.
2024.04.20 We have released the complete training and inference code, pre-trained model weights, and training logs!
2024.02.26 Our paper has been accepted by CVPR 2024! 🎉
2023.11.28 We have submitted the preprint of our paper to Arxiv
2023.09.21 We have submitted our paper and the model code to OpenReview, where it is publicly accessible.
:heavy_check_mark: Release of model code and CUDA implementation for acceleration.
:heavy_check_mark: Release of comprehensive training and inference code.
:heavy_check_mark: Release of pretrained model weights and training logs.
Our study observes unnatural visual perception (blocky artifacts) prevalent in the ERF of many visual backbones. We found that these artifacts, related to the design of their token mixers, are common in existing efficient ViTs and CNNs, and are difficult to eliminate through deep layer stacking. Our proposed pixel-focused attention and aggregated attention, designed through biomimicry, have provided a very effective solution for the artifact phenomenon, achieving natural and smooth visual perception.
Our study reevaluates the conventional belief of superior multi-scale adaptability in CNNs over ViTs. We highlight two key findings:
Classification code & weights & configs & training logs are >>>here<<<.
ImageNet-1K 224x224 pre-trained models:
Model | #Params | #FLOPs | IN-1K | IN-A | IN-C↓ | IN-R | Sketch | IN-V2 | Download | Config | Log |
---|---|---|---|---|---|---|---|---|---|---|---|
TransNeXt-Micro | 12.8M | 2.7G | 82.5 | 29.9 | 50.8 | 45.8 | 33.0 | 72.6 | model | config | log |
TransNeXt-Tiny | 28.2M | 5.7G | 84.0 | 39.9 | 46.5 | 49.6 | 37.6 | 73.8 | model | config | log |
TransNeXt-Small | 49.7M | 10.3G | 84.7 | 47.1 | 43.9 | 52.5 | 39.7 | 74.8 | model | config | log |
TransNeXt-Base | 89.7M | 18.4G | 84.8 | 50.6 | 43.5 | 53.9 | 41.4 | 75.1 | model | config | log |
ImageNet-1K 384x384 fine-tuned models:
Model | #Params | #FLOPs | IN-1K | IN-A | IN-R | Sketch | IN-V2 | Download | Config |
---|---|---|---|---|---|---|---|---|---|
TransNeXt-Small | 49.7M | 32.1G | 86.0 | 58.3 | 56.4 | 43.2 | 76.8 | model | config |
TransNeXt-Base | 89.7M | 56.3G | 86.2 | 61.6 | 57.7 | 44.7 | 77.0 | model | config |
ImageNet-1K 256x256 pre-trained model fully utilizing aggregated attention at all stages:
(See Table.9 in Appendix D.6 for details)
Model | Token mixer | #Params | #FLOPs | IN-1K | Download | Config | Log |
---|---|---|---|---|---|---|---|
TransNeXt-Micro | A-A-A-A | 13.1M | 3.3G | 82.6 | model | config | log |
Object detection code & weights & configs & training logs are >>>here<<<.
COCO object detection and instance segmentation results using the Mask R-CNN method:
Backbone | Pretrained Model | Lr Schd | box mAP | mask mAP | #Params | Download | Config | Log |
---|---|---|---|---|---|---|---|---|
TransNeXt-Tiny | ImageNet-1K | 1x | 49.9 | 44.6 | 47.9M | model | config | log |
TransNeXt-Small | ImageNet-1K | 1x | 51.1 | 45.5 | 69.3M | model | config | log |
TransNeXt-Base | ImageNet-1K | 1x | 51.7 | 45.9 | 109.2M | model | config | log |
COCO object detection results using the DINO method:
Backbone | Pretrained Model | scales | epochs | box mAP | #Params | Download | Config | Log |
---|---|---|---|---|---|---|---|---|
TransNeXt-Tiny | ImageNet-1K | 4scale | 12 | 55.1 | 47.8M | model | config | log |
TransNeXt-Tiny | ImageNet-1K | 5scale | 12 | 55.7 | 48.1M | model | config | log |
TransNeXt-Small | ImageNet-1K | 5scale | 12 | 56.6 | 69.6M | model | config | log |
TransNeXt-Base | ImageNet-1K | 5scale | 12 | 57.1 | 110M | model | config | log |
Semantic segmentation code & weights & configs & training logs are >>>here<<<.
ADE20K semantic segmentation results using the UPerNet method:
Backbone | Pretrained Model | Crop Size | Lr Schd | mIoU | mIoU (ms+flip) | #Params | Download | Config | Log |
---|---|---|---|---|---|---|---|---|---|
TransNeXt-Tiny | ImageNet-1K | 512x512 | 160K | 51.1 | 51.5/51.7 | 59M | model | config | log |
TransNeXt-Small | ImageNet-1K | 512x512 | 160K | 52.2 | 52.5/52.8 | 80M | model | config | log |
TransNeXt-Base | ImageNet-1K | 512x512 | 160K | 53.0 | 53.5/53.7 | 121M | model | config | log |
ADE20K semantic segmentation results using the Mask2Former method:
Backbone | Pretrained Model | Crop Size | Lr Schd | mIoU | #Params | Download | Config | Log |
---|---|---|---|---|---|---|---|---|
TransNeXt-Tiny | ImageNet-1K | 512x512 | 160K | 53.4 | 47.5M | model | config | log |
TransNeXt-Small | ImageNet-1K | 512x512 | 160K | 54.1 | 69.0M | model | config | log |
TransNeXt-Base | ImageNet-1K | 512x512 | 160K | 54.7 | 109M | model | config | log |
Before installing the CUDA extension, please ensure that the CUDA version on your machine (checked with nvcc -V
)
matches the CUDA version of PyTorch.
cd swattention_extension
pip install .
This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.
If you find our work helpful, please consider citing the following bibtex. We would greatly appreciate a star for this project.
@InProceedings{shi2023transnext,
author = {Dai Shi},
title = {TransNeXt: Robust Foveal Visual Perception for Vision Transformers},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {17773-17783}
}