czczup / ViT-Adapter

[ICLR 2023 Spotlight] Vision Transformer Adapter for Dense Predictions
https://arxiv.org/abs/2205.08534
Apache License 2.0

Multi-modal pre-trained weight for ViT-B #92

Open seungyonglee0802 opened 1 year ago

seungyonglee0802 commented 1 year ago

In the paper, it is shown that replacing ImageNet-22K pre-training with multi-modal pre-training gives a significant performance gain for ViT-Adapter-B (Mask R-CNN 3x + MS schedule).

However, I couldn't find the multi-modal pre-trained ViT-B weights for ViT-Adapter-B.

Thank you in advance :)