In the paper, it is shown that replacing ImageNet-22k pre-training with multi-modal pre-training gives a significant performance gain for ViT-Adapter-B (Mask R-CNN, 3x + MS schedule).
However, I couldn't find the multi-modal pre-trained ViT-B weights for ViT-Adapter-B.
Thank you in advance :)