chokevin8 closed this 6 months ago
Hi,
The aim of this paper was not to achieve strong performance on Camelyon16 (even though we get very high AUC) but rather to compare Vim and ViT under a similar self-supervised framework, specifically DINO, given the robust performance and appealing properties of ViTs trained with that framework. We wanted to show that Vim can outperform ViT models in a framework that favors ViT encoders.
We haven't tested DINOv2, and it could be a framework to explore. However, note the following from the second DINOv2 paper:
"When used to extract features, it delivers disappointing performance, only on par with supervised alternative backbones in this scenario. This suggests that DINOv2 behaves differently than DINO. The investigation described in this work notably exposes the presence of artefacts in the feature maps of DINOv2 that were not present in the first version of this model."
Hence, a model trained with DINOv2 might not have the properties discussed in Section 5.4, "Explainability", of our paper.
Let me know if you have any questions.
Thank you for your comment, I really appreciate it! Yes, I understand that the purpose of the paper was to compare Vim and ViT. I was just wondering because many other vision foundation models, in pathology and in other imaging modalities, use DINOv2 for self-distillation and achieve good performance.
I overlooked the explainability part, thank you for mentioning that. If the attention maps are created from the CLS token, as in your paper, maybe we could add register tokens (from the paper you mentioned) during training? I may try DINOv2 and see how it does with and without registers. Thank you for your response!
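To make the register idea concrete, here is a minimal, framework-free sketch of how a ViT-style CLS attention map could be read out when register tokens are present. It is only an illustration under stated assumptions: the token order `[CLS, registers, patches]`, the single attention head, and the function name `cls_attention_map` are all hypothetical, and this is not how the paper computes its maps for Vim.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cls_attention_map(cls_query, keys, num_registers, grid_hw):
    """Attention of the CLS query over patch tokens, as an h x w grid.

    Assumed (hypothetical) token layout in `keys`:
        [CLS, reg_1 .. reg_R, patch_1 .. patch_N], with N = h * w.
    Registers take part in the softmax but are dropped from the map,
    so the returned values sum to less than 1.
    """
    h, w = grid_hw
    scale = 1.0 / math.sqrt(len(cls_query))
    scores = [scale * sum(q * k for q, k in zip(cls_query, key)) for key in keys]
    attn = softmax(scores)                  # sums to 1 over all tokens
    patch_attn = attn[1 + num_registers:]   # drop CLS and register entries
    if len(patch_attn) != h * w:
        raise ValueError("number of patch tokens does not match grid size")
    return [patch_attn[r * w:(r + 1) * w] for r in range(h)]


# Toy usage with random vectors: 4 registers, a 2 x 3 patch grid.
random.seed(0)
dim, num_registers, grid = 8, 4, (2, 3)
num_tokens = 1 + num_registers + grid[0] * grid[1]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
query = [random.gauss(0, 1) for _ in range(dim)]
for row in cls_attention_map(query, keys, num_registers, grid):
    print(["%.3f" % v for v in row])
```

Setting `num_registers=0` gives the no-register case, so the same readout can be used to compare maps from models trained with and without registers.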
Thank you for your interest in our work! Since Vim models are a new type of encoder, they have yet to be explored in many self-supervised settings. Training them with DINOv2, with and without registers, is an exciting topic. Because their behavior differs from that of ViT models, they might work fine without registers or, conversely, benefit greatly from them.
Looking forward to seeing your results!
Is there any reason why you guys utilized DINO instead of DINOv2? Was performance worse when using DINOv2?