patricklabatut opened this issue 1 year ago
I would appreciate example code for semantic segmentation. I can't do much with the model's output embeddings yet. Kindly point me to a relevant reference if I am overlooking one.
STEGO, an unsupervised semantic segmentation model, used DINO v1.
cc. @mhamilton723
I have created this repo (https://github.com/itsprakhar/Downstream-Dinov2) where I am writing code for using DINOv2 for downstream tasks such as segmentation and classification. You can take a look, create an issue, or help improve it :)
@itsprakhar Ideally, there should be no need for a mask/label for downstream tasks, right? (for a self-supervised approach)
@innat-asj, the pretraining does not require labels, but fine-tuning for downstream tasks does. However, the number of training samples required would be much smaller. The fine-tuning is a kind of "few-shot fine-tuning": you need some examples because that's how you tell the model what you really want!
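To make that concrete, here is a rough sketch (not code from the paper or the repo above) of what such few-shot fine-tuning could look like: a frozen DINOv2 backbone with a small trainable head fit on a handful of labeled images. The torch.hub entry point is the published DINOv2 one; the class count and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

# Frozen DINOv2 backbone (ViT-S/14) with a small trainable classification head.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 10  # placeholder label set for the downstream task
head = nn.Linear(backbone.embed_dim, num_classes)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: (B, 3, H, W) with H and W multiples of 14; labels: (B,) class indices."""
    with torch.no_grad():
        embeddings = backbone(images)  # (B, embed_dim) global image embedding
    logits = head(embeddings)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# A "few-shot" setup only needs a handful of labeled images per class.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
print(train_step(images, labels))
```

Only the head is updated here; the backbone stays frozen, which is why so few labeled examples are needed.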
I probably missed whether this is also how it's done in the paper for segmentation and depth estimation. Because even if only a few samples are needed, that approach would be understood as semi-supervised.
Now, since DINO is meant to be self-supervised, I was wondering: do we have to fine-tune for downstream tasks using a target signal, or could a contrastive loss be used instead?
Hi @innat-asj
DINO (and DINOv2) are self-supervised pretraining methods. Their goal is to create a pretrained vision encoder using only unlabeled data. This model can then output good embeddings that represent images.
They are not classification, segmentation, or depth models; they are just pretrained encoders. You can, however, build a segmentation model using DINOv2 by adding a segmentation / depth / classification head and training only the head. We show in the paper that the head can be extremely small (just a linear layer), be trained on very few samples (e.g. ~1k depth images for NYUv2), and still perform competitively, because the encoder outputs good representations. These heads still need labeled samples to be trained.
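As a rough illustration (a sketch, not the exact evaluation code from the paper), a linear segmentation head on top of frozen DINOv2 patch features could look like the following, assuming the torch.hub entry point and the backbone's get_intermediate_layers(..., reshape=True) API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen DINOv2 backbone; ViT-S/14 chosen here only to keep the example small.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

NUM_CLASSES = 21  # e.g. Pascal VOC: 20 classes + background

# "Linear" segmentation head: a single 1x1 convolution over the patch feature map.
head = nn.Conv2d(backbone.embed_dim, NUM_CLASSES, kernel_size=1)

def segment(images):
    """images: (B, 3, H, W) with H and W multiples of the 14-pixel patch size."""
    with torch.no_grad():
        # reshape=True returns the last-layer patch tokens as a (B, C, H/14, W/14) map.
        feats = backbone.get_intermediate_layers(images, n=1, reshape=True)[0]
    logits = head(feats)  # (B, NUM_CLASSES, H/14, W/14)
    # Upsample coarse patch-level logits back to the input resolution.
    return F.interpolate(logits, size=images.shape[-2:], mode="bilinear", align_corners=False)

# Only `head` has trainable parameters; it can be fit with cross-entropy on a small labeled set.
out = segment(torch.randn(1, 3, 448, 448))
print(out.shape)  # torch.Size([1, 21, 448, 448])
```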
If you are looking for unsupervised segmentation, [STEGO] is a method that leverages DINO to do that.
[STEGO]: https://arxiv.org/abs/2203.08414
@TimDarcet Thanks for the clarification.
Has anyone managed to reproduce the segmentation results (82.5 mIoU) on the Pascal VOC 2012 dataset?
How can I get the semantic segmentation documentation and training code?
Related issues: #6, #25, #47, #80, #84, #99