[request] Semantic segmentation documentation, training code and / or model weights

facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.

Apache License 2.0


[request] Semantic segmentation documentation, training code and / or model weights #55

Open patricklabatut opened 1 year ago

patricklabatut commented 1 year ago

Related issues:

kanishkanarch commented 1 year ago

I would appreciate example code for semantic segmentation. I can't do much with the model's output embeddings yet. Please point me to the relevant reference if I am overlooking one.

innat-asj commented 1 year ago

STEGO, an unsupervised semantic segmentation model, used DINO v1.

cc. @mhamilton723

itsprakhar commented 1 year ago

I have created this repo ( https://github.com/itsprakhar/Downstream-Dinov2 ) where I am writing code for using DINOv2 for downstream tasks such as segmentation and classification. You can take a look, create an issue, or help improve it :)


innat-asj commented 1 year ago

@itsprakhar Ideally, there should be no need for a mask/label for downstream tasks, right? (for self-sup)

itsprakhar commented 1 year ago

@innat-asj, the pretraining does not require labels, but finetuning for downstream tasks does. However, the number of training samples required would be much smaller. The finetuning is kind of "few-shot finetuning": you need some examples because that's how you tell the model what you really want!

innat-asj commented 1 year ago

The finetuning is kind of "few-shot finetuning": you need some examples because that's how you tell the model what you really want!

I probably missed whether the paper also follows this approach for segmentation and depth estimation. Even if I only need a few samples, that approach would be considered semi-supervised.

Since DINO is meant to be self-supervised, I was wondering whether we have to fine-tune for downstream tasks using a target signal, or whether a contrastive loss could be used instead.

TimDarcet commented 1 year ago

Hi @innat-asj

DINO (and DINOv2) are self-supervised pretraining methods. Their goal is to create a pretrained vision encoder with only unlabeled data. This model can then output good embeddings that represent images.

They are not classification, segmentation or depth models. They are just pretrained encoders. You can, however, build a segmentation model using DINOv2 by adding a segmentation / depth / classification head and training the head. We show in the paper that the head can be extremely small (just a linear layer), be trained on very few samples (e.g. ~1k depth images for NYUv2) and still perform competitively, because the encoder outputs good representations. These heads still need labeled samples to be trained.
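To make the "frozen encoder plus linear head" idea concrete, here is a minimal sketch, not the training code from the paper. Everything in it is an assumption for illustration: `StandInEncoder` is a hypothetical placeholder that only mimics the shape of patch tokens so the snippet runs offline, and `LinearSegHead`, `train_step`, the 5 classes, the 384-dim embedding and the optimizer settings are all made up. With a real backbone you would swap in something like `torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")` and take the patch tokens from its output; check the repository for the exact call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH = 14  # patch size of the DINOv2 ViT models


class StandInEncoder(nn.Module):
    """Hypothetical placeholder for a frozen DINOv2 backbone.

    Maps images (B, 3, H, W) to patch tokens (B, N, D) with
    N = (H // PATCH) * (W // PATCH). It is NOT DINOv2, only a shape stand-in.
    """

    def __init__(self, embed_dim=384):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=PATCH, stride=PATCH)

    def forward(self, images):
        tokens = self.proj(images)                 # (B, D, H/14, W/14)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, D)


class LinearSegHead(nn.Module):
    """A single linear layer applied independently to every patch token."""

    def __init__(self, embed_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens, image_size):
        h, w = image_size
        b = patch_tokens.shape[0]
        logits = self.classifier(patch_tokens)     # (B, N, C)
        # Put the tokens back on the patch grid: (B, C, H/14, W/14).
        logits = logits.transpose(1, 2).reshape(b, -1, h // PATCH, w // PATCH)
        # Upsample the coarse patch-level logits to pixel resolution.
        return F.interpolate(logits, size=(h, w), mode="bilinear",
                             align_corners=False)


def train_step(encoder, head, optimizer, images, masks):
    """One supervised step: only the head receives gradients."""
    with torch.no_grad():                          # encoder stays frozen
        tokens = encoder(images)
    logits = head(tokens, images.shape[-2:])
    loss = F.cross_entropy(logits, masks)          # needs labeled masks
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
encoder = StandInEncoder().eval()
for p in encoder.parameters():
    p.requires_grad_(False)

head = LinearSegHead(embed_dim=384, num_classes=5)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Random data in place of a labeled segmentation dataset.
images = torch.randn(2, 3, 224, 224)               # H, W multiples of 14
masks = torch.randint(0, 5, (2, 224, 224))         # per-pixel class ids
loss = train_step(encoder, head, optimizer, images, masks)
```

Note that the loss is computed against `masks`, which is why this setup still needs labeled samples even though the encoder itself was pretrained without any.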

If you are looking for unsupervised segmentation, [STEGO] is a method that leverages DINO to do that.

[STEGO] https://arxiv.org/abs/2203.08414

innat-asj commented 1 year ago

@TimDarcet Thanks for the clarification.