NoelShin / reco

[NeurIPS'22] ReCo: Retrieve and Co-segment for Zero-shot Transfer
https://www.robots.ox.ac.uk/~vgg/research/reco/
MIT License

Discussing PWC Section #1

Open · opened 2 years ago by mhamilton723

mhamilton723 commented 2 years ago

Hello, congrats on the release of your fantastic work. I love the fact that you can use language to prompt the segmentation, and we appreciate you citing and comparing against STEGO!

Wanted to quickly reach out regarding how you would like to collectively manage the Papers with Code section on unsupervised segmentation. Because CLIP is trained with image-language pairs and you use this to generate the attention maps, I think this work might fall under weakly supervised methods such as either of these:

https://paperswithcode.com/task/weakly-supervised-object-localization
https://paperswithcode.com/task/weakly-supervised-semantic-segmentation

Let me know what you think about this proposal - I'm happy to discuss it further. Congrats again on making your work public!

Best, Mark

hq-deng commented 2 years ago

I have the same confusion. It seems this work avoids being framed in the unsupervised learning setup because it is claimed as zero-shot adaptation, yet the experiments compare against unsupervised segmentation methods. If one only uses CLIP without labels, the setting may be close to unsupervised; but if a label for each image is provided, it should be a weakly-supervised setting. Besides, DenseCLIP is trained with pixel-level annotations, so it could be a zero-shot task, not an unsupervised one. I am wondering: if we only use a CLIP model (without image-level or pixel-level labels), how should we define the task? It seems unfair to compare against both unsupervised and weakly-supervised methods.

NoelShin commented 2 years ago

Hi both,

Thank you for your input.

First, to clarify a couple of points raised in @hq-deng's comment:

  1. The DenseCLIP model we use is not trained with pixel-level labels. Note that there are two models called DenseCLIP: "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting" https://arxiv.org/pdf/2112.01518.pdf (which does use pixel-level labels) and "DenseCLIP: Extract Free Dense Labels from CLIP" https://arxiv.org/pdf/2112.01071.pdf (which does not). We use the latter (see the sketch after this list).
  2. We do not use image-level labels for ReCo inference or for training the segmentation model (ReCo+). One of the backbone models we use in our experiments is pre-trained for classification on ImageNet with labels. Here we follow the use of the word "unsupervised" in previous literature (such as PiCIE https://arxiv.org/pdf/2103.17070.pdf and SegSort https://arxiv.org/pdf/1910.06962.pdf).
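
For intuition, here is a minimal sketch of the core idea behind the label-free DenseCLIP variant: each pixel is assigned the class whose CLIP text embedding is most similar to that pixel's dense visual feature. The tensor shapes and random inputs below are placeholders for illustration only; the actual method extracts dense features from a frozen CLIP image encoder (see the paper for the exact recipe).

```python
# Sketch only: per-pixel zero-shot classification with CLIP-style embeddings.
# Shapes and inputs are hypothetical; in practice dense_feats would come from
# a frozen CLIP image encoder and text_embs from CLIP text prompts per class.
import torch
import torch.nn.functional as F

def dense_zero_shot_segment(dense_feats: torch.Tensor,
                            text_embs: torch.Tensor) -> torch.Tensor:
    """dense_feats: (D, H, W) per-pixel visual features.
    text_embs: (C, D) text embeddings, one per class prompt
    (e.g. "a photo of a {class}"). Returns an (H, W) label map."""
    D, H, W = dense_feats.shape
    feats = F.normalize(dense_feats.flatten(1).t(), dim=-1)  # (H*W, D)
    texts = F.normalize(text_embs, dim=-1)                   # (C, D)
    logits = feats @ texts.t()           # (H*W, C) cosine similarities
    return logits.argmax(dim=-1).view(H, W)  # per-pixel class index

# Toy usage with random tensors standing in for real CLIP features.
label_map = dense_zero_shot_segment(torch.randn(512, 32, 32),
                                    torch.randn(21, 512))
print(label_map.shape)  # torch.Size([32, 32])
```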

@mhamilton723, we did not consider our work to be weakly supervised because we are not training for segmentation on images with class labels (in the same way that PiCIE does not refer to itself as weakly supervised). On the other hand, we recognise that there is a spectrum of supervision from zero supervision up to fully supervised, and that by using CLIP we are not at the "zero" end of the spectrum. As such, a precise name would be useful to avoid confusion.

In response to your question, we thought that perhaps "Unsupervised Semantic Segmentation with Language-Image Pre-training" could be a better fit for the task setting considered by ReCo (and the DenseCLIP baseline we consider in our paper). If this name seems appropriate to you both (feedback is highly welcome - it would be good for us to get the right name), we will create a branch for the task in Papers with Code.

Gyungin

hq-deng commented 2 years ago

Hello @NoelShin,

Thanks for your comprehensive answer. This is a novel and interesting segmentation setting. Although the approach is difficult to categorise, you are boldly exploring it. Congratulations on your groundbreaking work.

mhamilton723 commented 2 years ago

Hey @NoelShin, thanks for the detailed reply. I think it might be a good idea to split this leaderboard out into one that uses supervised pre-training, as you suggested. In some sense, text labels provide even more supervision than classes or tags, which is why I originally suggested the weakly supervised categories. Thanks for being flexible and understanding on this topic :)