NVlabs / GroupViT

Official PyTorch implementation of GroupViT: Semantic Segmentation Emerges from Text Supervision, CVPR 2022.
https://arxiv.org/abs/2202.11094

Question about zero-shot transfer to semantic segmentation #22

Closed: CyrilKZ closed this issue 2 years ago

CyrilKZ commented 2 years ago

Hi, thank you for your great work.

I noticed that when generating segmentation masks, soft assignment matrices are used instead of hard assignment matrices (segmentation/evaluation/group_vit_seg.py, line 166). Although the product of the soft assignment matrices is converted to one-hot before classifying pixels, this differs from your paper, which suggests directly multiplying the hard assignment matrices.

In fact, changing line 166 from attn_masks = attn_dict['soft'] to attn_masks = attn_dict['hard'] makes the demo yield a worse segmentation result.

Am I misunderstanding the code or missing some implementation details from your paper?

xvjiarui commented 2 years ago

Hi @CyrilKZ. During training, we use the hard assignment. For evaluation, the soft assignment yields better results.
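
For readers following along, the two options discussed above can be compared with a small sketch. This is not the code from group_vit_seg.py: it uses NumPy with random logits, assumes two grouping stages, and assumes each assignment matrix has shape (tokens, groups) with the softmax taken over groups. The helper names softmax and to_one_hot are made up for illustration.

```python
import numpy as np

def softmax(x):
    # Softmax over the last axis (the "groups" axis in this sketch).
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def to_one_hot(x):
    # Replace each row by a one-hot vector at its argmax.
    return np.eye(x.shape[-1])[x.argmax(axis=-1)]

# Assumed toy sizes: 12 patches -> 4 groups -> 2 groups.
rng = np.random.default_rng(0)
num_patches, g1, g2 = 12, 4, 2
logits1 = rng.normal(size=(num_patches, g1))  # stage 1: patches to groups
logits2 = rng.normal(size=(g1, g2))           # stage 2: groups to groups

soft1, soft2 = softmax(logits1), softmax(logits2)
hard1, hard2 = to_one_hot(soft1), to_one_hot(soft2)

# Option A (what the issue describes in the evaluation code):
# multiply the soft matrices, then convert the product to one-hot.
soft_path = to_one_hot(soft1 @ soft2)

# Option B (what the issue reads from the paper):
# multiply the hard (one-hot) matrices directly.
hard_path = hard1 @ hard2

# Fraction of patches that end up in the same final group either way.
agreement = (soft_path.argmax(-1) == hard_path.argmax(-1)).mean()
print(f"patches assigned identically: {agreement:.2f}")
```

Both paths produce a one-hot patch-to-group mask. They can disagree because the soft product lets a patch's second-best stage-1 group influence its final group, whereas the hard product commits to a single group at every stage.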

CyrilKZ commented 2 years ago

Thanks for your reply :)

pzhren commented 1 year ago

Thanks for your reply :) Hi, I also tested the performance of GroupViT on ADE and Cityscapes, and it is only about 6%. I don't know if I made a mistake somewhere. If not, why is it so low?

xvjiarui commented 1 year ago

Hi @pzhren. In the inference pipeline, we always resize the image so that its short side is 448. Due to the patch-dividing process, GroupViT may miss segments smaller than 16px. On detailed, high-resolution datasets like ADE and Cityscapes, some classes are too small for GroupViT.
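
To make the size argument concrete, here is a rough arithmetic sketch. It assumes an aspect-preserving resize to a 448 short side and a 16px patch, as stated above, and uses the 1024x2048 resolution of Cityscapes images. The function names and the rounding behaviour are assumptions for illustration, not the repository's actual preprocessing code.

```python
def resized_shape(height, width, short_side=448):
    # Aspect-preserving resize so that the shorter side equals short_side.
    scale = short_side / min(height, width)
    return round(height * scale), round(width * scale), scale

def patch_grid(height, width, patch=16):
    # Number of whole patches along each axis.
    return height // patch, width // patch

# A Cityscapes frame is 1024 x 2048 pixels (height x width).
new_h, new_w, scale = resized_shape(1024, 2048)
grid_h, grid_w = patch_grid(new_h, new_w)
print(f"resized to {new_h}x{new_w}, patch grid {grid_h}x{grid_w}")

# Hypothetical object that is 30 px wide in the original image.
object_px = 30
resized_px = object_px * scale
print(f"{object_px}px object -> {resized_px:.1f}px, "
      f"smaller than one patch: {resized_px < 16}")
```

Under these assumptions, anything narrower than roughly 37 px in the original Cityscapes frame shrinks below a single 16px patch after resizing, which matches the explanation that small classes are easy to miss.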