SHI-Labs / OneFormer

OneFormer: One Transformer to Rule Universal Image Segmentation, arxiv 2022 / CVPR 2023
https://praeclarumjj3.github.io/oneformer
MIT License

A question about paper and code #24

Closed: thiswinex closed this issue 1 year ago

thiswinex commented 1 year ago

Thanks for sharing your great work! I have several questions from reading the paper and the code. Hope to discuss them.

  1. About the contrastive loss: according to the paper, T_pad is a list of representations, one for each mask to be detected in the image. How is this correspondence maintained during training? In the code, the "i" in the pairs {qobj_i, xtxt_i} is simply the index, so q^text always matches the q^obj with the same index. But we are not supposed to know which object a given q^obj represents before the DETR-style decoder runs. Did I misunderstand the paper?

  2. Table 6 of the ablation study confuses me. The ablation looks like a kind of prompt engineering (of course it's not exactly that). I still can't see why adding "a photo with a" raises model performance. Does this paper use a pretrained text encoder? Do you have any further insight or explanation for this ablation?

praeclarumjj3 commented 1 year ago

Hi @thiswinex, thanks for your interest in our work. Please find the answers to your questions:

  1. This is a valid concern: we do not know in advance which index represents which object. It is an inherent assumption in the model that the object queries map to the text queries at the same indices, i.e., the two sets are taken to be in corresponding order (see the sketch after this list). We plan to analyze this in a future version of our work.

  2. The prompts play a critical role in the performance of vision-language models. We do not use a pretrained text encoder. Following CLIP, we use "a photo with a {CLS}" as our default prompt. We also tried some other prompts for exploration purposes. There remains scope for performance improvements with prompt engineering in future work.
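
To make point 1 concrete, the index alignment can be sketched roughly as follows in PyTorch. This is only an illustrative sketch, not the actual code in the repo; names like `query_text_contrastive_loss`, `obj_queries`, and `txt_queries` are placeholders. The key idea is that the i-th text query is assumed to be the positive for the i-th object query, so the contrastive targets are just the diagonal of the similarity matrix:

```python
import torch
import torch.nn.functional as F

def query_text_contrastive_loss(obj_queries, txt_queries, temperature=0.07):
    """Index-aligned contrastive loss for one image: txt_queries[i] is treated
    as the positive for obj_queries[i]; every other pair is a negative.

    obj_queries, txt_queries: (N, D) tensors (illustrative names).
    """
    obj = F.normalize(obj_queries, dim=-1)  # L2-normalize embeddings
    txt = F.normalize(txt_queries, dim=-1)

    # (N, N) similarity matrix; the diagonal holds the assumed positive pairs
    logits = obj @ txt.t() / temperature
    targets = torch.arange(obj.size(0), device=obj.device)

    # symmetric cross-entropy: object -> text and text -> object directions
    loss_o2t = F.cross_entropy(logits, targets)
    loss_t2o = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_o2t + loss_t2o)
```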

thiswinex commented 1 year ago

Thank you for your patient reply!

About the contrastive loss: I get your point. But the code still uses the Hungarian matcher for label assignment during training. Doesn't this inherent ordering assumption contradict the Hungarian matcher in some way? Would it be a better strategy to also match mask labels to queries in order?

praeclarumjj3 commented 1 year ago

Thanks for the question. A matching strategy that uses a simple fixed ordering imposes extra constraints, and Hungarian matching will eventually select the correct indices anyway. Still, this could be a good analysis experiment, as I mentioned earlier.
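
To contrast the two strategies, here is a small illustrative sketch (not the repo's actual matcher): Hungarian matching searches for the lowest-cost one-to-one assignment, while a fixed-order match simply pairs prediction i with ground truth i and so adds the extra constraint mentioned above:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost_matrix):
    """Optimal one-to-one assignment between predictions (rows) and
    ground-truth masks (columns), minimizing the total matching cost."""
    pred_idx, gt_idx = linear_sum_assignment(cost_matrix.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)

def ordered_match(num_preds, num_gts):
    """Fixed-order assignment: prediction i is forced onto ground truth i,
    regardless of the cost, which is the extra constraint discussed above."""
    idx = torch.arange(min(num_preds, num_gts))
    return idx, idx
```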

thiswinex commented 1 year ago

Thanks for your reply. Can't wait to see your future work. :)