How is $Q_{label}$ updated?

lchen1019 commented 2 months ago

Hi. As mentioned in your paper, $Q_{label}$ is the key to CLIP2SAM. I noticed that $Q_{label}$ is a learnable token, am I right? The paper also mentions: 'The final labels are derived by calculating the distance between the refined label token and the CLIP text embedding, as in Equ. (1)'. I take this to mean that $Q_{label}$ is aligned with the text embeddings, and the class label is then obtained through cosine similarity. However, I found that in your code the RoI embeddings do not include $Q_{label}$, as shown here: https://github.com/HarborYuan/ovsam/blob/137d2c2e6daea060668cf50d7c966ed86e9c45ce/seg/models/heads/ovsam_head.py#L219 So where does $Q_{label}$ get its gradient for updating? This confuses me. Looking forward to your reply. Thank you in advance!
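For readers following along, the cosine-similarity reading of the quoted sentence can be sketched as below. This is only a minimal illustration of that interpretation, not the paper's Equ. (1) or the repository's code: the function names `cosine_similarity` and `classify`, the class names, and the 3-dimensional toy vectors are all made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(label_token, text_embeddings):
    """Return the class whose text embedding is closest to the label token.

    `text_embeddings` maps a class name to its embedding (hypothetical toy
    stand-ins for CLIP text embeddings).
    """
    scores = {
        name: cosine_similarity(label_token, emb)
        for name, emb in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: assumed values, chosen only to make the example readable.
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
label_token = [0.2, 0.9, 0.1]  # stands in for a refined label token

predicted, scores = classify(label_token, text_embeddings)
print(predicted)
```

In this toy setup the label token points mostly along the "dog" axis, so that class gets the highest similarity. For such a token to be trained, it would have to take part in the computation that produces the loss.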

HarborYuan commented 2 months ago

Hi @lchen1019

In our paper, $Q_{label}$ is used to describe the "straightforward approach" (please refer to Section 3.2). The code in this repo corresponds to the FPN approach, which is the one adopted by the final ovsam.

The cis_embd here has no effect at all.

Hope this can help.
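As a general way to check whether a learnable parameter has any effect, one can run a backward pass and look for parameters whose `.grad` is still `None`. The sketch below assumes PyTorch and uses a hypothetical `ToyHead` module with a made-up `label_token` parameter; it is not the ovsam head, only an illustration of the check.

```python
import torch
from torch import nn


class ToyHead(nn.Module):
    """Hypothetical head: holds a learnable token that forward() never uses."""

    def __init__(self, dim=4, num_classes=3):
        super().__init__()
        # Registered as a parameter, but not part of the forward computation.
        self.label_token = nn.Parameter(torch.zeros(1, dim))
        self.proj = nn.Linear(dim, num_classes)

    def forward(self, roi_feats):
        # Only the RoI features and the projection contribute to the output.
        return self.proj(roi_feats)


torch.manual_seed(0)
head = ToyHead()
logits = head(torch.randn(2, 4))
loss = logits.sum()
loss.backward()

# Parameters that never entered the graph keep grad == None after backward().
unused = [name for name, p in head.named_parameters() if p.grad is None]
print(unused)
```

A parameter listed in `unused` receives no gradient, so an optimizer step driven by this loss leaves it at its initial value (weight decay aside).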

lchen1019 commented 2 months ago

No wonder... thank you a lot!

HarborYuan / ovsam

How is $Q_{label}$ updated? #42