CLIP text model output. How/Why two outputs word_features & sentence features?

3dlg-hcvc / M3DRef-CLIP

[ICCV 2023] Multi3DRefer: Grounding Text Description to Multiple 3D Objects

https://3dlg-hcvc.github.io/multi3drefer/

MIT License


Issue #4

Closed kochsebastian closed 10 months ago

kochsebastian commented 10 months ago

I was wondering why you expect two outputs when calling word_features, sentence_features = self.clip_model.encode_text(clip_tokens) here.

As far as I understand, you are using a vanilla CLIP model, which returns only one embedding from clip_model.encode_text(). Evidently that can't be the case, since you are expecting two different embeddings. So where did you implement the custom functionality to get two embeddings from encode_text()?

Xiaolong-RRL commented 10 months ago

Hi, see here

eamonn-zh commented 10 months ago

Yes, exactly. @Xiaolong-RRL Thank you for helping answer it!
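
For readers without the linked code at hand, here is a rough sketch of how an encode_text() variant could return two outputs. It is not the repository's implementation, only an illustration under stated assumptions: the helper name split_text_features is hypothetical, plain Python lists stand in for tensors so the snippet has no dependencies, and the final layer norm and text projection that CLIP applies are left out. The one fact it relies on is that upstream CLIP pools the sentence embedding at the end-of-text token, which it locates as the position of the highest token id in each sequence.

```python
from typing import List, Tuple

Vector = List[float]


def split_text_features(
    token_ids: List[List[int]],
    token_features: List[List[Vector]],
) -> Tuple[List[List[Vector]], List[Vector]]:
    """Hypothetical helper: return (word_features, sentence_features).

    token_ids:      one list of token ids per sentence (padded with 0).
    token_features: one feature vector per token, per sentence.
    """
    sentence_features = []
    for ids, feats in zip(token_ids, token_features):
        # CLIP's end-of-text token has the highest id in the vocabulary,
        # so its position is the argmax over the token ids.
        eot_pos = max(range(len(ids)), key=ids.__getitem__)
        sentence_features.append(feats[eot_pos])
    # Word features are simply the per-token features, left unpooled.
    return token_features, sentence_features


# Toy batch with one sentence: start token, two words, end token, padding.
ids = [[49406, 320, 1929, 49407, 0, 0]]
feats = [[[0.0], [0.1], [0.2], [0.3], [0.4], [0.5]]]
word_features, sentence_features = split_text_features(ids, feats)
```

In this toy batch, sentence_features holds the vector at index 3 (the end-of-text position), while word_features keeps all six per-token vectors. A vanilla encode_text() would return only the pooled vector.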