Open WingFLY000 opened 3 months ago
Thank you for your interest in our work!
Before images and text enter their respective encoders, they live in different feature spaces. As they pass through the transformer layers, however, the modality gap between them gradually shrinks. This reduction is driven by effective metric losses such as the contrastive loss and the similarity distribution matching loss, which let CLIP pull the two modalities closer together until they are embedded in the same feature space.
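To make the mechanism concrete, here is a minimal numpy sketch of the symmetric contrastive (InfoNCE-style) objective that pulls matched image/text pairs together and pushes mismatched pairs apart. The function name, feature shapes, and the fixed temperature are illustrative assumptions; CLIP itself learns the temperature and trains with gradients.

```python
import numpy as np

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired features.

    image_feats, text_feats: (N, D) arrays; row i of each is a matched pair.
    Simplified sketch: CLIP uses a learnable temperature, not a constant.
    """
    # L2-normalize so dot products become cosine similarities
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    logits = img @ txt.T / temperature      # (N, N) similarity matrix
    labels = np.arange(len(img))            # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Symmetric: image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Minimizing this loss raises the similarity of true pairs relative to all in-batch negatives, which is what gradually closes the modality gap layer by layer.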
Additionally, CLIP trains the image encoder and text encoder simultaneously using image-text pairs, unlike the separate training of models like BERT and ResNet. This joint training approach makes it easier to narrow the distance between the two modalities.
Ultimately, we believe that computing the similarity between the two modalities using deep features is feasible, and our experiments have confirmed this.
Thank you for your answer
Thanks for answering! I have some questions about the common V&L embedding space. How is it built during the alignment process? Through a projection layer shared across modalities?
Certainly. In the CLIP model, the final image and text features are each mapped into the V&L embedding space through their own respective projectors, rather than a single shared one.
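A minimal sketch of that setup, assuming hypothetical dimensions: each modality has its own linear projector into a shared joint space, after which cosine similarity is directly meaningful. The weights here are random stand-ins for what CLIP learns during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: encoder outputs differ, joint space is shared.
D_IMG, D_TXT, D_JOINT = 768, 512, 256

# Per-modality linear projectors (learned in CLIP; random here for illustration)
W_img = rng.normal(size=(D_IMG, D_JOINT)) / np.sqrt(D_IMG)
W_txt = rng.normal(size=(D_TXT, D_JOINT)) / np.sqrt(D_TXT)

def to_joint(feat, W):
    """Project an encoder output into the joint space and unit-normalize it."""
    z = feat @ W
    return z / np.linalg.norm(z)

image_feat = rng.normal(size=D_IMG)   # stand-in for the image encoder output
text_feat = rng.normal(size=D_TXT)    # stand-in for the text encoder output

z_img = to_joint(image_feat, W_img)
z_txt = to_joint(text_feat, W_txt)

cosine_sim = float(z_img @ z_txt)     # now directly comparable
```

Because both projected vectors are unit-normalized and live in the same D_JOINT-dimensional space, their dot product is a valid cosine similarity even though the encoders' raw outputs had different dimensionalities.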
Thank you for open-sourcing your code. I have a question: we know that text and image features are usually not mapped into the same space, so how did you solve the problem of the two feature types not sharing a space when computing cosine similarity?