PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License
723 stars, 52 forks

Embedding similarity #60

Open akBear23 opened 3 months ago

akBear23 commented 3 months ago

The similarities between the embeddings of text, video, audio, etc. are not high, usually around 0.1 to 0.3. Given such low values, how do we know how relevant the embeddings are to each other? Can this encoder be trusted for downstream tasks such as semantic search in video? If so, what is the appropriate way to use these embeddings?
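For context, here is a minimal sketch of how low raw similarities are often handled with contrastively trained encoders: candidates are ranked by relative similarity rather than judged against an absolute threshold, and the scores are multiplied by a scale factor before a softmax. This is not confirmed as LanguageBind's recommended usage. The helper names and toy vectors are hypothetical, and `logit_scale=100.0` is an assumed placeholder of a magnitude common in CLIP-style models; the real value should be read from the model checkpoint.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_by_similarity(query, candidates):
    """Return (index, cosine) pairs sorted from most to least similar."""
    scores = [(i, cosine(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

def softmax_with_scale(scores, logit_scale=100.0):
    """Turn raw cosines into a distribution over candidates.

    logit_scale is an ASSUMED value, not taken from LanguageBind.
    """
    logits = [logit_scale * s for s in scores]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (hypothetical): one text query, three video candidates.
query = [1.0, 0.0, 0.0]
candidates = [[0.2, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_by_similarity(query, candidates)

# Raw cosines in the range reported above still separate clearly once scaled.
raw_cosines = [0.28, 0.15, 0.12]
probs = softmax_with_scale(raw_cosines)
```

In this sketch, what matters for semantic search is the ordering in `ranking` and the gap between candidates, not whether any single score is close to 1.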