SamsungLabs / AdaCLIP

This repository contains the code for AdaCLIP, a computation and latency-aware system for pragmatic multimodal video retrieval.

Downstream inference using Faiss #1

Open feemthan opened 1 month ago

feemthan commented 1 month ago

Hello Team,

Thank you for your amazing work on this model. I was able to reproduce your remarkable results. I would like to contribute by developing downstream inference with Faiss, but I am running into a lot of issues. The cosine similarity search returns irrelevant results:

08/03/2024 22:03:10 - INFO - main - Setup model...
08/03/2024 22:03:11 - INFO - main - Using CLIP pretrained weights...
08/03/2024 22:03:17 - INFO - main - Setup model done!
Loaded existing embeddings.
08/03/2024 22:03:17 - INFO - main - Loading metadata...
08/03/2024 22:03:17 - INFO - main - Metadata loaded
Top 5 results for 'a woman eating':

  1. Distance: 3.3629, Index: 126 Caption: 3d animation music video song Path: video7136.mp4
  2. Distance: 3.0853, Index: 759 Caption: there are some people flying in a helicopter Path: video7769.mp4
  3. Distance: 3.0298, Index: 769 Caption: two men examine a red lamborghini with no tires Path: video7779.mp4
  4. Distance: 3.0025, Index: 176 Caption: a man hugs another man in outer space Path: video7186.mp4
  5. Distance: 2.9686, Index: 55 Caption: a band performs Path: video7065.mp4

I use faiss.IndexFlatIP, which scores by inner product. How can I get better retrieval results on the MSRVTT dataset?

angelaaye commented 1 month ago

Hi @feemthan, thank you for your interest. Could you elaborate on how you are obtaining the distance values? Does the "distance" you print refer to cosine similarity? Cosine similarity has a maximum value of 1, but the distance values you have printed are around 3. Could you check that the embeddings you are using are normalized?
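
To illustrate the normalization point: an inner-product search only returns cosine similarity when both the query and the stored vectors have unit L2 norm. Below is a minimal pure-Python sketch with made-up 2-D vectors. The helper names `l2_normalize` and `inner_product` and the toy data are illustrative assumptions, not part of AdaCLIP or Faiss.

```python
import math


def l2_normalize(vec):
    """Return a unit-length copy of vec (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def inner_product(a, b):
    """Plain dot product, the score an inner-product index ranks by."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy embeddings, NOT real AdaCLIP outputs.
query = [3.0, 4.0]
videos = [[6.0, 8.0], [4.0, -3.0], [1.0, 0.0]]

# Raw inner products are unbounded and grow with vector magnitude.
raw_scores = [inner_product(query, v) for v in videos]

# After L2-normalizing both sides, the inner product is the cosine
# similarity and therefore lies in [-1, 1].
unit_query = l2_normalize(query)
cos_scores = [inner_product(unit_query, l2_normalize(v)) for v in videos]

print("raw inner products :", raw_scores)
print("cosine similarities:", [round(s, 4) for s in cos_scores])
```

With real embeddings stored as float32 NumPy arrays, `faiss.normalize_L2` performs the same normalization in place. It would need to be applied to the database vectors before `index.add` and to the query vector before `index.search`; scores above 1 from `IndexFlatIP` suggest that at least one side was not normalized.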