Hi,
Thanks for your great work.
I am interested in embedding arithmetic and image retrieval, as in the example shown in Figure 4 of the paper.
In the paper, the embedding arithmetic is described as follows:
"For arithmetic, we again use the embedding features after temperature scaling. We ℓ2 normalize the features and sum the embeddings after scaling them by 0.5. We use the combined feature to perform nearest neighbor retrieval using cosine distance, as described above."
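To make my reading of this description concrete, here is a minimal sketch in plain Python. It is only my interpretation, not the authors' implementation: the helper names (`l2_normalize`, `combine`, `retrieve`) and the toy 3-dimensional vectors are hypothetical, and in practice the inputs would be the per-modality embeddings returned by the model.

```python
import math

def l2_normalize(vec):
    # Divide by the Euclidean norm (assumes the vector is not all zeros).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def combine(emb_a, emb_b, weight=0.5):
    # l2-normalize each embedding, scale by 0.5, and sum,
    # following the sentence quoted from the paper.
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    return [weight * x + weight * y for x, y in zip(a, b)]

def cosine_similarity(u, v):
    # Cosine similarity is the dot product of the l2-normalized vectors.
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(x * y for x, y in zip(u, v))

def retrieve(query, gallery, top_k=1):
    # Rank gallery embeddings by cosine similarity to the query (highest first).
    scores = [cosine_similarity(query, g) for g in gallery]
    ranked = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]

# Toy stand-ins for two modality embeddings and an image gallery (hypothetical data).
image_emb = [3.0, 0.0, 0.0]
audio_emb = [0.0, 2.0, 0.0]
combined_embs = combine(image_emb, audio_emb)

gallery = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
top = retrieve(combined_embs, gallery, top_k=1)
print(top)
```

One thing I noticed while writing this: ℓ2 normalization removes any positive scalar factor, so in this sketch a scalar temperature applied before normalization would cancel out. That is part of why I am unsure about the intended order of the two steps.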
To obtain the embedding features after temperature scaling, can I just use the following code, which applies normalization and temperature scaling for each modality (except for the image modality, where it only applies normalization)? Or should I modify the way the embeddings are returned by removing the normalization part and only doing temperature scaling? https://github.com/facebookresearch/ImageBind/blob/38a9132636f6ca2acdd6bb3d3c10be5859488f59/models/imagebind_model.py#LL422C1-L424C10
After obtaining the embedding features after temperature scaling, do I need to apply another ℓ2 normalization, something like: and then combine the embeddings of the two modalities?:
Then, do I just use the combined_embs and compute the cosine similarity with the embeddings of a set of images (extracted with step-1) that I want to retrieve images from?

I apologize for the long post. I greatly appreciate any tips and advice on how to approach this issue.
Many thanks!