facebookresearch / ImageBind

ImageBind: One Embedding Space To Bind Them All

help with embedding arithmetic and image retrieval #60

Open bakachan19 opened 1 year ago

bakachan19 commented 1 year ago

Hi, thanks for your great work. I am interested in embedding arithmetic and image retrieval, as in the example shown in Figure 4 of the paper.

In the paper, the embedding arithmetic is described as follows:

For arithmetic, we again use the embedding features after temperature scaling. We ℓ2 normalize the features and sum the embeddings after scaling them by 0.5. We use the combined feature to perform nearest neighbor retrieval using cosine distance, as described above.
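For concreteness, here is a minimal sketch of how I currently read that paragraph (the function and variable names are mine, not from the repo):

import torch
import torch.nn.functional as F

# My reading of the paper's recipe (not confirmed by the authors):
# take the temperature-scaled features, ℓ2-normalize each one,
# scale both by 0.5, and sum them to form the combined query.
def combine_query(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    return 0.5 * img + 0.5 * txt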

To obtain the embedding features after temperature scaling, can I just use the following code?

########## - step 1 - ##########
# Imports (assuming the repo-root layout used at the time of this issue)
import torch
import data
from models.imagebind_model import ModalityType

# `text_list`, `image_paths`, `audio_paths`, `device`, and the pretrained
# `model` are assumed to be set up as in the repo README.

# Load and preprocess the data for each modality
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

# One forward pass returns a dict of embeddings keyed by modality
with torch.no_grad():
    embeddings = model(inputs)

This code applies normalization and temperature scaling for each modality (except for the image modality, where only normalization is applied). Or should I instead modify how the embeddings are returned, removing the normalization and keeping only the temperature scaling? https://github.com/facebookresearch/ImageBind/blob/38a9132636f6ca2acdd6bb3d3c10be5859488f59/models/imagebind_model.py#LL422C1-L424C10
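(As a side note, my understanding is that ℓ2 normalization cancels any positive scalar introduced by the temperature scaling, so for cosine retrieval the two options might end up equivalent. A quick self-contained check of that intuition, with made-up tensors:)

import torch
import torch.nn.functional as F

# Sanity check (mine, not from the repo): ℓ2-normalizing removes any
# positive scalar factor, so cosine similarity is the same whether or
# not a temperature/logit scale was applied beforehand.
v = torch.randn(4, 1024)   # stand-in for an embedding batch
scaled = 20.0 * v          # stand-in for temperature/logit scaling
assert torch.allclose(F.normalize(v, dim=-1), F.normalize(scaled, dim=-1), atol=1e-6)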

After obtaining the temperature-scaled embedding features, do I need to apply another ℓ2 normalization? Something like:

########## - step 2 - ##########
img_embedding = embeddings[ModalityType.VISION]
txt_embedding = embeddings[ModalityType.TEXT]

# ℓ2-normalize each embedding along the feature dimension
img_embedding = img_embedding / torch.norm(img_embedding, dim=-1, keepdim=True)
txt_embedding = txt_embedding / torch.norm(txt_embedding, dim=-1, keepdim=True)

and then combine the embeddings of the two modalities like this?

combined_embs = 0.5 * img_embedding + 0.5 * txt_embedding

Then, do I just take combined_embs and compute its cosine similarity with the embeddings of a set of candidate images (extracted as in step 1) to retrieve the nearest images?
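In code, I imagine that retrieval step would look roughly like this (gallery_embs and retrieve are my own names, not an API from this repo):

import torch
import torch.nn.functional as F

# My sketch of the retrieval step: cosine similarity between the
# combined query and every gallery image embedding, then top-k.
def retrieve(combined_embs: torch.Tensor, gallery_embs: torch.Tensor, k: int = 5):
    query = F.normalize(combined_embs.reshape(-1), dim=-1)   # (D,)
    gallery = F.normalize(gallery_embs, dim=-1)              # (N, D)
    sims = gallery @ query                                   # (N,) cosine similarities
    return torch.topk(sims, k=k)                             # values and indices of nearest images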

I apologize for the long post. I greatly appreciate any tips and advice on how to approach this issue.

Many thanks!

gorjanradevski commented 9 months ago

I would also like to hear the authors' opinion on this.

SenmiaoORZ commented 9 months ago

Same here

bakachan19 commented 4 months ago

@gorjanradevski, @SenmiaoORZ, did you perhaps gain any new insights on this? I'm still curious about it. Thank you.