facebookresearch / MetaCLIP

ICLR2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Experts via Clustering

Image-Text Similarity Score #57

Closed bryanbocao closed 2 months ago

bryanbocao commented 2 months ago

Dear Authors,

Thanks for open-sourcing the work!

I tried to understand the image-text similarity score:

logits_per_image = outputs.logits_per_image  # this is the image-text similarity score

in Quick Start:

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("facebook/metaclip-b32-400m")
model = AutoModel.from_pretrained("facebook/metaclip-b32-400m")

image = Image.open("docs/CLIP.png")
inputs = processor(text=["a diagram", "a dog", "a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
    text_probs = logits_per_image.softmax(dim=-1)
print("Label probs:", text_probs)

Specifically, I have difficulty understanding how logits can represent an image-text similarity score. It would be great if you could explain that further.

To me, an image-text similarity score is computed from the similarity between the image and the text in the feature space. Two examples would be the distance ||image_features - text_features|| or the cosine similarity cos(image_features, text_features), as in the following code (with a sanity check against logits_per_image after it):

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-quickgelu', pretrained='metaclip_400m')  # for 2.5B use 'metaclip_fullcc' in OpenCLIP or 'metaclip_2_5b' in this repo

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
bryanbocao commented 2 months ago

I assume there is a dot-product step that represents the two modalities' similarity, something like np.dot(I_embed, T_embed.T), inside outputs = model(**inputs); then it makes sense to me. I will check the code inside model() when I have a chance later. Thanks!
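
In other words, the forward pass I have in mind is roughly the following (my own paraphrase of a CLIP-style forward pass, not the actual transformers source):

import torch

def clip_style_logits(image_features, text_features, logit_scale):
    # L2-normalize both modalities so the dot product equals cosine similarity
    image_embeds = image_features / image_features.norm(dim=-1, keepdim=True)
    text_embeds = text_features / text_features.norm(dim=-1, keepdim=True)
    # the "logits" are the cosine similarities scaled by the learned temperature exp(logit_scale)
    logits_per_image = logit_scale.exp() * image_embeds @ text_embeds.T
    logits_per_text = logits_per_image.T
    return logits_per_image, logits_per_text

If that is right, softmax over logits_per_image gives the same probabilities as softmax over the scaled cosine similarities, which would also match the hard-coded 100.0 factor in the open_clip snippet above.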