beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"

How to calculate the similarity between image-text pairs? #9

Yu-xm closed this issue 2 months ago

Yu-xm commented 2 months ago

How to calculate the similarity between image-text pairs?

beichenzbc commented 2 months ago

We adopt the same strategy as CLIP: a simple matrix multiplication between the image features and the text features gives the similarity.

You may refer to CLIP (https://github.com/openai/CLIP) for further details.
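As a minimal sketch of the matrix-multiply approach described above, the snippet below L2-normalizes each feature vector and multiplies the two matrices, so every entry is a cosine similarity. It assumes the image and text features have already been extracted by the model. The function name `cosine_similarity_matrix` and the small arrays are hypothetical placeholders for illustration, not real model outputs or part of the Long-CLIP code.

```python
import numpy as np

def cosine_similarity_matrix(image_features, text_features):
    """Return an (n_images, n_texts) matrix of cosine similarities."""
    image_features = np.asarray(image_features, dtype=np.float64)
    text_features = np.asarray(text_features, dtype=np.float64)
    # L2-normalize each row so that a dot product equals the cosine similarity.
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    # A single matrix multiply scores every image against every caption.
    return image_features @ text_features.T

# Placeholder embeddings (NOT real model outputs): 1 image, 3 captions, dimension 4.
image_emb = np.array([[1.0, 0.0, 0.0, 0.0]])
caption_embs = np.array([
    [2.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
])

sims = cosine_similarity_matrix(image_emb, caption_embs)
print(sims)  # one row per image, one column per caption
```

With real features, each row of `image_features` would come from the image encoder and each row of `text_features` from the text encoder; the normalization step is what makes caption length irrelevant to the magnitude of the score.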

Yu-xm commented 2 months ago

Thank you for your reply. I have another question:

I tried to calculate the similarity using the first example in Fig. 5 of your paper. I shortened the caption, so I now have three captions of different lengths that correspond to the same image. The captions are as follows:

1. "Man in black jacket crosses city street with green light and colorful cars."
2. "A man in a black jacket crosses a busy street with colorful cars, under a green traffic light, flanked by tall buildings and trees, under a clear sky."
3. "A man in a black jacket is crossing a busy city street. The street is filled with cars of various colors, including yellow taxis and red trucks. A traffic light hangs overhead, currently displaying a green signal. The perspective of the photo is from the sidewalk, giving a sense of being part of the city's hustle and bustle. The sky above is clear, suggesting good weather. The street is lined with tall buildings and trees, creating a vibrant cityscape."

In theory, caption 3 should be more similar to the image than caption 2, and caption 2 more similar than caption 1. In my results, however, caption 2 is more similar than caption 1, and caption 1 is more similar than caption 3. Can you explain this phenomenon?

beichenzbc commented 2 months ago

Would you please share the relative scores of these three captions?

As all three are positive pairs, and the second caption already covers most of the useful information, we think it's quite reasonable if the gap is relatively small.
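One way to inspect relative scores is to look at the raw cosine similarities next to their softmax-normalized values. The sketch below assumes a temperature of 100, as in the usage example in the CLIP README. The helper name `relative_scores` is hypothetical, and the similarity values are made up purely for illustration; they are not measurements from Long-CLIP or from this issue.

```python
import numpy as np

def relative_scores(cosine_sims, logit_scale=100.0):
    """Softmax over scaled cosine similarities for one image and several captions."""
    sims = np.asarray(cosine_sims, dtype=np.float64)
    logits = logit_scale * sims
    # Subtract the maximum before exponentiating for numerical stability.
    logits = logits - logits.max()
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# Made-up cosine similarities for captions 1, 2, 3 (illustrative only, NOT real results).
example_sims = [0.30, 0.31, 0.29]

probs = relative_scores(example_sims)
# Indices of the captions sorted from highest to lowest score.
ranking = np.argsort(-probs)

for idx in ranking:
    print(f"caption {idx + 1}: cosine={example_sims[idx]:.2f}, softmax={probs[idx]:.3f}")
```

Because the softmax amplifies differences, a gap of 0.01 in cosine similarity can look large after normalization. Comparing the raw cosine values shows whether the ordering reflects a meaningful gap or a small one.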