Closed mcaccin closed 4 months ago
Thank you for your question! We cite both references, CLIPScore and CLIP, and adhere to the CLIPScore calculation in our paper. Here we use the calculation method from the CLIP repository:
```python
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
```
In our approach we omit the constant multiplier 100 and the softmax function. Hope this helps!
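A minimal sketch of what that simplification looks like (assuming PyTorch tensors; the shapes and variable names here are illustrative, not taken from the repo):

```python
import torch

# Illustrative batch of embeddings (shapes are arbitrary for this sketch).
image_features = torch.randn(4, 512)
text_features = torch.randn(4, 512)

# L2-normalize each embedding along the feature dimension.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Plain cosine-similarity matrix: entry (i, j) is cos(image_i, text_j).
# No constant multiplier 100 and no softmax, per the comment above.
similarity = image_features @ text_features.T
```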
Thanks for the swift reply and for providing these pointers!
I'm with you on the part where you say that the implementation in this repo follows the calculation of CLIP similarity from the OpenAI repo and paper.
What I'm highlighting, though, is that this similarity measure does not seem to be what the CLIPScore paper proposes as a metric (see "Section 3: CLIPScore"), because of the added `max` operator (a ReLU on the similarity, if you wish...).
Now, I am not entirely sure what the rationale for the value clipping in the CLIPScore paper is (besides producing non-negative metric values, which could be achieved in non-lossy ways such as a simple scale+shift), but the two references seem incompatible with each other, and that's where my confusion stems from.
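To make the lossiness concrete, here is a tiny illustrative comparison (my own example values, not from either paper): clipping collapses all negative similarities to zero, while an affine scale+shift keeps them distinguishable.

```python
# Clipping with max(x, 0) is lossy; an affine scale+shift maps [-1, 1]
# onto [0, 1] without losing information.
sims = [-0.5, -0.1, 0.0, 0.3]

clipped = [max(s, 0.0) for s in sims]      # -0.5 and -0.1 both collapse to 0
shifted = [(s + 1.0) / 2.0 for s in sims]  # distinct values stay distinct

print(clipped)  # [0.0, 0.0, 0.0, 0.3]
print(shifted)  # [0.25, 0.45, 0.5, 0.65]
```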
Hi! As stated in the CLIPScore paper, the clipping is there for RefCLIPScore: "Across all of these cases, we never observed a negative cosine similarity. But, to be safe, we take a maximum between the cosine similarity and zero because the harmonic mean used to compute RefCLIPScore would be undefined for negative values."
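A small sketch of why that clipping matters for RefCLIPScore's harmonic mean (illustrative values and a hypothetical helper, not code from the paper or this repo):

```python
import math

def harmonic_mean(a, b):
    """Harmonic mean of two values; undefined when a + b == 0."""
    if a + b == 0:
        return float("nan")
    return 2 * a * b / (a + b)

print(harmonic_mean(0.6, 0.4))             # fine for non-negative inputs
print(harmonic_mean(0.5, -0.5))            # denominator is 0: undefined (nan)
print(harmonic_mean(max(-0.5, 0.0), 0.6))  # clipping first keeps it defined
```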
Thank you so much for the explanation, I did not consider the empirical fact when opening this issue. Closing it 🙂
Hi! Is it possible that the CLIP score implementation of this repo is incorrect? The original paper states that the score between two embeddings `c` and `v` should be computed as `CLIP-S(c, v) = w * max(cos(c, v), 0)`, but the `max` operation is missing from the implementation here (we can ignore the `w` factor since it is a constant and only changes the range of values). For reference, the CLIPScore implementation in torchmetrics is consistent with the paper (in their case, `w = 100`). Is this an oversight, or is the implementation here referring to a different definition of the score?
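For concreteness, the paper's definition can be sketched in a few lines (the `w = 2.5` default follows the CLIPScore paper; the function name and example vectors are illustrative, not from any implementation):

```python
import math

def clip_score(c, v, w=2.5):
    """Sketch of CLIP-S(c, v) = w * max(cos(c, v), 0) from Section 3
    of the CLIPScore paper. c and v are embedding vectors."""
    dot = sum(ci * vi for ci, vi in zip(c, v))
    cos = dot / (math.hypot(*c) * math.hypot(*v))
    return w * max(cos, 0.0)

print(clip_score([1.0, 0.0], [1.0, 0.0]))   # 2.5 — identical directions
print(clip_score([1.0, 0.0], [-1.0, 0.0]))  # 0.0 — negative cosine is clipped
```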
Thanks for sharing your work 🙂