Closed mcaccin closed 4 months ago
Thank you for your question! We cite both references, CLIPScore and CLIP, and adhere to the CLIPScore calculation in our paper. Here we use the calculation method from the CLIP repository:
```python
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
```
In our approach we omit the constant multiplier 100 and the softmax function. Hope this helps!
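A minimal sketch of what that simplification looks like (assuming PyTorch tensors; the shapes and variable names here are illustrative, not taken from the repo):

```python
import torch

# Illustrative batch of embeddings (shapes are arbitrary for this sketch).
image_features = torch.randn(4, 512)
text_features = torch.randn(4, 512)

# L2-normalize each embedding along the feature dimension.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Plain cosine-similarity matrix: entry (i, j) is cos(image_i, text_j).
# No constant multiplier 100 and no softmax, per the comment above.
similarity = image_features @ text_features.T
```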
Thanks for the swift reply and for providing these pointers!
I'm with you on the part where you say that the implementation in this repo follows the calculation of CLIP similarity from the OpenAI repo and paper.
What I'm highlighting, though, is that this similarity measure does not seem to be what the CLIPScore paper proposes as a metric (see "Section 3: CLIPScore"), because of the added `max` operator (a ReLU on the similarity, if you wish...).
Now, I am not entirely sure what the rationale for the value clipping in the CLIPScore paper is (besides producing non-negative metric values, which could be achieved in non-lossy ways such as a simple scale+shift), but the two references seem incompatible with each other, and that's where my confusion stems from.
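To make the lossiness concrete, here is a tiny illustrative comparison (my own example values, not from either paper): clipping collapses all negative similarities to zero, while an affine scale+shift keeps them distinguishable.

```python
# Clipping with max(x, 0) is lossy; an affine scale+shift maps [-1, 1]
# onto [0, 1] without losing information.
sims = [-0.5, -0.1, 0.0, 0.3]

clipped = [max(s, 0.0) for s in sims]      # -0.5 and -0.1 both collapse to 0
shifted = [(s + 1.0) / 2.0 for s in sims]  # distinct values stay distinct

print(clipped)  # [0.0, 0.0, 0.0, 0.3]
print(shifted)  # [0.25, 0.45, 0.5, 0.65]
```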
Hi! As stated in the CLIPScore paper, the clipping is there for RefCLIPScore: "Across all of these cases, we never observed a negative cosine similarity. But, to be safe, we take a maximum between the cosine similarity and zero because the harmonic mean used to compute RefCLIPScore would be undefined for negative values."
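A small sketch of why that clipping matters for RefCLIPScore's harmonic mean (illustrative values and a hypothetical helper, not code from the paper or this repo):

```python
import math

def harmonic_mean(a, b):
    """Harmonic mean of two values; undefined when a + b == 0."""
    if a + b == 0:
        return float("nan")
    return 2 * a * b / (a + b)

print(harmonic_mean(0.6, 0.4))             # fine for non-negative inputs
print(harmonic_mean(0.5, -0.5))            # denominator is 0: undefined (nan)
print(harmonic_mean(max(-0.5, 0.0), 0.6))  # clipping first keeps it defined
```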
Thank you so much for the explanation, I did not consider the empirical fact when opening this issue. Closing it 🙂
Hi! Is it possible that the CLIP score implementation of this repo is incorrect? The original paper states that the score between two embeddings `c` and `v` should be computed as `CLIP-S(c, v) = w * max(cos(c, v), 0)`, but the `max` operation is missing from the implementation here (we can ignore the `w` factor since it is a constant and only changes the range of values). For reference, the CLIPScore implementation in torchmetrics is consistent with the paper (in their case, `w = 100`). Is this an oversight, or is the implementation here referring to a different definition of the score?
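For concreteness, the paper's definition can be sketched in a few lines (the `w = 2.5` default follows the CLIPScore paper; the function name and example vectors are illustrative, not from any implementation):

```python
import math

def clip_score(c, v, w=2.5):
    """Sketch of CLIP-S(c, v) = w * max(cos(c, v), 0) from Section 3
    of the CLIPScore paper. c and v are embedding vectors."""
    dot = sum(ci * vi for ci, vi in zip(c, v))
    cos = dot / (math.hypot(*c) * math.hypot(*v))
    return w * max(cos, 0.0)

print(clip_score([1.0, 0.0], [1.0, 0.0]))   # 2.5 — identical directions
print(clip_score([1.0, 0.0], [-1.0, 0.0]))  # 0.0 — negative cosine is clipped
```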
Thanks for sharing your work 🙂