baaivision / EVA

EVA Series: Visual Representation Fantasies from BAAI
MIT License

[EVA-CLIP] Are you sure about your way to compute probabilities? #140

paulgavrikov closed 8 months ago

paulgavrikov commented 9 months ago

Hi,

Thank you so much for these very cool models. In your docs, you compute the probability by: text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).

I am wondering about the 100.0 * multiplier. In the original OpenAI model, this factor comes from model.logit_scale.exp(), which equals 100 (see the forward pass of CLIP). For your models, however, I get very different values.

Obviously, this only changes the confidence, not the accuracy, but could you kindly explain what the correct multiplier is for your models?
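To illustrate the point about confidence versus accuracy, here is a minimal sketch in plain Python (no PyTorch needed). The similarity values and the alternative multiplier of 10 are made up for illustration and do not come from any EVA-CLIP model; the helper names are hypothetical as well.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def text_probs(similarities, scale):
    """Mirror of (scale * image_features @ text_features.T).softmax(dim=-1)
    for a single image, given precomputed cosine similarities."""
    return softmax([scale * s for s in similarities])

# Hypothetical cosine similarities between one image and three text prompts.
sims = [0.31, 0.27, 0.22]

probs_100 = text_probs(sims, 100.0)  # multiplier used in the docs
probs_10 = text_probs(sims, 10.0)    # an arbitrary smaller multiplier

# Index of the most likely prompt under each multiplier.
top_100 = max(range(len(sims)), key=probs_100.__getitem__)
top_10 = max(range(len(sims)), key=probs_10.__getitem__)

print("top-1 index:", top_100, top_10)
print("top-1 confidence:", probs_100[top_100], probs_10[top_10])
```

Multiplying by a positive constant preserves the ordering of the logits, so the predicted class is the same in both cases; only the sharpness of the distribution changes.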

Quan-Sun commented 8 months ago

Hi @paulgavrikov, apologies for the delayed response. We kept the logit_scale fixed during the training of the EVA-CLIP-8B and EVA-CLIP-18B models, so logit_scale.exp() is always 100. Please ignore the logit_scale stored in the models on Hugging Face: it was altered by a mistake in the weight conversion process, but this does not affect the usage of EVA-CLIP models.
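For readers who want to sanity-check a checkpoint against this answer, a rough sketch follows. It assumes the parameter is stored in log space, as in the original CLIP, so that the multiplier is its exponential. The stored value used here is a made-up placeholder, not a value read from any real checkpoint, and the helper names are hypothetical.

```python
import math

# Multiplier stated in the reply above for EVA-CLIP-8B and EVA-CLIP-18B.
EXPECTED_MULTIPLIER = 100.0

def multiplier_from_logit_scale(logit_scale):
    """Assumes logit_scale is stored in log space: multiplier = exp(logit_scale)."""
    return math.exp(logit_scale)

def is_consistent(stored_logit_scale, rel_tol=1e-3):
    """True if the stored parameter corresponds to a multiplier of about 100."""
    return math.isclose(
        multiplier_from_logit_scale(stored_logit_scale),
        EXPECTED_MULTIPLIER,
        rel_tol=rel_tol,
    )

# The log-space value that corresponds to a multiplier of 100.
expected_logit_scale = math.log(EXPECTED_MULTIPLIER)

# Placeholder for a value read from a checkpoint (hypothetical).
stored_logit_scale = 2.659

# Per the reply, fall back to the fixed multiplier when the stored one disagrees.
scale = (
    multiplier_from_logit_scale(stored_logit_scale)
    if is_consistent(stored_logit_scale)
    else EXPECTED_MULTIPLIER
)
print(expected_logit_scale, scale)
```

In practice, hard-coding 100.0 as in the docs snippet has the same effect.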