Question about loss function

baaivision / EVE

[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models

MIT License

238 stars 3 forks source link

Hello! Very cool project! In the paper, I saw that MSE loss is used between EVE and the image encoder. However, in the code, it looks like cosine similarity is used:

https://github.com/baaivision/EVE/blob/b34b2b4f12ddf429137ed94c7d44a93f54ab9d79/eve/model/multimodal_encoder/vision_tokenizer.py#L176-L178

Could you tell me the motivation of using cosine similarity vs MSE? Thanks!

Hey, in the paper, we claim that we calculate the MSE loss between normalized features. Here, the way we implement it is actually equivalent.

baaivision / EVE

Question about loss function #5