baaivision / EVE

[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models
MIT License
238 stars, 3 forks

Question about loss function #5

Status: Closed (Andrew-Zhang closed this issue 4 months ago)

Andrew-Zhang commented 4 months ago

Hello! Very cool project! In the paper, I saw that MSE loss is used between EVE and the image encoder. However, in the code, it looks like cosine similarity is used:

https://github.com/baaivision/EVE/blob/b34b2b4f12ddf429137ed94c7d44a93f54ab9d79/eve/model/multimodal_encoder/vision_tokenizer.py#L176-L178

Could you tell me the motivation of using cosine similarity vs MSE? Thanks!

Paranioar commented 4 months ago

Hey, in the paper we state that the MSE loss is computed between normalized features. The way we implement it here with cosine similarity is actually equivalent.
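For readers who want to check the equivalence claimed in the reply: for L2-normalized vectors a and b of dimension d, ||a - b||^2 = 2 - 2 cos(a, b), so the mean-reduced MSE between normalized features equals (2/d) * (1 - cos(a, b)). The two losses therefore differ only by a constant scale. The sketch below is a minimal numeric check in plain Python, not the EVE code. The helper names and the random `student`/`teacher` vectors are hypothetical, and it assumes the linked lines compute a loss of the form 1 - cosine similarity.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mse_of_normalized(a, b):
    """Mean squared error between the L2-normalized versions of a and b."""
    a_hat, b_hat = l2_normalize(a), l2_normalize(b)
    return sum((x - y) ** 2 for x, y in zip(a_hat, b_hat)) / len(a)


def scaled_cosine_loss(a, b):
    """(2 / d) * (1 - cos(a, b)), the closed form of the MSE above."""
    return 2.0 * (1.0 - cosine_similarity(a, b)) / len(a)


# Hypothetical stand-ins for one EVE feature and one image-encoder feature.
random.seed(0)
dim = 16
student = [random.gauss(0, 1) for _ in range(dim)]
teacher = [random.gauss(0, 1) for _ in range(dim)]

# The two printed values should agree up to floating-point error.
print(mse_of_normalized(student, teacher))
print(scaled_cosine_loss(student, teacher))
```

Because the two quantities differ only by the factor 2/d, they share the same minimizer. The choice between them affects the effective loss weight, not the optimization target.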