Closed Andrew-Zhang closed 4 months ago
Hello! Very cool project! In the paper, I saw that MSE loss is used between EVE and the image encoder. However, in the code, it looks like cosine similarity is used:
https://github.com/baaivision/EVE/blob/b34b2b4f12ddf429137ed94c7d44a93f54ab9d79/eve/model/multimodal_encoder/vision_tokenizer.py#L176-L178
Could you tell me the motivation for using cosine similarity vs. MSE? Thanks!
Hey, in the paper we state that the MSE loss is computed between normalized features. The implementation here is equivalent: for unit-normalized vectors a and b, ||a − b||² = 2 − 2·cos(a, b), so minimizing the MSE of normalized features is the same as maximizing cosine similarity, up to a constant scale and offset.
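A quick numerical sketch of this equivalence (illustrative only, not the EVE code; names and shapes here are made up):

```python
import numpy as np

# For L2-normalized vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b).
# So MSE on normalized features and (1 - cosine similarity) differ only
# by a constant scale and offset, and give the same training signal.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8))
b = rng.standard_normal((4, 8))
a /= np.linalg.norm(a, axis=-1, keepdims=True)  # unit-normalize each row
b /= np.linalg.norm(b, axis=-1, keepdims=True)

sq_dist = ((a - b) ** 2).sum(axis=-1)  # squared L2 distance per row
cos_sim = (a * b).sum(axis=-1)         # cosine similarity of unit vectors

assert np.allclose(sq_dist, 2 - 2 * cos_sim)
```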