bronyayang / Law_of_Vision_Representation_in_MLLMs

Official implementation of the Law of Vision Representation in MLLMs
https://arxiv.org/abs/2408.16357

Questions on Equation 2, A score calculation #4

Closed tangli-udel closed 1 week ago

tangli-udel commented 1 week ago

Hi,

Thank you for the great work!

I'm confused about the calculation of the A score.

The equation is a bit confusing, and we will fix it in the next version. The matrix holds the similarities between all pairs of embedding vectors within the same image, just like an attention matrix. We then apply a max and an average to reduce the matrix to a single A score for that image. Finally, the score is averaged across all images. Basically, this is the same as what you said here. The code for computing the A score is here.
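A minimal sketch of the computation as described above, in NumPy. This is an assumption-laden illustration, not the repo's actual code: cosine similarity, the axis of the max, and the function names are my guesses here.

```python
import numpy as np

def a_score_single_image(clip_emb, target_emb):
    """Illustrative A-score sketch for one image (not the official code).

    clip_emb:   (n_clip, d)   embedding vectors from the CLIP-like model
    target_emb: (n_target, d) embedding vectors from the target representation
    """
    # Normalize rows so the dot product below is cosine similarity (assumed).
    clip_n = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    tgt_n = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    # Similarity between ALL pairs of embeddings, like an attention matrix.
    sim = clip_n @ tgt_n.T                 # shape (n_clip, n_target)
    # Max over one axis (best match per vector), then average.
    return sim.max(axis=1).mean()

def a_score(clip_embs, target_embs):
    # Average the per-image scores across all images in the dataset.
    return float(np.mean([a_score_single_image(c, t)
                          for c, t in zip(clip_embs, target_embs)]))
```

With identical embeddings on both sides, every vector's best match has cosine similarity 1, so the sketch returns 1.0.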

"The matrix holds the similarities between all pairs of embedding vectors within the same image, just like an attention matrix." Do the embedding vectors here mean the embeddings of each image patch? If so, why do we calculate the similarity between all possible pairs of patch embeddings? Wouldn't it make more sense to calculate the similarity only between pairs of patch embeddings at the same position?

bronyayang commented 1 week ago

Hi, the target vision representation may come from a UNet or a transformer architecture. Therefore, we cannot assume that the target representation's embeddings carry the same meaning as CLIP "patch" embeddings, since the target may not even have patches. In other words, we cannot assume the semantics appear in the same order in every vision representation. Hope this helps.
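One way to see why all-pairs similarity sidesteps the ordering problem: taking the max over every pair makes the score invariant to any permutation of the target embeddings. A small self-contained check (cosine similarity and the max-then-mean reduction are assumptions carried over from the discussion above):

```python
import numpy as np

def pairwise_max_mean_score(a, b):
    # Cosine-similarity matrix over all pairs, then max per row, then mean
    # (same illustrative reduction as discussed in this thread).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1).mean()

rng = np.random.default_rng(0)
clip = rng.normal(size=(6, 8))     # stand-in CLIP embeddings
target = rng.normal(size=(6, 8))   # stand-in target-representation embeddings

# Shuffling the target embeddings (e.g. a representation that orders
# semantics differently) leaves the all-pairs score unchanged,
# whereas a position-by-position comparison would change.
perm = rng.permutation(len(target))
assert np.isclose(pairwise_max_mean_score(clip, target),
                  pairwise_max_mean_score(clip, target[perm]))
```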