Closed: tangli-udel closed this issue 1 week ago
Hi,

Thank you for the great work!

I'm confused about the calculation of the A score.

"The matrix is the result of all pairs of embedding vectors within the same image, just like an attention matrix." Do the embedding vectors here refer to the embeddings of each image patch? If so, why do we calculate the similarity between all possible pairs of patch embeddings? Wouldn't it make more sense to calculate the similarity only between pairs of patch embeddings at the same position?

Hi, the target vision representation may come from a UNet or a transformer architecture, so we cannot assume that a target vision representation embedding carries the exact same meaning as a CLIP "patch" embedding; it may not even have patches. In other words, we cannot assume the semantics appear in the same order in every vision representation. Hope this helps.
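For illustration, here is a minimal sketch of what an all-pairs similarity matrix over one image's embeddings could look like. The function name, the use of cosine similarity, and the toy shapes are my assumptions, not the paper's exact implementation; the point is only that the matrix compares every token with every other token, so it does not depend on the two representations sharing a patch layout or token ordering.

```python
import numpy as np

def pairwise_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """All-pairs cosine similarity between the embedding vectors of one image.

    embeddings: (n_tokens, dim). The tokens need not be CLIP-style patches;
    they can be any per-location features (e.g. from a UNet or a transformer).
    Returns an (n_tokens, n_tokens) matrix, analogous to an attention matrix.
    """
    # Normalize each row so the dot product below is cosine similarity.
    # (Function name and normalization choice are illustrative assumptions.)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

# Hypothetical example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
A = pairwise_similarity_matrix(rng.normal(size=(4, 8)))
print(A.shape)  # (4, 4): one similarity value per token pair
```

Note that the matrix is symmetric with ones on its diagonal (each token is maximally similar to itself), which is why comparing two such matrices captures the internal similarity structure of a representation rather than a position-by-position match.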