Closed: tangli-udel closed this issue 1 week ago
Hi,

Thank you for the great work!

I'm confused about the calculation of the A score.

"The matrix is the result of all pairs of embedding vectors within the same image, just like an attention matrix." Do the embedding vectors here refer to the embeddings of each image patch? If so, why do we calculate the similarity between all possible pairs of patch embeddings? Wouldn't it make more sense to calculate the similarity only between pairs of patch embeddings at the same position?

Hi, the target vision representation may come from a UNet or a transformer architecture, so we cannot assume that a target vision representation embedding carries the exact same meaning as a CLIP "patch" embedding; it may not even have patches. In other words, we cannot assume the semantics appear in the same order in every vision representation. Hope this helps.
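For illustration, here is a minimal sketch of what an all-pairs similarity matrix over one image's embeddings could look like. The function name, the use of cosine similarity, and the toy shapes are my assumptions, not the paper's exact implementation; the point is only that the matrix compares every token with every other token, so it does not depend on the two representations sharing a patch layout or token ordering.

```python
import numpy as np

def pairwise_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """All-pairs cosine similarity between the embedding vectors of one image.

    embeddings: (n_tokens, dim). The tokens need not be CLIP-style patches;
    they can be any per-location features (e.g. from a UNet or a transformer).
    Returns an (n_tokens, n_tokens) matrix, analogous to an attention matrix.
    """
    # Normalize each row so the dot product below is cosine similarity.
    # (Function name and normalization choice are illustrative assumptions.)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

# Hypothetical example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
A = pairwise_similarity_matrix(rng.normal(size=(4, 8)))
print(A.shape)  # (4, 4): one similarity value per token pair
```

Note that the matrix is symmetric with ones on its diagonal (each token is maximally similar to itself), which is why comparing two such matrices captures the internal similarity structure of a representation rather than a position-by-position match.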