LinWeizheDragon / FLMR

The Hugging Face implementation of the Fine-grained Late-interaction Multi-modal Retriever.
71 stars 4 forks

Something about paper #34

Closed lzhptr closed 1 month ago

lzhptr commented 1 month ago

Does PreFLMR use all the tokens in a document when calculating similarity? Thank you!

LinWeizheDragon commented 1 month ago

When directly computing the exact similarity between a query and a document (e.g. during training), all tokens participate in the computation. The similarity is sum(MaxSim(q, d)), where q and d are the late-interaction embeddings of the query and the document; see the model's forward function.

In retrieval, the engine performs an approximate search: for every query token it looks up the most similar token embedding in the corpus and sums those similarities, so only the retrieved document token embeddings contribute to the returned score. Of course, you can recover the exact similarity score by re-scoring against the original document embeddings after index retrieval.
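The exact scoring described above can be sketched in a few lines of NumPy (a toy illustration of sum(MaxSim(q, d)), not the repository's actual forward implementation; the embedding sizes here are made up):

```python
import numpy as np

def late_interaction_score(q, d):
    """Exact late-interaction similarity sum(MaxSim(q, d)).

    q: (num_query_tokens, dim) query token embeddings
    d: (num_doc_tokens, dim)   document token embeddings
    All document tokens participate: each query token takes the maximum
    dot product over every document token, and these maxima are summed.
    """
    sim = q @ d.T                 # (num_query_tokens, num_doc_tokens)
    return sim.max(axis=1).sum()  # MaxSim per query token, then sum

# Deterministic toy example (2-d embeddings for readability).
q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
d = np.array([[2.0, 0.0],
              [0.0, 3.0],
              [1.0, 1.0]])
score = late_interaction_score(q, d)  # → 5.0 (2 from row 0, 3 from row 1)
```

In the approximate-retrieval case, `d` would instead hold only the token embeddings returned by the index for each query token, which is why the approximate score can differ from the exact one.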

lzhptr commented 1 month ago

> When directly computing the exact similarity between query and doc (e.g. in training), all the tokens participate in computation. The similarity is sum(MaxSim(q, d)) where q,d are late interaction embeddings of query and doc. See the model's forward function. In retrieval, the engine does an approximate search. For every query token, it searches the most similar token embedding in the corpus and sums up the similarity. So only the retrieved doc token embeddings are incorporated in the returned similarity score. Of course, you can retrieve the exact similarity score using the original doc embeddings after index retrieval.

Thank you very much! How many tokens does the PreFLMR model use for text queries, images, and documents? Thanks!

lzhptr commented 1 month ago

> When directly computing the exact similarity between query and doc (e.g. in training), all the tokens participate in computation. The similarity is sum(MaxSim(q, d)) where q,d are late interaction embeddings of query and doc. See the model's forward function. In retrieval, the engine does an approximate search. For every query token, it searches the most similar token embedding in the corpus and sums up the similarity. So only the retrieved doc token embeddings are incorporated in the returned similarity score. Of course, you can retrieve the exact similarity score using the original doc embeddings after index retrieval.

[image attachment] — the token numbers of 1, 2, 3 and 4

LinWeizheDragon commented 1 month ago

32, 32, 32*num_patch_embeddings (depending on the vision encoder), 512

lzhptr commented 1 month ago

> 32, 32, 32*num_patch_embeddings (depending on the vision encoder), 512

Thank you!