LinWeizheDragon / FLMR

The Hugging Face implementation of the Fine-grained Late-interaction Multi-modal Retriever.

Question about image processing in customized multi-modal documents #11

Closed: yuejunpeng closed this issue 1 month ago

yuejunpeng commented 2 months ago

In FLMR's paper, the documents are plain text without images, but the code in this repository supports customized multi-modal documents. Could you please tell me how the image is integrated with the text in a customized multi-modal document? Is it handled like the query, where the image is processed in two ways and then concatenated with the text?

LinWeizheDragon commented 2 months ago

We empirically found that even though the model is not trained on the I+T -> I+T setting, it can still retrieve multi-modal documents if the visual encoder is reused on the document side. This is reported in the appendix of the FLMR paper. The reason is that if the query image is similar to the document image, their embeddings are close and aligned in the latent space. As a result, when you query the index with the query embeddings, they interact with the document image's embeddings in the index.
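
For concreteness, here is a minimal PyTorch sketch of this idea: the document side reuses the same vision encoder and mapping network as the query side, the mapped visual tokens are concatenated with the text token embeddings, and retrieval is scored with ColBERT-style late interaction (MaxSim). All names here (`MappingNetwork`, `encode_multimodal`, `maxsim_score`) are illustrative assumptions for the sketch, not the actual classes in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingNetwork(nn.Module):
    # Illustrative stand-in: projects a pooled vision-encoder feature into a
    # fixed number of "visual tokens" in the late-interaction space (dim d).
    def __init__(self, vision_dim: int, d: int, num_visual_tokens: int):
        super().__init__()
        self.num_visual_tokens = num_visual_tokens
        self.d = d
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, d * num_visual_tokens),
            nn.GELU(),
            nn.Linear(d * num_visual_tokens, d * num_visual_tokens),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, vision_dim) -> (batch, num_visual_tokens, d)
        return self.proj(image_feats).view(-1, self.num_visual_tokens, self.d)

def encode_multimodal(text_token_embs: torch.Tensor,
                      image_feats: torch.Tensor,
                      mapping_network: MappingNetwork) -> torch.Tensor:
    # Concatenate text token embeddings with mapped visual tokens.
    # Reusing the SAME mapping_network for queries and documents is what
    # keeps similar query/document images aligned in the latent space.
    visual_tokens = mapping_network(image_feats)
    embs = torch.cat([text_token_embs, visual_tokens], dim=1)
    return F.normalize(embs, dim=-1)  # unit-norm tokens for MaxSim

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    # ColBERT-style late interaction: each query token takes its max
    # similarity over all document tokens (text AND visual), then sum.
    sim = query_embs @ doc_embs.transpose(-1, -2)  # (batch, n_q, n_d)
    return sim.max(dim=-1).values.sum(dim=-1)      # (batch,)

# Toy usage with random tensors in place of real encoder outputs.
d, vision_dim = 128, 768
mapper = MappingNetwork(vision_dim, d, num_visual_tokens=32)
q = encode_multimodal(torch.randn(1, 20, d), torch.randn(1, vision_dim), mapper)
doc = encode_multimodal(torch.randn(1, 50, d), torch.randn(1, vision_dim), mapper)
print(maxsim_score(q, doc))
```

With this setup, the query's visual tokens can MaxSim-match the document's visual tokens directly, which is why retrieval in the I+T -> I+T setting can work even without training on it, as long as the image embeddings on both sides are aligned.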