LinWeizheDragon / FLMR

The Hugging Face implementation of the Fine-grained Late-interaction Multi-modal Retriever.

Question about image processing in customized multi-modal documents #11

Closed: yuejunpeng closed this issue 1 month ago

yuejunpeng commented 2 months ago

In FLMR's paper, the documents are plain text without images, but the code in this repository supports customized multi-modal documents. Could you please tell me how the image is integrated with the text in a customized multi-modal document? Is it handled like the query, where the image is processed in two ways and then concatenated with the text?

LinWeizheDragon commented 2 months ago

We empirically found that even though the model is not trained on the I+T -> I+T setting, it can still retrieve multi-modal documents if the visual encoder is reused on the document side. This is reported in the appendix of the FLMR paper. The reason is that if the query image is similar to the document image, their embeddings are close and aligned in the latent space. As a result, when you query the index with the query embeddings, they interact with the document image's embeddings in the index.
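
For concreteness, here is a minimal PyTorch sketch of this idea: the document side reuses the same vision encoder and mapping network as the query side, the mapped visual tokens are concatenated with the text token embeddings, and retrieval is scored with ColBERT-style late interaction (MaxSim). All names here (`MappingNetwork`, `encode_multimodal`, `maxsim_score`) are illustrative assumptions for the sketch, not the actual classes in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingNetwork(nn.Module):
    # Illustrative stand-in: projects a pooled vision-encoder feature into a
    # fixed number of "visual tokens" in the late-interaction space (dim d).
    def __init__(self, vision_dim: int, d: int, num_visual_tokens: int):
        super().__init__()
        self.num_visual_tokens = num_visual_tokens
        self.d = d
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, d * num_visual_tokens),
            nn.GELU(),
            nn.Linear(d * num_visual_tokens, d * num_visual_tokens),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, vision_dim) -> (batch, num_visual_tokens, d)
        return self.proj(image_feats).view(-1, self.num_visual_tokens, self.d)

def encode_multimodal(text_token_embs: torch.Tensor,
                      image_feats: torch.Tensor,
                      mapping_network: MappingNetwork) -> torch.Tensor:
    # Concatenate text token embeddings with mapped visual tokens.
    # Reusing the SAME mapping_network for queries and documents is what
    # keeps similar query/document images aligned in the latent space.
    visual_tokens = mapping_network(image_feats)
    embs = torch.cat([text_token_embs, visual_tokens], dim=1)
    return F.normalize(embs, dim=-1)  # unit-norm tokens for MaxSim

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    # ColBERT-style late interaction: each query token takes its max
    # similarity over all document tokens (text AND visual), then sum.
    sim = query_embs @ doc_embs.transpose(-1, -2)  # (batch, n_q, n_d)
    return sim.max(dim=-1).values.sum(dim=-1)      # (batch,)

# Toy usage with random tensors in place of real encoder outputs.
d, vision_dim = 128, 768
mapper = MappingNetwork(vision_dim, d, num_visual_tokens=32)
q = encode_multimodal(torch.randn(1, 20, d), torch.randn(1, vision_dim), mapper)
doc = encode_multimodal(torch.randn(1, 50, d), torch.randn(1, vision_dim), mapper)
print(maxsim_score(q, doc))
```

With this setup, the query's visual tokens can MaxSim-match the document's visual tokens directly, which is why retrieval in the I+T -> I+T setting can work even without training on it, as long as the image embeddings on both sides are aligned.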