We empirically found that even if the model is not trained on I+T->I+T, it can still retrieve multi-modal documents as long as the visual encoder is reused on the document side. This was reported in the appendix of the FLMR paper. The reason is that if the query image is similar to the document image, their embeddings are close and aligned in the latent space. As a result, when you query the index with the query embeddings, they interact with the document image's embeddings stored in the index.
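To make the mechanism concrete, here is a minimal sketch (not the repository's actual code) of late-interaction scoring with a shared vision encoder on both the query and document side; `encode_side`, `maxsim_score`, and the encoder/mapping modules passed in are placeholder names, not identifiers from FLMR:

```python
import torch
import torch.nn.functional as F

def encode_side(text_tokens, image, text_encoder, vision_encoder, mapping):
    """Concatenate text token embeddings with mapped visual embeddings.
    The same `vision_encoder` is reused on the query and document sides,
    so similar images land close together in the shared latent space."""
    txt = text_encoder(text_tokens)          # (num_text_tokens, dim)
    vis = mapping(vision_encoder(image))     # (num_visual_tokens, dim)
    emb = torch.cat([txt, vis], dim=0)       # (num_total_tokens, dim)
    return F.normalize(emb, dim=-1)

def maxsim_score(query_emb, doc_emb):
    """Late-interaction relevance: each query token takes its best match
    among the document tokens. A query image similar to the document image
    contributes high visual-to-visual matches even without I+T->I+T training."""
    sim = query_emb @ doc_emb.T              # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()
```

Because the document's visual tokens sit in the index alongside its text tokens, the query image's tokens can pick them out through the MaxSim step above.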
In the FLMR paper, the documents are plain text without images, but in the code of this repository, customized documents can be multi-modal. Could you please explain how the image is integrated with the text in the actual implementation when a customized document is multi-modal? Is it similar to the query side, where the image is processed in two ways and then concatenated with the text (see the sketch below)?
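For reference, here is a rough sketch of what I mean by "processed in two ways and then concatenated with the text"; all names below are my own placeholders, not identifiers from this repository:

```python
import torch

def encode_multimodal_doc(text_emb, image_feat, mapping_mlp, mapping_transformer):
    """My guess at the document side: map the image features along two paths,
    then append both sets of visual tokens to the text token embeddings."""
    vis_a = mapping_mlp(image_feat)          # path 1, e.g. an MLP mapping network
    vis_b = mapping_transformer(image_feat)  # path 2, e.g. a transformer mapping block
    return torch.cat([text_emb, vis_a, vis_b], dim=0)
```

Is the document side implemented roughly like this, or does it differ from the query side?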