FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

How to fuse the image embedding and text embeddings in Visualized-BGE? #639

Closed · darkpromise98 · closed 5 months ago

darkpromise98 commented 5 months ago

Nice work on Visualized-BGE!

I'm curious about how the embedding fusion for multi-modal embeddings is implemented.

We can get an image embedding from EVA-CLIP and a text embedding from BGE-M3; how do we then obtain the fused embedding for hybrid-modal retrieval tasks?

JUNJIE99 commented 5 months ago


Thank you for your attention!

EVA-CLIP generates image token embeddings for the image patches, and these are fed, together with the text token embeddings, into the BGE-M3 model. The text token embeddings here are the ones produced by BERT's token-embedding layer, i.e. they precede the transformer encoder layers.
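A minimal sketch of that fusion, assuming a HuggingFace-style BERT backbone and an assumed linear projection from the vision space into the text-embedding space (illustrative names only, not the actual FlagEmbedding code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridEncoder(nn.Module):
    """Sketch: fuse EVA-CLIP patch tokens with BERT token embeddings."""

    def __init__(self, vision_encoder, text_model, d_vis, d_txt):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. an EVA-CLIP ViT returning patch tokens
        self.text_model = text_model              # HuggingFace-style BERT encoder (BGE backbone)
        self.proj = nn.Linear(d_vis, d_txt)       # assumed projection into the text space

    def forward(self, pixel_values, input_ids):
        # One embedding per image patch, projected into the text-embedding space.
        img_tokens = self.proj(self.vision_encoder(pixel_values))           # (B, P, d_txt)
        # Text token embeddings from BERT's embedding layer, before any encoder layer.
        txt_tokens = self.text_model.embeddings.word_embeddings(input_ids)  # (B, T, d_txt)
        # Concatenate the two sequences and run the full transformer encoder over them.
        # (Position/type embeddings and attention masks are omitted for brevity.)
        fused = torch.cat([img_tokens, txt_tokens], dim=1)                  # (B, P+T, d_txt)
        hidden = self.text_model.encoder(fused)[0]
        # CLS-style pooling plus L2 normalization for retrieval (pooling is an assumption).
        return F.normalize(hidden[:, 0], dim=-1)
```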

darkpromise98 commented 5 months ago


Thanks for your reply. I'm looking forward to the technical report for this work. Do you plan to publish a paper in the future?

JUNJIE99 commented 5 months ago


Yes, we plan to publish it within a month.