Closed darkpromise98 closed 5 months ago
Nice work on Visualized-BGE!
I'm curious about how the embedding fusion for multi-modal embeddings is implemented.
We can get image embeddings from EVA-CLIP and text embeddings from BGE-M3, but how do we obtain the fused embedding for hybrid-modal retrieval tasks?
Thank you for your attention!
EVA-CLIP generates image token embeddings for the image patches, which are then fed into the BGE-M3 model together with the text token embeddings. The text token embeddings here refer to the embeddings produced by BERT's token embedding layer, which precedes the transformer encoder layers.
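For readers wondering what this fusion looks like in code, here is a minimal PyTorch sketch of the idea described above. This is not the actual Visualized-BGE implementation; all module names, dimensions, and the mean-pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Hypothetical sketch: project image patch embeddings (e.g. from a
    vision encoder like EVA-CLIP) to the text model's hidden size, prepend
    them to the text token embeddings, and encode the joint sequence."""

    def __init__(self, img_dim=1024, hidden=768, vocab=30522, layers=2):
        super().__init__()
        self.proj = nn.Linear(img_dim, hidden)      # map vision dim -> text hidden dim
        self.tok_emb = nn.Embedding(vocab, hidden)  # stands in for BERT's token embedding layer
        enc_layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, img_patch_emb, text_ids):
        img = self.proj(img_patch_emb)        # (B, P, hidden)
        txt = self.tok_emb(text_ids)          # (B, T, hidden)
        seq = torch.cat([img, txt], dim=1)    # concatenate along the sequence axis
        out = self.encoder(seq)               # joint encoding of image + text tokens
        return out.mean(dim=1)                # one pooled hybrid embedding per sample

model = FusionSketch()
img = torch.randn(2, 16, 1024)                # 16 patch embeddings per image (assumed)
ids = torch.randint(0, 30522, (2, 8))         # 8 text token ids per sample
emb = model(img, ids)
print(emb.shape)  # torch.Size([2, 768])
```

The key point the answer makes is that fusion happens before the transformer encoder: the image tokens enter at the same level as the output of the token embedding layer, so the encoder attends over both modalities jointly.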
Thanks for your reply. I'm looking forward to the technical report for this work. Do you plan to publish a paper in the future?
Yes, we plan to publish it within a month.