THUDM / CogVLM2

GPT4V-level open-source multi-modal model based on Llama3-8B
Apache License 2.0

Use learned image-text embedding #5


nivibilla commented 1 month ago

Feature request

Hi, is it possible to use the image embedding separately to do image retrieval based on a text query?

Motivation

I want to do RAG on images.

Your contribution

Not sure if it's possible yet.

zRzRzRzRzRzRzR commented 1 month ago

Sure, but retrieval is the hard part, and a vector database is not supported out of the box (if you want one in the RAG pipeline, you will have to wire it up yourself). You can refer to the solution provided by langchain: use a multimodal embedding model to embed images and text into the same space, run similarity search for retrieval while storing only links to the images in the document library, then pass the retrieved original images and text blocks to a multimodal LLM for synthesis.
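For reference, a minimal sketch of that flow without a vector database, using a CLIP-style dual encoder from `transformers` and an in-memory cosine-similarity search. The checkpoint name and file paths are illustrative assumptions, and this is not part of CogVLM2's own API:

```python
# Minimal sketch: embed images and a text query with a dual encoder,
# then rank images by cosine similarity (no vector database needed).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["doc_page_1.png", "doc_page_2.png"]  # hypothetical document images
images = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    text_inputs = processor(text=["a chart of quarterly revenue"],
                            return_tensors="pt", padding=True)
    query_emb = model.get_text_features(**text_inputs)

# Cosine similarity between the query and every image embedding.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)
scores = (query_emb @ image_emb.T).squeeze(0)
best = scores.argmax().item()
print(f"Best match: {image_paths[best]} (score {scores[best]:.3f})")
# The retrieved original image can then be passed to a multimodal LLM
# such as CogVLM2 for the synthesis step.
```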

nivibilla commented 1 month ago

I'm not really concerned with live retrieval; I will basically do it all offline in batch. Thanks for the link, I will have a look. Just curious to see how it would do compared to using SigLIP, for example. Is it possible to do this without langchain, btw?
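For what it's worth, here is a batch/offline sketch using SigLIP directly through `transformers`, with no langchain involved. The checkpoint name, file paths, and the `.npy` cache are illustrative assumptions:

```python
# Offline batch sketch with SigLIP: embed all images once, cache to disk,
# then embed queries later and search against the cached matrix.
import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-base-patch16-224"  # illustrative SigLIP checkpoint
model = AutoModel.from_pretrained(ckpt).eval()
processor = AutoProcessor.from_pretrained(ckpt)

def embed_images(paths, batch_size=16):
    chunks = []
    for i in range(0, len(paths), batch_size):
        imgs = [Image.open(p).convert("RGB") for p in paths[i:i + batch_size]]
        inputs = processor(images=imgs, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        chunks.append(torch.nn.functional.normalize(feats, dim=-1))
    return torch.cat(chunks).cpu().numpy()

# Build the index once, offline.
paths = ["img_0001.png", "img_0002.png"]  # hypothetical image corpus
np.save("image_embeddings.npy", embed_images(paths))

# Later: embed a query and rank images by dot product (embeddings are normalized).
index = np.load("image_embeddings.npy")
text_inputs = processor(text=["diagram of the model architecture"],
                        padding="max_length", return_tensors="pt")
with torch.no_grad():
    q = torch.nn.functional.normalize(model.get_text_features(**text_inputs), dim=-1)
print(paths[int((q.cpu().numpy() @ index.T).argmax())])
```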

zhengxingmao commented 1 month ago

I would like to follow up on this question. Which interfaces should be called to obtain the embedding values for the input image and text?
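The released demos do not expose a dedicated embedding interface, so the following is only a guess at how one might reuse the checkpoint's own vision tower as an image encoder. The `model.model.vision` attribute, its call signature, and the preprocessing values are assumptions about the `trust_remote_code` modeling file, not a documented API; check the repo's demo code for the actual module names and image preprocessing:

```python
# Hedged sketch: probe CogVLM2's vision tower for image features.
# `model.model.vision` and its forward signature are assumptions, not a
# documented interface of the CogVLM2 repo.
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForCausalLM

device = "cuda"  # the 19B checkpoint is impractical to run on CPU
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm2-llama3-chat-19B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(device).eval()

# Preprocessing guessed from CLIP-style pipelines; the exact image size and
# normalization should be taken from the repo's demo code.
preprocess = transforms.Compose([
    transforms.Resize((1344, 1344)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

image = Image.open("example.png").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0).to(device, torch.bfloat16)

with torch.no_grad():
    # Assumption: the vision tower returns per-token image features already
    # projected into the language model's hidden size.
    image_features = model.model.vision(pixel_values)

# Mean-pool the token features into a single vector for retrieval experiments.
image_embedding = image_features.mean(dim=1)
print(image_embedding.shape)
```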