Open nivibilla opened 6 months ago
Sure, but retrieval is hard, and vector databases are not supported (if you are using one as part of the RAG process). You can refer to the solution provided by LangChain: use multimodal embeddings to embed images and text into the same space, use similarity search for retrieval but store only links to the images in the document library, then pass the original images and text chunks to a multimodal LLM for synthesis.
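The flow described above can be sketched without any framework. A minimal illustration, assuming you already have a multimodal embedder that maps images and text into a shared space (faked here with fixed vectors), and an index that holds image paths rather than raw images:

```python
import numpy as np

# Toy stand-in for a multimodal embedder (e.g. a CLIP/SigLIP-style model);
# in practice images and text are mapped into the same vector space.
FAKE_EMBEDDINGS = {
    "cat.jpg": np.array([0.9, 0.1, 0.0]),
    "invoice.png": np.array([0.0, 0.2, 0.9]),
    "a photo of a cat": np.array([0.8, 0.2, 0.1]),
}

def embed(item):
    return FAKE_EMBEDDINGS[item]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The index stores only image paths plus their embeddings, not the images.
index = {path: embed(path) for path in ["cat.jpg", "invoice.png"]}

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda p: cosine(q, index[p]), reverse=True)
    return ranked[:k]  # the linked images would go to a multimodal LLM next

print(retrieve("a photo of a cat"))  # → ['cat.jpg']
```

The key design point is that similarity search only returns links; the original images are fetched from disk at synthesis time.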
I'm not really concerned with live retrieval; I will basically do it all offline in batch. Thanks for the link, I will have a look. Just curious to see how it would do compared to using SigLIP, for example. Is it possible to do this without LangChain, by the way?
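For the offline batch case, top-k search over precomputed embeddings needs neither LangChain nor a vector database. A sketch assuming the embeddings are already L2-normalised NumPy arrays (the corpus size, dimensionality, and the random stand-in vectors are illustrative, not from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these came from a SigLIP-style image tower, precomputed offline.
image_emb = rng.normal(size=(1000, 512))
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)

# A batch of query embeddings from the matching text tower; here they are
# slightly perturbed copies of images 3 and 42 so the answer is known.
query_emb = image_emb[[3, 42]] + rng.normal(scale=0.01, size=(2, 512))
query_emb /= np.linalg.norm(query_emb, axis=1, keepdims=True)

def batch_top_k(queries, corpus, k=5):
    scores = queries @ corpus.T              # cosine similarity (unit vectors)
    return np.argsort(-scores, axis=1)[:, :k]  # best-matching image indices

hits = batch_top_k(query_emb, image_emb)
print(hits[:, 0])  # the two planted matches should rank first
```

Since everything is batched, a single matrix multiply scores all queries against all images at once, which is usually fast enough offline without an ANN index.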
I would like to follow up on this question. Which interfaces should be called to obtain the embedding values for an input image and text?
Feature request
Hi, is it possible to use the image embedding separately to do image retrieval based on a query?
Motivation
Want to do RAG on images.
Your contribution
Not sure if it's possible yet.