Jhonnyr97 opened this issue 7 months ago
Hi @Jhonnyr97, the multimodal Cat is planned. If you're able to help us with the development, you're welcome!
Okay, where can I find the documentation for multimodal support?
For the time being I am trying to jot down a list of links; as soon as I have discussed it with the other core devs I will share it in this issue. Meanwhile, you can check whether LangChain's multimodal support covers the model you are interested in, and have a look at these wonderful plugins: artistic_cat and WhisperingCat.
@pieroit you can assign this issue to me.
Multimodality flow by LlamaIndex
@nickprock we can set up an image embedder module like the text embedder we already have.
It's not clear to me yet how to cross-index texts and images.
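As a sketch of what such a module could look like: a hypothetical `ImageEmbedder` interface mirroring the existing text embedder, with a deterministic dummy implementation standing in for a real model. In practice a CLIP-style model (e.g. sentence-transformers' `clip-ViT-B-32`) would fill this role; all class and method names below are illustrative, not the Cat's actual API.

```python
# Hypothetical image embedder interface, mirroring the text embedder.
# Names are illustrative only; a real implementation would wrap CLIP.
import hashlib
from abc import ABC, abstractmethod
from typing import List


class ImageEmbedder(ABC):
    """Hypothetical counterpart to the existing text embedder."""

    @abstractmethod
    def embed_images(self, images: List[bytes]) -> List[List[float]]:
        """Map raw image bytes to fixed-size float vectors."""


class DummyImageEmbedder(ImageEmbedder):
    """Deterministic placeholder: hashes bytes into a fixed-size vector."""

    def __init__(self, size: int = 512):
        self.size = size

    def embed_images(self, images: List[bytes]) -> List[List[float]]:
        vectors = []
        for img in images:
            digest = hashlib.sha256(img).digest()
            # Repeat the 32-byte digest to fill the vector, scale to [0, 1).
            raw = (digest * (self.size // len(digest) + 1))[: self.size]
            vectors.append([b / 256 for b in raw])
        return vectors


embedder = DummyImageEmbedder()
vecs = embedder.embed_images([b"fake-image-bytes"])
print(len(vecs[0]))  # 512
```

The point of the sketch is only the shape of the interface: same call pattern as the text embedder, so the vector memory layer can treat both modalities uniformly.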
@pieroit the image is a placeholder for me 😅 I promise I will come to the multimodality meeting having studied the problem.
Here it seems they embed with two separate models (CLIP and Ada) into two different collections, and then retrieve from each using the doubly embedded query, right?
Yes, I must check the Qdrant docs for multimodal storage and retrieval.
@nicola-corbellini as discussed in dev meeting I paste here some links as placeholder:
https://docs.llamaindex.ai/en/stable/examples/multi_modal/gpt4v_multi_modal_retrieval/
https://medium.aiplanet.com/multimodal-rag-using-llamaindex-gemini-and-qdrant-f52c5b68b367
https://qdrant.tech/documentation/examples/aleph-alpha-search/
Thank you, I'll try to take a look in the next few days. Personally, I would go for supporting native multimodal models and leave hybrid/workaround solutions to plugins. Considering what @Pingdred said about the support in LangChain, we are probably too early.
Is your feature request related to a problem? Please describe.
I'm frustrated when I can't use multimodal models like "gpt-4-vision-preview" in Cheshire-cat-ai to process and retrieve information from images via the API. Additionally, the current vector database should support image retrieval.
Describe the solution you'd like
I would like to see support for multimodal models, specifically the "gpt-4-vision-preview" model, integrated into Cheshire-cat-ai. This integration should allow users to send images via the Cheshire-cat-ai API and receive responses or results based on both text and images.
Furthermore, I'd like to utilize the existing vector database to enable Cheshire-cat-ai to perform retrieval with images. This means users should be able to search for information within the database using both text and images as search keys.
This feature would significantly enhance Cheshire-cat-ai's capabilities, enabling better understanding and generation of multimodal content. It's particularly valuable in scenarios where information is presented in both text and image formats.
Describe alternatives you've considered
I've considered alternative solutions, but integrating multimodal models and image retrieval directly into Cheshire-cat-ai seems to be the most straightforward and effective approach. Other alternatives may require external tools or complex workarounds.
Additional context
No additional context at this time, but this feature would greatly enhance Cheshire-cat-ai's versatility and utility.