cheshire-cat-ai / core

Production-ready AI agent framework
https://cheshirecat.ai
GNU General Public License v3.0

Support for multimodal models #564

Open Jhonnyr97 opened 7 months ago

Jhonnyr97 commented 7 months ago

Is your feature request related to a problem? Please describe. I'm frustrated when I can't use multimodal models like "gpt-4-vision-preview" in Cheshire-cat-ai to process and retrieve information from images via the API. Additionally, the current vector database should support image retrieval.

Describe the solution you'd like I would like to see support for multimodal models, specifically the "gpt-4-vision-preview" model, integrated into Cheshire-cat-ai. This integration should allow users to send images via the Cheshire-cat-ai API and receive responses or results based on both text and images.
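As a rough illustration of the kind of call involved (this is the plain OpenAI Python client, not an existing Cheshire-cat-ai endpoint; the prompt and image URL are placeholders):

```python
# Sketch of a gpt-4-vision-preview request mixing text and an image.
# Not a Cheshire-cat-ai API; prompt and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```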

Furthermore, I'd like to utilize the existing vector database to enable Cheshire-cat-ai to perform retrieval with images. This means users should be able to search for information within the database using both text and images as search keys.

This feature would significantly enhance Cheshire-cat-ai's capabilities, enabling better understanding and generation of multimodal content. It's particularly valuable in scenarios where information is presented in both text and image formats.

Describe alternatives you've considered I've considered alternative solutions, but integrating multimodal models and image retrieval directly into Cheshire-cat-ai seems to be the most straightforward and effective approach. Other alternatives may require external tools or complex workarounds.

Additional context No additional context at this time, but this feature would greatly enhance Cheshire-cat-ai's versatility and utility.

nickprock commented 7 months ago

Hi @Jhonnyr97, the multimodal cat is planned. If you're able to help us with the development, you're welcome!

Jhonnyr97 commented 7 months ago

Okay, where can I find the documentation for the multimodal work?

nickprock commented 7 months ago

For the time being I am trying to put together a list of links; as soon as I have discussed it with the other core devs I will share it in this issue. Meanwhile you can check whether LangChain's multimodal support lets you use the model you are interested in, and have a look at these wonderful plugins: artistic_cat and WhisperingCat.
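As a quick example, LangChain's multimodal message format with that model looks roughly like this (a sketch only; the prompt and image URL are placeholders):

```python
# Sketch of a multimodal LangChain message sent to gpt-4-vision-preview.
# Prompt and image URL are placeholders.
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=256)

message = HumanMessage(
    content=[
        {"type": "text", "text": "What is shown in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    ]
)

response = llm.invoke([message])
print(response.content)
```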

@pieroit you can assign this issue to me.

nickprock commented 7 months ago

[Image: Multimodality flow by LlamaIndex]

pieroit commented 7 months ago

@nickprock we can set up an image embedder module like the text embedder we already have

It is not yet clear to me how to cross-index texts and images
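One way to cross-index them is a joint text/image model such as CLIP, which embeds both modalities into the same vector space, so a text query can retrieve images directly. A rough sketch with sentence-transformers (the model name and file path are just examples):

```python
# Sketch of a CLIP-style embedder that puts texts and images in one vector space.
# Model name and image path are illustrative choices, not project decisions.
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")

image_vector = clip.encode(Image.open("cat_photo.png"))    # 512-dim image embedding
text_vector = clip.encode("a cat sitting on a keyboard")   # same 512-dim space

# Because both vectors live in the same space, one collection could index
# images while plain text queries retrieve them (and vice versa).
```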

nickprock commented 7 months ago

@pieroit the image is just a placeholder for me 😅 I promise I will come to the multimodality meeting having studied the problem.

nicola-corbellini commented 7 months ago

Here it seems they embed with two separate models (CLIP and Ada) into two different collections, and then retrieve from each collection using the query embedded with both models, don't they?
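A rough sketch of that pattern against Qdrant (collection names are placeholders and the two collections are assumed to already exist):

```python
# Sketch of the "two collections, doubly embedded query" pattern.
# Collection names are placeholders; the collections must already exist.
from openai import OpenAI
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

qdrant = QdrantClient(host="localhost", port=6333)
clip = SentenceTransformer("clip-ViT-B-32")
openai_client = OpenAI()

query = "a diagram of the multimodal retrieval flow"

# Embed the same query twice, once per model/collection.
clip_vector = clip.encode(query).tolist()
ada_vector = openai_client.embeddings.create(
    model="text-embedding-ada-002", input=query
).data[0].embedding

image_hits = qdrant.search(collection_name="images_clip", query_vector=clip_vector, limit=5)
text_hits = qdrant.search(collection_name="texts_ada", query_vector=ada_vector, limit=5)
# The two result lists are then merged/re-ranked before being passed to the LLM.
```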

nickprock commented 7 months ago

Yes, I need to check the Qdrant docs for multimodal storage and retrieval.
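For reference, Qdrant also supports named vectors, so a single collection can hold a text vector and an image vector per point; a minimal sketch (collection name, sizes and payload are only examples):

```python
# Sketch of one Qdrant collection with named vectors for text and image embeddings.
# Collection name, vector sizes, payload and the zero vectors are illustrative only.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(host="localhost", port=6333)

client.recreate_collection(
    collection_name="multimodal_memory",
    vectors_config={
        "text": VectorParams(size=1536, distance=Distance.COSINE),   # e.g. Ada
        "image": VectorParams(size=512, distance=Distance.COSINE),   # e.g. CLIP
    },
)

client.upsert(
    collection_name="multimodal_memory",
    points=[
        PointStruct(
            id=1,
            vector={"text": [0.0] * 1536, "image": [0.0] * 512},
            payload={"source": "page_3.png", "caption": "multimodality flow"},
        )
    ],
)

# Retrieval targets one of the named vectors: (vector name, query embedding)
hits = client.search(
    collection_name="multimodal_memory",
    query_vector=("image", [0.0] * 512),
    limit=3,
)
```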

nickprock commented 1 month ago

@nicola-corbellini as discussed in the dev meeting, I am pasting some links here as a placeholder:

https://docs.llamaindex.ai/en/stable/examples/multi_modal/gpt4v_multi_modal_retrieval/

https://medium.aiplanet.com/multimodal-rag-using-llamaindex-gemini-and-qdrant-f52c5b68b367

https://qdrant.tech/documentation/examples/aleph-alpha-search/

https://colab.research.google.com/github/qdrant/examples/blob/master/qdrant_101_image_data/04_qdrant_101_cv.ipynb

nicola-corbellini commented 1 month ago

Thank you, I'll try to take a look in the next few days. Personally, I would go for supporting native multimodal models and leave hybrid/workaround solutions to plugins. Considering what @Pingdred said about the support in LangChain, we are probably too early.