deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
17.73k stars 1.92k forks

CLIP semantic image search #1058

Closed flozi00 closed 1 year ago

flozi00 commented 3 years ago

Is your feature request related to a problem? Please describe. No, it would just be cool

Describe the solution you'd like Indexing and searching for images by text

Describe alternatives you've considered Jina already does this, but since CLIP is in the latest Hugging Face release it would be cool to have it here too

Additional context I did some test runs locally with my own photos and the results were amazing. Describing images instead of using just keywords improves the performance massively, and even unusual queries work fine.

But the biggest question I have is whether you want to have vision data in this framework or not.
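The retrieval flow described above can be sketched simply: a CLIP-style model embeds both images and text queries into a shared vector space, and searching is just ranking precomputed image embeddings by cosine similarity to the query embedding. Below is a minimal, self-contained illustration with toy 3-d vectors; the file names, vectors, and `search` helper are all hypothetical, and in practice the embeddings would come from a real CLIP model (e.g. `openai/clip-vit-base-patch32` via Hugging Face `transformers`), with 512 or more dimensions.

```python
import math

def cosine(a, b):
    # cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# hypothetical precomputed image embeddings (toy 3-d vectors;
# real CLIP embeddings are 512-d or larger)
image_index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "dog.jpg": [0.1, 0.9, 0.2],
    "mountain.jpg": [0.2, 0.1, 0.9],
}

def search(query_embedding, index, top_k=2):
    # rank all indexed images by similarity to the text query embedding
    scored = [(name, cosine(query_embedding, emb)) for name, emb in index.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]

# toy embedding standing in for a query like "a dog playing outside"
results = search([0.0, 1.0, 0.3], image_index)
print([name for name, _ in results])  # → ['dog.jpg', 'mountain.jpg']
```

This also explains why descriptive queries outperform keywords: the text encoder maps a full sentence to a richer point in the shared space than a single keyword would, so the nearest image embeddings match the described scene rather than a literal tag.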

lalitpagaria commented 3 years ago

@flozi00 nice suggestion. I also wanted to suggest the same. It would be nice to support image documents, which would suit VQA, image search and other use cases.

A few concerns I have are:

Overall it would be nice to have this in Haystack in my view, but adding it will require good design discussion and proper long-term planning. Frequent breaking changes would not be good. Also, I see deepset already has its hands full, so they would need active support from the community.

I see a lot of good suggestions from the community, so how about having an experimental feature stream as a playground for these features, and graduating matured features to the mainline?

@Timoeller @tholor @PiffPaffM

tholor commented 3 years ago

It's pretty clear to me that we will eventually add other data types to Haystack. The vision here is really to build natural language interfaces to all kinds of data. This includes texts, images, tables, databases, logs ...

However, we want to nail the text case first and optimize it really end-to-end instead of allowing 5 formats with "50% solutions". TableQA is probably one of the bigger next additions, and we are actively working on it right now. So, long story short: VQA is not something we will work on in the coming weeks, but it's on the long-term roadmap.

@lalitpagaria what do you mean with experimental stream? A separate branch here in the repo?

lalitpagaria commented 3 years ago

@tholor I am aligned with the vision. My only concern is prioritization, hence my suggestion that we put a process around it. In my view the two most time-consuming (and of course critical) steps are design discussion and code review, and we should come up with a solution to address them.

Regarding the experimental stream, I mean a separate experimental module and an experimental branch, which would be rebased daily on master. Any new code, like VQA or CLIP, that is not part of the current roadmap or plan would go there. It would have nightly releases, so people could contribute under a less stringent code review and design process. Then, once every month or quarter, these features could be brought to the mainline based on user feedback and the roadmap (going through design discussion and code review at that point, of course). This is just my suggestion; I am open to other ideas as well.

INF800 commented 3 years ago

> Is your feature request related to a problem? Please describe. No, it would just be cool
>
> Describe the solution you'd like Indexing and searching for images by text
>
> Describe alternatives you've considered Jina already does this, but since CLIP is in the latest Hugging Face release it would be cool to have it here too
>
> Additional context I did some test runs locally with my own photos and the results were amazing. Describing images instead of using just keywords improves the performance massively, and even unusual queries work fine.
>
> But the biggest question I have is whether you want to have vision data in this framework or not.

Can you please share a reference link for the setup you tried? I'd like to see the results as well.

Thanks, Rakesh.

anakin87 commented 1 year ago

CLIP support was implemented by @ZanSara in #2418.

I think that this issue can be closed now.