deepset-ai / haystack

LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
Refactor Finder to allow more flexibility (multiple retrievers, pure document search ...) #544

Closed tholor closed 3 years ago

tholor commented 3 years ago

Is your feature request related to a problem? Please describe.

The finder was initially designed to wrap a single retriever and a single reader to do extractive QA. As Haystack is growing and covering more use cases we need to rethink the design to allow:

To consider:

Some thoughts to get started...

1) Single vs. multiple Finder classes?

Option A) Multiple Finders
a) splitting into faq, generative, extractive
b) DocFinder, QAFinder
=> we don't gain much except clearer init

Option B) Single Finder get_answers()

2) One vs. splitting get_answers()? generate_answer, faq_answer ...
3) Passing List of retrievers vs. Finder.add(retriever) + Finder.add(retriever) + Finder.add(reader)
4) Redesign API endpoints (e.g. doc-qa still right naming?)

I'd lean towards...
=> Single Finder
=> get_answers() and get_documents()
=> API: documents object (ids + meta)
=> remove get_answers_via_similar...()

Open question: FAQ via get_documents() or via get_answers()?
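
As a rough illustration of the single-Finder option, a minimal sketch could look like the following. This is not code from Haystack: the class layout, the `add_retriever` / `add_reader` methods, the document shape, and the plain-callable retriever and reader signatures are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical shapes for illustration, not Haystack's actual types.
Document = Dict[str, Any]  # e.g. {"id": "d1", "text": "...", "meta": {...}}
Retriever = Callable[[str, int], List[Document]]  # (query, top_k) -> documents
Reader = Callable[[str, List[Document]], List[Dict[str, Any]]]  # (query, docs) -> answers


class Finder:
    """Hypothetical single Finder: any number of retrievers, optional reader."""

    def __init__(self) -> None:
        self.retrievers: List[Retriever] = []
        self.reader: Optional[Reader] = None

    def add_retriever(self, retriever: Retriever) -> None:
        self.retrievers.append(retriever)

    def add_reader(self, reader: Reader) -> None:
        self.reader = reader

    def get_documents(self, query: str, top_k: int = 10) -> List[Document]:
        """Pure document search: query every retriever, de-duplicate by id."""
        seen = set()
        merged: List[Document] = []
        for retrieve in self.retrievers:
            for doc in retrieve(query, top_k):
                if doc["id"] not in seen:
                    seen.add(doc["id"])
                    merged.append(doc)
        return merged[:top_k]

    def get_answers(self, query: str, top_k_retriever: int = 10) -> List[Dict[str, Any]]:
        """Extractive QA: retrieve documents, then hand them to the reader."""
        if self.reader is None:
            raise ValueError("no reader configured; use get_documents() instead")
        return self.reader(query, self.get_documents(query, top_k_retriever))
```

With this shape, pure document search only needs retrievers, while `get_answers()` fails loudly when no reader has been added. Whether FAQ matching would go through `get_documents()` or `get_answers()` stays open in this sketch as well.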

guillim commented 3 years ago

We have a use case here at @etalab with @psorianom. We have already talked about it (#125), but here is a summary:

We need to combine at least 2 retrievers:

Why: Mostly because BM25 is very efficient but misses synonyms, which dense retrievers could add.

Note: It could also be interesting to combine several BM25 retrievers at some point, each tuned in its own way, so I like the idea of a modular finder.
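
One possible way to merge the results of a BM25 retriever and a dense retriever is reciprocal rank fusion, sketched below. This is an illustration of a generic technique, not something proposed in this thread or taken from Haystack: the function name `fuse_rankings`, the constant `k = 60`, and the use of bare document ids are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def fuse_rankings(rankings: Sequence[List[str]], k: int = 60, top_k: int = 10) -> List[str]:
    """Merge ranked lists of document ids with reciprocal rank fusion.

    Each document scores 1 / (k + rank) per list it appears in, so documents
    found by several retrievers move up, while scores of different retrievers
    never have to be compared directly.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to stay deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_k]


# Toy rankings standing in for the output of two retrievers.
bm25_ids = ["d1", "d2", "d3"]
dense_ids = ["d4", "d1", "d5"]
print(fuse_rankings([bm25_ids, dense_ids]))  # ['d1', 'd4', 'd2', 'd3', 'd5']
```

The same function accepts any number of rankings, so several differently tuned BM25 retrievers could be fused the same way.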

=> We are very much looking forward to having this available through the FastAPI Swagger UI

tholor commented 3 years ago

@guillim Yes, this case will definitely be covered!

lalitpagaria commented 3 years ago

Midterm: Import/Export config that describes whole setup

For the above, we could check out Hydra: https://github.com/facebookresearch/hydra
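
To make the idea of a config that describes the whole setup a bit more concrete, here is a minimal sketch using only the standard library instead of Hydra. Everything in it is hypothetical: the registry, the `ToyRetriever` and `ToyReader` classes, and the JSON layout are placeholders, and it assumes each component stores its constructor arguments as attributes of the same name.

```python
import json
from typing import Any, Dict, List

# Hypothetical registry mapping a type name in the config to a class.
REGISTRY: Dict[str, type] = {}


def register(cls: type) -> type:
    """Class decorator that makes a component available to import_config()."""
    REGISTRY[cls.__name__] = cls
    return cls


@register
class ToyRetriever:  # placeholder component, not a Haystack class
    def __init__(self, top_k: int = 10) -> None:
        self.top_k = top_k


@register
class ToyReader:  # placeholder component, not a Haystack class
    def __init__(self, model_name: str = "toy-model") -> None:
        self.model_name = model_name


def export_config(components: List[Any]) -> str:
    """Serialize components as a JSON list of {"type", "params"} entries."""
    entries = [{"type": type(c).__name__, "params": vars(c)} for c in components]
    return json.dumps(entries, indent=2)


def import_config(text: str) -> List[Any]:
    """Rebuild components from the JSON produced by export_config()."""
    return [REGISTRY[entry["type"]](**entry["params"]) for entry in json.loads(text)]


config = export_config([ToyRetriever(top_k=5), ToyReader(model_name="my-reader")])
restored = import_config(config)
```

A tool like Hydra would add composition and overrides on top of this, but the core round trip (type name plus parameters) is the same.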

tholor commented 3 years ago

Quick Update:

We are currently thinking bigger here. With Haystack, we already have many nice "lego building blocks". However, we are missing a flexible, powerful way of sticking them together. Instead of having rather rigid Finder classes, we are therefore thinking of introducing a highly flexible Pipeline class that uses a Directed Acyclic Graph under the hood (a bit like Apache Airflow).

You could add "tasks" as nodes (Retriever, Reader, Generator ...) and route your query via edges. This could cover not only all of the above use cases, but would also allow many other, more complex search pipelines that we have in mind for the future.
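
A minimal sketch of that idea, assuming nothing about the eventual Pipeline API: nodes are plain callables, edges are declared as a list of input node names, and the standard library's `graphlib.TopologicalSorter` (Python 3.9+) decides the execution order. The `Pipeline` class, `add_node`, `run`, and the node signature below are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical node signature: (query, outputs of upstream nodes) -> output.
NodeFn = Callable[[str, List[Any]], Any]


class Pipeline:
    """Toy DAG pipeline; not the Haystack implementation."""

    def __init__(self) -> None:
        self.nodes: Dict[str, NodeFn] = {}
        self.inputs: Dict[str, List[str]] = {}

    def add_node(self, name: str, fn: NodeFn, inputs: Sequence[str] = ()) -> None:
        if name in self.nodes:
            raise ValueError(f"duplicate node name: {name}")
        self.nodes[name] = fn
        self.inputs[name] = list(inputs)

    def run(self, query: str) -> Dict[str, Any]:
        for name, deps in self.inputs.items():
            for dep in deps:
                if dep not in self.nodes:
                    raise ValueError(f"node {name!r} depends on unknown node {dep!r}")
        results: Dict[str, Any] = {}
        # static_order() yields every node after all of its inputs and
        # raises graphlib.CycleError if the graph is not acyclic.
        for name in TopologicalSorter(self.inputs).static_order():
            upstream = [results[dep] for dep in self.inputs[name]]
            results[name] = self.nodes[name](query, upstream)
        return results


pipeline = Pipeline()
pipeline.add_node("bm25", lambda query, upstream: ["d1", "d2"])
pipeline.add_node("dense", lambda query, upstream: ["d2", "d3"])
pipeline.add_node(
    "join",
    lambda query, upstream: sorted({doc for docs in upstream for doc in docs}),
    inputs=["bm25", "dense"],
)
pipeline.add_node(
    "reader",
    lambda query, upstream: f"answer to {query!r} from {upstream[0][0]}",
    inputs=["join"],
)
outputs = pipeline.run("what is haystack?")
```

Here two retriever nodes feed a join node, which feeds a reader node. Leaving out the reader turns the same graph into a pure document search.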

Happy to hear your feedback on this direction!

guillim commented 3 years ago

Sounds great to me. It would indeed fit many more complex situations, and be helpful for testing combos

lalitpagaria commented 3 years ago

Nice idea @tholor

Not directly related to this, but to make it more extensible: how about adding remote API call and callback support? I am mainly thinking along the lines of the Jina framework, especially to make Haystack highly distributed by adding cloud-native support. The Generator, Retriever, Docs Cleaner, APIs etc. would each run in their own env/container/machine and communicate via RPC (gRPC or HTTP).

tholor commented 3 years ago

Yes, absolutely. Our idea is to start with a "local" Pipeline and get the API / usage straight, and then later on enable a "distributed" pipeline that executes single nodes on different containers/machines.

lalitpagaria commented 3 years ago

Awesome, let me know if I can contribute to it.

tholor commented 3 years ago

We implemented a first basic draft with #596. In the next weeks / months, we will extend it to:

We will tackle those steps in individual PRs & Issues ...