Refactor Finder to allow more flexibility (multiple retrievers, pure document search ...)

tholor commented 4 years ago

Is your feature request related to a problem? Please describe.

The Finder was initially designed to wrap a single retriever and a single reader to do extractive QA. As Haystack grows and covers more use cases, we need to rethink the design to allow:

multiple retrievers (see #125)
a generator instead of a reader to do generative QA
pure document search (retriever only, see #420)
reader only
more complex routing patterns in the future (e.g. classify if a request is a question, if yes => reader, if no => pure retriever)
optional: multiple retrievers in series instead of in parallel, inspired by passage re-ranking methods (see this blog's section on two-stage ranking)
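
The retrievers-in-series idea from the last bullet can be sketched roughly as follows. This is only an illustration, assuming a cheap first stage that selects candidates and a costlier second stage that reorders them. The names `term_overlap`, `length_normalized_overlap`, and `two_stage_search` are hypothetical stand-ins, not Haystack APIs.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def term_overlap(query: str, doc: str) -> float:
    """Cheap first-stage score: number of distinct query terms found in the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def length_normalized_overlap(query: str, doc: str) -> float:
    """Toy stand-in for an expensive reranker: overlap divided by document length."""
    words = doc.lower().split()
    return term_overlap(query, doc) / len(words) if words else 0.0


def two_stage_search(
    query: str,
    docs: List[str],
    first_stage: Scorer,
    second_stage: Scorer,
    top_k_first: int = 3,
    top_k_final: int = 1,
) -> List[str]:
    # Stage 1: keep only the best candidates according to the cheap scorer.
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:top_k_first]
    # Stage 2: rerank the much smaller candidate set with the costlier scorer.
    return sorted(candidates, key=lambda d: second_stage(query, d), reverse=True)[:top_k_final]


docs = [
    "haystack is a framework for question answering over many documents and much more",
    "question answering with haystack",
    "cooking pasta at home",
]
result = two_stage_search(
    "haystack question answering",
    docs,
    term_overlap,
    length_normalized_overlap,
    top_k_first=2,
)
```

In a real setup the first stage would typically be something like BM25 and the second stage a neural reranker; the point here is only the shape of the data flow.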

To consider:

Should be easy to exchange reader with generator (e.g. same API endpoint?)
Midterm: Import/Export config that describes whole setup

Some thoughts to get started...

1) Single vs. multiple Finder classes?

Option A) Multiple Finders
a) splitting into faq, generative, extractive
b) DocFinder, QAFinder
=> we don't gain much except a clearer init

Option B) Single Finder
get_answers()
faq, generative, extractive
can we optimize the return format to cover all cases nicely?
get_documents()

2) One vs. splitting get_answers()? generate_answer, faq_answer ...
3) Passing a list of retrievers vs. Finder.add(retriever) + Finder.add(retriever) + Finder.add(reader)
4) Redesign API endpoints (e.g. is doc-qa still the right naming?)

I'd lean towards...
=> Single Finder
=> get_answers(), get_documents()
=> API: documents object (ids+meta)
=> remove get_answers_via_similar...()
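
To make the single-Finder option a bit more concrete, here is a rough sketch of what such a class could look like. Everything in it is an assumption for illustration: the constructor signature, the `Document` shape, and the toy retriever and reader are hypothetical and do not reflect the existing Finder.

```python
from typing import Callable, Dict, List, Optional

Document = Dict[str, str]  # assumed shape: {"id": ..., "text": ...}
Retriever = Callable[[str], List[Document]]
Answerer = Callable[[str, List[Document]], List[str]]  # a reader or a generator


class Finder:
    """Hypothetical single Finder covering pure document search and QA."""

    def __init__(self, retrievers: List[Retriever], answerer: Optional[Answerer] = None) -> None:
        self.retrievers = retrievers
        self.answerer = answerer

    def get_documents(self, query: str, top_k: int = 10) -> List[Document]:
        # Query every retriever and keep the first occurrence of each document id.
        merged: Dict[str, Document] = {}
        for retrieve in self.retrievers:
            for doc in retrieve(query):
                merged.setdefault(doc["id"], doc)
        return list(merged.values())[:top_k]

    def get_answers(self, query: str, top_k: int = 10) -> List[str]:
        if self.answerer is None:
            raise ValueError("get_answers() requires a reader or generator")
        return self.answerer(query, self.get_documents(query, top_k))


# Toy components, for illustration only.
def toy_retriever(query: str) -> List[Document]:
    return [{"id": "1", "text": "Haystack is a QA framework."}]


def toy_reader(query: str, docs: List[Document]) -> List[str]:
    return [doc["text"] for doc in docs]


search_only = Finder(retrievers=[toy_retriever])
qa = Finder(retrievers=[toy_retriever, toy_retriever], answerer=toy_reader)
```

With this shape, swapping a reader for a generator only changes the `answerer` argument, which is one way to keep the same API endpoint for both.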

Open question: FAQ via get_documents() or via get_answers()

guillim commented 4 years ago

We have a use case here at @etalab with @psorianom. We have already talked about it (#125), but here is a summary:

Our need is to combine at least 2 retrievers:

sparse (BM25)
dense (e.g. SBERT or DPR ...)

Why: Mostly because BM25 is very efficient but does not retrieve synonyms, which the dense retrievers could add.

Note: It could also be interesting to combine several BM25 retrievers at some point, each tuned in its own way, so I like the idea of a modular finder.

=> We are really looking forward to having this available through the FastAPI Swagger docs
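
As an illustration of how the ranked outputs of a sparse and a dense retriever could be merged, here is a small sketch using reciprocal rank fusion (RRF). RRF is just one common option and not something this thread has decided on; the function name, the document ids, and the two example rankings are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document ids; ids ranked highly in several lists score best."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs for the query "car": the dense ranking also surfaces a synonym.
bm25_ranking = ["doc_car", "doc_vehicle_tax", "doc_bike"]
dense_ranking = ["doc_automobile", "doc_car", "doc_vehicle_tax"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents found by both retrievers rise to the top, while the synonym hit that only the dense retriever returned still makes it into the merged list.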

tholor commented 4 years ago

@guillim Yes, this case will definitely be covered!

lalitpagaria commented 4 years ago

Midterm: Import/Export config that describes whole setup

For the above, we could check out Hydra: https://github.com/facebookresearch/hydra
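
To show what an import/export config might involve, independent of the tool chosen, here is a minimal sketch using only the standard library `json` module rather than Hydra. The `REGISTRY`, the `register` decorator, and both component classes are hypothetical and only serve to demonstrate a round trip.

```python
import json
from typing import Dict, List, Type

REGISTRY: Dict[str, Type] = {}  # maps a type name in the config to a class


def register(cls: Type) -> Type:
    """Make a component class constructible from its name in a config file."""
    REGISTRY[cls.__name__] = cls
    return cls


@register
class KeywordRetriever:  # hypothetical component
    def __init__(self, top_k: int = 10) -> None:
        self.top_k = top_k


@register
class ExtractiveReader:  # hypothetical component
    def __init__(self, model_name: str = "some-model", max_answers: int = 3) -> None:
        self.model_name = model_name
        self.max_answers = max_answers


def export_config(components: List[object]) -> str:
    # Assumes every constructor argument is stored as an attribute of the same name.
    return json.dumps(
        [{"type": type(c).__name__, "params": vars(c)} for c in components],
        indent=2,
    )


def import_config(text: str) -> List[object]:
    return [REGISTRY[entry["type"]](**entry["params"]) for entry in json.loads(text)]


setup = [KeywordRetriever(top_k=5), ExtractiveReader(max_answers=1)]
restored = import_config(export_config(setup))
```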

tholor commented 4 years ago

Quick Update:

We are currently thinking bigger here. With Haystack, we already have many nice "lego building blocks". However, we are missing a flexible, powerful way of sticking them together. Instead of having rather rigid Finder classes, we are therefore thinking of introducing a highly flexible Pipeline class that uses a directed acyclic graph (DAG) under the hood (a bit like Apache Airflow).

You could add "tasks" as nodes (Retriever, Reader, Generator ...) and route your query via edges. This could cover not only all of the above use cases, but would also allow many other, more complex search pipelines that we have in mind for the future.
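
A very rough sketch of the idea, to make the discussion concrete. This is not the planned implementation: the class name `TinyPipeline`, its `add_node` / `run` signatures, and the toy components are all assumptions made up for this example.

```python
from typing import Any, Callable, Dict, List


class TinyPipeline:
    """Hypothetical DAG pipeline: nodes are callables, edges are the `inputs` lists."""

    ROOT = "Query"

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable[..., Any]] = {}
        self.inputs: Dict[str, List[str]] = {}

    def add_node(self, name: str, component: Callable[..., Any], inputs: List[str]) -> None:
        if name == self.ROOT or name in self.nodes:
            raise ValueError(f"duplicate node name: {name}")
        # Inputs must already exist, so the graph stays acyclic and
        # insertion order is a valid topological order.
        for parent in inputs:
            if parent != self.ROOT and parent not in self.nodes:
                raise ValueError(f"unknown input node: {parent}")
        self.nodes[name] = component
        self.inputs[name] = inputs

    def run(self, query: str) -> Any:
        outputs: Dict[str, Any] = {self.ROOT: query}
        last = self.ROOT
        for name, component in self.nodes.items():
            # Each node receives the outputs of its parents, in the order given.
            outputs[name] = component(*(outputs[parent] for parent in self.inputs[name]))
            last = name
        return outputs[last]


# Toy components returning document ids, for illustration only.
def sparse_retriever(query: str) -> List[str]:
    return ["d1", "d2"]


def dense_retriever(query: str) -> List[str]:
    return ["d2", "d3"]


def join(first: List[str], second: List[str]) -> List[str]:
    return list(dict.fromkeys(first + second))  # de-duplicate, keep order


def reader(query: str, doc_ids: List[str]) -> Dict[str, Any]:
    return {"query": query, "documents": doc_ids}


pipeline = TinyPipeline()
pipeline.add_node("Sparse", sparse_retriever, inputs=["Query"])
pipeline.add_node("Dense", dense_retriever, inputs=["Query"])
pipeline.add_node("Join", join, inputs=["Sparse", "Dense"])
pipeline.add_node("Reader", reader, inputs=["Query", "Join"])
result = pipeline.run("Who maintains Haystack?")
```

Dropping the `Reader` node gives pure document search, and replacing it with a generator node gives generative QA, without touching the rest of the graph.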

Happy to hear your feedback on this direction!

guillim commented 4 years ago

Sounds great to me. It would indeed fit many more complex situations, and be helpful for testing combinations of components.

lalitpagaria commented 4 years ago

Nice idea @tholor

Not directly related, but to make it more extensible: how about adding support for remote API calls and callbacks? I am mainly thinking along the lines of the Jina framework, especially to make Haystack highly distributed by adding cloud-native support. The Generator, Retriever, Docs Cleaner, APIs etc. would each run in their own env/container/machine and communicate via RPC (gRPC or HTTP).

tholor commented 4 years ago

Yes, absolutely. Our idea is to start with a "local" Pipeline to get the API / usage straight, and then later enable a "distributed" pipeline that executes single nodes on different containers/machines.

lalitpagaria commented 4 years ago

Awesome, let me know if I can contribute to it.

tholor commented 3 years ago

We implemented a first basic draft with #596. In the next weeks / months, we will extend it to:

allow more node types
have more utility functions (e.g. yml import / export)
make the underlying execution fully parallel and distributed

We will tackle those steps in individual PRs & Issues ...

deepset-ai / haystack

Refactor Finder to allow more flexibility (multiple retrievers, pure document search ...) #544