Hey @sonnylaskar ,
Absolutely, you could also create embeddings for texts other than questions and use the EmbeddingRetriever to find the most similar documents. One remark though: usually, embeddings work well if they are created for similar units of text (e.g. two questions or two passages of text). If you have different units (e.g. one keyword query and one very long document), it's trickier to get meaningful embeddings.
What is the exact use case you have in mind?
a) find most similar document given another "input" document
b) find most similar documents given a "user query"
For a) you could pretty much use the current implementation using one encoder (embedding model) for both texts, while for b) it's usually better to have two separate encoders (e.g. see #63 ).
Both should be possible in haystack without big modifications. The bigger work is probably to find models that produce good embeddings for your use case. This usually depends on domain, length of text and style of language.
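For illustration, here is a rough sketch of case a) with the current EmbeddingRetriever, using a single embedding model for both the stored articles and the incoming "input" document (the model name and parameter names are examples only and may differ between Haystack versions):

```python
# Sketch only: FAQ-style retrieval repurposed for document similarity.
# One embedding model encodes both the indexed articles and the input document.
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store = ElasticsearchDocumentStore()
retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="deepset/sentence_bert")

# Index the articles and compute one embedding per article text.
document_store.write_documents([{"content": "Full text of article 1 ..."},
                                {"content": "Full text of article 2 ..."}])
document_store.update_embeddings(retriever)

# At query time, embed the input document and fetch the most similar articles.
most_similar = retriever.retrieve(query="Full text of the input document ...", top_k=3)
```

As noted above, the main caveat is that very long article texts may go beyond what the embedding model handles well.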
Hi @tholor, this post prompted some further thoughts on use cases, and along with FB's DPR highlighted in #63, the following questions come to mind:
Appreciate your thoughts.
For case b) you will need one model to create the embeddings for your documents when you index them ("doc encoder") and then another model later at inference time to create the embedding for your incoming user question ("question encoder"). This happens in quite different phases and there's no need to really combine/concatenate two retrievers.
In Haystack you could init an `EmbeddingRetriever(embedding_model=<doc-encoder>, ...)` for indexing your documents and then, later at inference time, init a second `EmbeddingRetriever(embedding_model=<question-encoder>, ...)` with your question encoder.
(For simplicity, we could also think about adding an option here to have two embedding models in the EmbeddingRetriever.)
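A minimal sketch of that two-model setup (the model names are placeholders, and exact parameter names may differ between Haystack versions; note that both encoders need to produce vectors in the same embedding space):

```python
# Sketch only: separate doc and question encoders via two EmbeddingRetrievers.
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store = ElasticsearchDocumentStore()

# Indexing time: embed the documents with the doc encoder.
doc_retriever = EmbeddingRetriever(document_store=document_store,
                                   embedding_model="<doc-encoder>")
document_store.update_embeddings(doc_retriever)

# Inference time: embed the incoming user question with the question encoder.
question_retriever = EmbeddingRetriever(document_store=document_store,
                                        embedding_model="<question-encoder>")
results = question_retriever.retrieve(query="How do I reset my password?", top_k=5)
```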
We are currently implementing evaluation options for Reader, Retriever and Finder in #92. It should be merged soon, and you will then be able to measure different performance metrics like recall or mean average precision to compare different Retrievers / Readers and understand bottlenecks in your setup.
Maybe you have a different use case in mind, but for case b) from above you would take the embedding from the question encoder and calculate the cosine similarity to the document embeddings. Introducing a weight for one of the embeddings doesn't seem meaningful here. If you are thinking about having multiple fields that you want to compare (e.g. cosine sim. of question <-> text_1 and cosine sim. of question <-> headline), weights could be helpful, I believe.
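For concreteness, a tiny sketch outside of Haystack of what that similarity computation looks like, assuming the embeddings are already available as NumPy arrays:

```python
import numpy as np

def cosine_sim(query_vec, doc_matrix):
    """Cosine similarity between a 1-D query vector and a 2-D matrix of doc vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q

question_emb = np.random.rand(768)            # embedding from the question encoder
doc_embs = np.random.rand(1000, 768)          # embeddings from the doc encoder
scores = cosine_sim(question_emb, doc_embs)   # one similarity score per document
best_docs = np.argsort(-scores)[:10]          # indices of the 10 most similar docs
```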
Thanks @tholor. On 3., yes, the use case I mentioned involved comparing the similarity of the incoming question to two fields. In fact, what I have in mind is to calculate (cosine sim. of user query <-> FAQ question) and (cosine sim. of user query <-> FAQ answer) and somehow combine the two probabilities in order to determine which FAQ pair is best given the user query. The reason is that some context is actually contained in the FAQ's answer rather than the question (e.g. keywords, domain jargon), but is likely to be included in the user query. So just by looking at (cosine sim. of user query <-> FAQ question) it may not be the best match, but when also taking (cosine sim. of user query <-> FAQ answer) into account the right answer may become obvious. Is that something that can be handled in Haystack? Thanks.
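Concretely, the combination I have in mind looks roughly like this (just a sketch; the weight w is a hypothetical tuning parameter):

```python
# Sketch: combine the two similarities per FAQ pair with a tunable weight w.
def combined_score(sim_query_question, sim_query_answer, w=0.5):
    """Weighted mix of cosine sim(query, FAQ question) and cosine sim(query, FAQ answer)."""
    return w * sim_query_question + (1 - w) * sim_query_answer

# Example: the FAQ answer matches the query's jargon better than the question does.
print(combined_score(sim_query_question=0.42, sim_query_answer=0.88, w=0.5))  # 0.65
```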
Ok got it. This isn't implemented yet in Haystack but it would fit in the scope and I could see two options for the implementation:
a) Concatenation of Retrievers: make the Finder accept a list of Retrievers and combine their scores (incl. weights) before feeding the results to the Reader (see the sketch after this list).
Pro: most generic solution, also allowing the combination of BM25 + EmbeddingRetriever.
Con: two separate retriever queries (suboptimal efficiency in some cases).
b) New MultiEmbeddingRetriever: extend the EmbeddingRetriever to the case of multiple embeddings.
Pro: simple implementation without side effects on the regular user interface.
Con: won't allow the combination of other retriever methods (e.g. ElasticsearchRetriever + EmbeddingRetriever).
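To make option a) more concrete, here is a purely hypothetical sketch (not existing Haystack API); it assumes each retriever exposes a retrieve(query, top_k) method returning documents with .id and .score attributes:

```python
# Hypothetical sketch of option a): query several retrievers and merge their
# scores with per-retriever weights. In practice the raw scores (BM25 vs.
# cosine similarity) would also need to be normalized to a common scale.
from collections import defaultdict

def combined_retrieve(retrievers, weights, query, top_k=10):
    scores = defaultdict(float)
    docs = {}
    for retriever, weight in zip(retrievers, weights):
        for doc in retriever.retrieve(query=query, top_k=top_k):
            scores[doc.id] += weight * doc.score
            docs[doc.id] = doc
    ranked = sorted(docs.values(), key=lambda d: scores[d.id], reverse=True)
    return ranked[:top_k]
```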
I am leaning towards a), but could see b) as a simpler shorter solution. Would you be interested in implementing one of them in a PR, @predoctech ? We could help you along the way. Otherwise we'll put it in our backlog and try to implement it in a few weeks from now.
Thanks @tholor. Conceptually I'd think a) is much more powerful. You are right that, even though I didn't mention the ElasticsearchRetriever initially, it is something that fits into our use case as well. Ideally the choice of the various Retrievers, and correspondingly their weights, may shift according to the scores coming out of each iteration. For instance, if the user query is an exact FAQ question with a 1.0 probability, then only the Retriever on "question_emb" is ever needed. However, if the probability is low, maybe "answer_emb" will kick in, and further down the road BM25 will be utilized on docs other than the FAQ dataset. Only a) can provide this sort of flexibility, I think.
Unfortunately a) will be out of my league given my skill level. I will try to understand Haystack better to see if I stand a chance with b).
@tholor I also wish to follow up on one remark you made above in reply #2: "embeddings work well if they are created for similar units of text". So I suppose by "units" you refer to the token sequence length used by the pre-trained model. In that case, what is the number for models like sentence_bert? The more important question, maybe, is whether there is any way to alter that length. Ideally, for an EmbeddingRetriever use case, the best performance will be when both the user query and the FAQ question contain a similar number of tokens, and the context/semantics are grouped in a similar fashion across multiple sequences between those two queries?
@predoctech sorry, somehow missed your follow-up questions here.
> So I suppose by "units" you refer to the token sequence length used by the pre-trained model.

1) The texts that you convert at inference time should be as similar as possible to the ones used for training (ideally in length and language style). 2) Some models can also produce meaningful embeddings for texts that are longer than the ones seen at training time. Even in that case: if you now create one embedding for a short text (e.g. a question) and one for a long text (e.g. a passage), the cosine similarity of the two might not be very meaningful.
> In that case, what is the number for models like sentence_bert?
You could check the data that was used for training them. It's mostly NLI datasets working on a "sentence level". So their optimal performance will probably be on this unit of "one sentence" (e.g. a question) and might degrade for longer passages (e.g. a long FAQ answer).
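If you want to check what a specific sentence-transformers model is configured for, you can inspect its maximum sequence length (a small sketch; the model name is just an example, and the value depends on the model you load):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")
print(model.max_seq_length)  # max number of tokens encoded per text; longer inputs get truncated
```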
> The more important question, maybe, is whether there is any way to alter that length. Ideally, for an EmbeddingRetriever use case, the best performance will be when both the user query and the FAQ question contain a similar number of tokens, and the context/semantics are grouped in a similar fashion across multiple sequences between those two queries?
I would not worry too much if your user questions and FAQ questions differ by a few tokens. As mentioned above, it's more severe if you really compare a full passage with your single sentence. The best way to "alter that length" is to use two different encoders (as here https://github.com/deepset-ai/haystack/issues/63).
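For illustration, a minimal sketch of such a two-encoder setup using Haystack's DensePassageRetriever (import paths and parameter names vary between Haystack versions; the model names are the standard facebook/dpr checkpoints):

```python
# Sketch only: separate question and passage encoders (DPR-style).
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import DensePassageRetriever

document_store = ElasticsearchDocumentStore()
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)

# Index documents with the passage encoder, query with the question encoder.
document_store.update_embeddings(retriever)
candidates = retriever.retrieve(query="Who developed the theory of relativity?", top_k=5)
```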
Hope this helps. Closing this for now and creating a new issue for the feature request of "Combining multiple retrievers". Feel free to reopen if there's more on your mind related to this particular discussion here :)
Hi,
I am wondering if FAQ-style retrieval can also be used for a document similarity use case in information retrieval.
Use case: say one has lots of articles stored in Elasticsearch and, given an input, wants to find the closest matching article. This is a common document similarity use case. The user could follow the FAQ-style retrieval approach and create an embedding from the article text field (let's treat that as the question field used in the FAQ tutorial), and an incoming input could then be matched against these embeddings to retrieve the best matches. That match could be considered the most similar document given the input. The embedding creation process might be slow because the articles can be very long.
Please comment on what you think about this use case and this implementation using Haystack, or whether you would suggest some other approach.
Thanks