deepset-ai / haystack

:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Sliding window for Dense Passage Retriever #461

Closed jonas-nothnagel closed 3 years ago

jonas-nothnagel commented 3 years ago

Question

Hello good people from haystack,

I am currently experimenting with the Dense Passage Retriever and wanted to ask a quick question. Assume I am working with longer documents (>1000 characters) and use a reader with a sliding window, for instance: `reader = FARMReader(..., max_seq_len=512, doc_stride=50, ...)`

Wouldn't it make sense to also use a sliding window when embedding the documents for the retrieval stage with the Dense Passage Retriever? I see that I can set `max_seq_len`, but there is no sliding window option. Is there a reason why the underlying BERT model does not allow for a sliding window?

tholor commented 3 years ago

Hey @jonas-nothnagel ,

You are right that DPR cuts texts at `max_seq_len`, so only those first tokens are considered when creating the embedding. The current best practice is to do the splitting of docs at preprocessing time, i.e.:

1. Split your docs into smaller chunks (e.g. a hard cut at 100 words, as in the original DPR paper).
2. Add them to the document store via `document_store.write_documents()`.
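The two steps above can be sketched in plain Python. The chunking function is a minimal assumption of the "hard cut at 100 words" approach; `document_store` stands in for any Haystack document store instance and is only shown in a comment:

```python
def split_by_words(text: str, chunk_size: int = 100) -> list[dict]:
    """Step 1: hard cut at `chunk_size` words, as in the original DPR paper."""
    words = text.split()
    return [
        {"text": " ".join(words[i:i + chunk_size])}
        for i in range(0, len(words), chunk_size)
    ]

docs = split_by_words("some long document text ...", chunk_size=100)
# Step 2 (hypothetical store instance): document_store.write_documents(docs)
```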

Doing this at preprocessing time rather than in the retriever itself via a sliding window has a few advantages.

We are already working on simpler preprocessing in Haystack (#378) and want to include further splitting options (e.g. word-based splits, respecting sentence boundaries, ...).

nsankar commented 3 years ago

@jonas-nothnagel
@tholor This is an interesting and perhaps fairly complex data preprocessing problem to solve in order to get meaningful answers instead of partial ones (i.e., respecting sentence boundaries in the text while staying within `max_seq_len`). This is one of the areas where I am stuck with DPR in terms of answer relevance, and I am exploring preprocessing options. I am currently trying sentence detection with custom boundaries using spaCy, as shown in an example on this site: https://realpython.com/natural-language-processing-spacy-python/
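For illustration only, here is a minimal stand-in for the sentence boundary detection step: a regex split on terminal punctuation. spaCy's sentencizer (as in the linked tutorial) is far more robust and supports the custom boundaries mentioned above; this sketch just shows the shape of the operation:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence split: break after ., ! or ? followed by whitespace.
    A real pipeline would use spaCy's sentence segmentation instead."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```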

It would be good to try this once #378 is available.

jonas-nothnagel commented 3 years ago

In general, it feels like data preprocessing is really the key for a successful QA system (and probably almost all ML tasks).

@nsankar Assuming we use something like spaCy and chunk the text into sentences (further assuming our text comes in a well-processable format with proper punctuation and sentences), would you consider re-merging sentences back into longer paragraphs (<512 tokens, of course) before feeding them into DPR? I have no computational evidence for this, but I feel there must be a threshold where providing too many candidates to a retriever significantly worsens performance.
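One simple way to realize the re-merging described above is a greedy pack: append consecutive sentences to a passage until a token budget is hit, then start a new passage. This sketch uses whitespace tokens as a rough proxy for model tokens (real counts would come from the model's tokenizer):

```python
def pack_sentences(sentences: list[str], max_tokens: int = 512) -> list[str]:
    """Greedily merge consecutive sentences into passages of at most
    `max_tokens` whitespace-separated tokens."""
    passages, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            passages.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        passages.append(" ".join(current))
    return passages
```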

Perhaps it would also be interesting to base the preprocessing on the kind of questions we want to ask. If we are sure that the answers to our questions are single tokens contained in clearly distinguishable sentences, then splitting at the sentence level may be useful; however, if the answers are mostly longer phrases or whole sentences, then we probably should not chunk down to individual sentences.

Excited to see outcomes of https://github.com/deepset-ai/haystack/issues/378

nsankar commented 3 years ago

@jonas-nothnagel As to your question about longer paragraphs, this is exactly what I have been thinking about: either we extract paragraphs within a token length of 512 or smaller, or we assemble the sentences into paragraphs. Once there is a way to prepare text in this manner, I believe it will require extensive testing and observation with different types and sizes of content.

Also, as you said, the type of question/answer can matter, for instance which sorts of answers are relevant in factoid vs. non-factoid scenarios.

tholor commented 3 years ago

> I do not have a computational base for this argument, but I feel like there must be threshold, where providing too many candidates for a retriever will also significantly worsen the performance.

I have a similar gut feeling that single sentences would be problematic for many retrievers. Even when the answer is contained in a single sentence, the bigger "topic" / "context" often only becomes clear from a few surrounding sentences.

> Once there is a way to prepare text in this manner, I believe it requires some extensive testing and observations with different types of content / size.

Yep, I don't think there will be a single perfect splitting + cleaning config for all datasets and languages. However, some experiments here could help us understand better which preprocessing option is preferable in which scenario.

We are currently implementing a few options in #473

tholor commented 3 years ago

A basic version was implemented in #473 that, for example, allows splitting longer texts into chunks of 100 words with a sliding window while respecting sentence boundaries (i.e., not splitting in the middle of a sentence).
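The sliding-window part of that splitting can be sketched as follows. This is not the Haystack PreProcessor API itself, just a minimal illustration of overlapping word-level windows; the parameter names `split_length` and `split_overlap` are used here for clarity:

```python
def sliding_window_split(text: str, split_length: int = 100,
                         split_overlap: int = 10) -> list[str]:
    """Split `text` into word chunks of `split_length`, where consecutive
    chunks share `split_overlap` words (a sliding window)."""
    words = text.split()
    stride = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # last window already covers the end of the text
    return chunks
```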

tholor commented 3 years ago

See docs for usage details: https://haystack.deepset.ai/docs/latest/preprocessingmd#PreProcessor