Closed: sbhttchryy closed this issue 4 years ago
Hey @sbhttchryy,
Interesting questions! Happy to answer / elaborate on them:
Hope this helps!
Hey, I have similar questions too.
My background
I have been working on QA as a retrieval task for about a year now. My focus is on the "retriever" part, and I have a good understanding of models like DenSPI, SPARC, DPR, etc. Recently I tried applying the DPR retriever to the TechQA dataset (from IBM), and to my surprise the results were not great (MRR 0.02), far lower than the BM25 results.
I have a few questions:
How does a dense retriever work on custom (non-wiki) datasets without fine-tuning? Has anyone tried this before? If so, I would like to know the results. Is it better than or close to BM25?
To get a higher MRR score, what approaches do you recommend? Should we somehow combine sparse (BM25) and dense (BERT or DPR) results to get the final ranking, like DenSPI / SPARC do?
Dear @tholor, 2 and 3 answer my question perfectly. As for 1, what I meant is: for one corpus, I want to implement both an FAQ-based QA system and an extractive QA system. I have manual annotators who are going to annotate the corpus. I don't know whether historical user questions will be available. To annotate for both systems from scratch, do you have any suggestions for better results? Did you have any specific mechanism to reduce bias during the annotation process? Did you use negative examples? Thank you very much again. ^__^
How does the dense retriever work on custom datasets (non-wiki) without fine-tuning?
I haven't seen any metrics published for DPR on other datasets, but I would also be curious to see how it performs there. We are still evaluating whether it's worth switching to DPR in our client deployments. From some first impressions, we had promising results in the financial domain. However, I also believe the full power will only come from fine-tuning DPR on the target domain. We are working on such an option in Haystack right now (#273).
To get a higher MRR score, what approaches do you recommend?
I think fine-tuning DPR plus combining it with sparse methods is very promising. The combination with sparse could happen either by mixing the documents (e.g. top 5 from DPR + top 5 from sparse) or by combining scores. We have this on the roadmap for Haystack later this year (see #125).
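Just to illustrate the two combination strategies mentioned above, here is a minimal sketch in plain Python (not a Haystack API; the input format of (doc_id, score) pairs is an assumption):

# Minimal sketch of combining sparse (BM25) and dense (DPR) retriever results.
# Assumes each retriever returns a ranked list of (doc_id, score) tuples.

def mix_documents(sparse_results, dense_results, k=5):
    # Strategy 1: take the top-k documents from each retriever and deduplicate.
    merged, seen = [], set()
    for doc_id, _ in sparse_results[:k] + dense_results[:k]:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged

def combine_scores(sparse_results, dense_results, alpha=0.5):
    # Strategy 2: min-max normalize each score list, then interpolate linearly.
    def normalize(results):
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        return {d: (s - lo) / (hi - lo + 1e-9) for d, s in results}

    sparse = normalize(sparse_results)
    dense = normalize(dense_results)
    combined = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
                for d in set(sparse) | set(dense)}
    return sorted(combined, key=combined.get, reverse=True)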
@sbhttchryy There are a lot of things you can do to improve labeling quality. Elaborating on all of that would probably be more of a whole blog article / paper than a comment here, but I will try to share a few key points for extractive QA:
As mentioned, there's much more. Maybe we'll cover that in a future blog post ...
These resources might also be helpful for you:
Recently I tried applying the DPR retriever to the TechQA dataset (from IBM), and to my surprise the results were not great (MRR 0.02), far lower than the BM25 results.
@RamanRajarathinam One more thing that came to my mind: how long were the passages from TechQA that you indexed with DPR? DPR is trained on 100-word passages and will cut everything after max_seq_len. We saw that (not surprisingly) performance drops quite significantly for longer docs in some of our experiments.
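As a rough illustration (a simple word-based splitter, not Haystack's built-in preprocessing), long documents could be split into ~100-word passages before indexing so that nothing is silently cut off at max_seq_len:

# Split a long document into ~100-word passages and index each passage as its
# own document, matching the passage length DPR was trained on.

def split_into_passages(text, words_per_passage=100):
    words = text.split()
    return [" ".join(words[i:i + words_per_passage])
            for i in range(0, len(words), words_per_passage)]

long_document = "You wanted to install Web GUI 8.1 FP7, but your DASH version ..."  # example text
passages = split_into_passages(long_document)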
Hi @tholor, just FYI - The TechQA Dataset paper. The dataset contains technical questions posted by users and the discussions that happened on the forums.
In our experiment, we trimmed each paragraph to 1000 characters and passed it to the Hugging Face tokenizer (max_seq_length=512). We also tried sentence-transformers on the same dataset and observed that the results (accuracy and MRR) were higher than with DPR.
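For illustration, the preprocessing described above might look roughly like this (the DPR checkpoint name is an assumption, not necessarily the model used in the experiment):

from transformers import AutoTokenizer

# Rough sketch of the preprocessing described above.
tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

paragraph = "You wanted to install Web GUI 8.1 FP7, but your DASH version does not meet the required version. ..."
trimmed = paragraph[:1000]  # keep only the first 1000 characters of each paragraph
encoded = tokenizer(trimmed, truncation=True, max_length=512)  # tokens beyond 512 are cut off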
Attaching the rough metrics. FYI - we didn't fine-tune any model. Accuracy tells whether the gold passage is present in the retrieved top 10 or not, and the MRR is MRR@10.
{
"sentence_transformers": {
"accuracy": 0.187,
"mrr": 0.117
},
"dpr": {
"accuracy": 0.025,
"mrr": 0.007
},
"BM25": {
"accuracy": 0.731,
"mrr": 0.513
}
}
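For reference, a minimal sketch of the two metrics as defined above (hypothetical helper functions, not the evaluation code actually used here):

# accuracy@10: 1 if the gold passage appears in the top 10 retrieved passages, else 0.
# MRR@10: reciprocal rank of the gold passage, 0 if it is not in the top 10.

def accuracy_at_k(retrieved_ids, gold_id, k=10):
    return 1.0 if gold_id in retrieved_ids[:k] else 0.0

def mrr_at_k(retrieved_ids, gold_id, k=10):
    for rank, doc_id in enumerate(retrieved_ids[:k], start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0

def evaluate(queries, k=10):
    # queries is assumed to be a list of (retrieved_ids, gold_id) pairs.
    acc = sum(accuracy_at_k(r, g, k) for r, g in queries) / len(queries)
    mrr = sum(mrr_at_k(r, g, k) for r, g in queries) / len(queries)
    return {"accuracy": acc, "mrr": mrr}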
Thanks for sharing these metrics @RamanRajarathinam - super interesting! Two observations stand out: a) the huge gap between BM25 and the dense methods, and b) sentence-transformers performing significantly better than DPR.
I had a quick look at the dataset trying to understand potential reasons.
# Example Question from dataset
"QUESTION_ID": "DEV_Q000",
"QUESTION_TITLE": "Web GUI 8.1 FP7 requires DASH 3.1.2.1 or later",
"QUESTION_TEXT": "\n\nYou wanted to install Web GUI 8.1 FP7, but your DASH version does not meet the required version.\n\nIM 1.8 displayed the following:\n\n ERROR: The installed IBM Dashboard Application Services Hub version is 3.1.0.3, but requires version 3.1.2.1 or later.\n\n",
A few hypotheses from my side:
@RamanRajarathinam Would you be interested in fine-tuning & evaluating a DPR model on TechQA once we have this feature available in Haystack? We are currently also preparing a benchmark website (speed + accuracy), where such results could be interesting to include.
@tholor Sure! I would be happy to test this feature.
Great! I will close this issue for now as the original questions have been addressed, but we should investigate the transferability of DPR to other domains further in the future. I will ping you once we have DPR training implemented, and we'll see if we can share some experiences from our customers.
Hello developers, I have three questions on three different topics, which is why I thought it might be better to club them together in one issue.
For developing an FAQ-style QA system on our custom dataset, are there any specifications the annotators must abide by for better results?
Do you guys plan to make something suitable for a real-time QA system like https://covidask.korea.ac.kr/ ?
For BERTserini (https://arxiv.org/abs/1902.01718), the retriever is based on Anserini. Have you guys experimented with Elasterini? If yes, has there been any significant improvement?
Thank you very much.