Closed: sbhttchryy closed this issue 4 years ago
Hey @sbhttchryy,
Interesting questions! Happy to answer / elaborate on them:
Hope this helps!
Hey, I have similar questions too.
My background
I have been working on QA as a retrieval task for about a year now. My focus is on the "retriever" part, and I have a good understanding of models like DenSPI, SPARC, DPR, etc. Recently I tried applying the DPR retriever to the TechQA dataset (from IBM), and to my surprise the results were not great (MRR 0.02), far lower than the BM25 results.
I have a few questions:
How does a dense retriever work on custom (non-wiki) datasets without fine-tuning? Has anyone tried this before? If so, I would like to know the results. Is it better than or close to BM25?
To get a higher MRR score, what approaches do you recommend? Should we somehow combine sparse (BM25) and dense (BERT or DPR) results to get the final ranking, like DenSPI / SPARC do?
Dear @tholor, 2 and 3 answer my question perfectly. As for 1, what I meant is: for one corpus, I want to implement both an FAQ-based QA system and an extractive QA system. I have manual annotators who are going to annotate the corpus. I don't know whether historical user questions will be available. To annotate for both systems from scratch, do you have any suggestions for better results? Did you have any specific mechanism to reduce bias during the annotation process? Did you use negative examples? Thank you very much again. ^__^
How does the dense retriever work on custom datasets (non-wiki) without fine-tuning?
I haven't seen any metrics published for DPR on other datasets, but I would also be curious to see how it performs there. We are still evaluating whether it's worth switching to DPR in our client deployments. From some first impressions, we had promising results in the financial domain. However, I also believe the full power will only come from fine-tuning DPR on the target domain. We are working on such an option in Haystack right now (#273).
To get a higher MRR score, what approaches do you recommend?
I think fine-tuning DPR plus combining it with sparse methods is very promising. The combination with sparse could happen either by mixing the documents (e.g. top 5 from DPR + top 5 from sparse) or by combining scores. We have this on the roadmap for Haystack later this year (see #125).
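Just to illustrate the two combination strategies mentioned above, here is a minimal sketch in plain Python (not a Haystack API; the input format of (doc_id, score) pairs is an assumption):

# Minimal sketch of combining sparse (BM25) and dense (DPR) retriever results.
# Assumes each retriever returns a ranked list of (doc_id, score) tuples.

def mix_documents(sparse_results, dense_results, k=5):
    # Strategy 1: take the top-k documents from each retriever and deduplicate.
    merged, seen = [], set()
    for doc_id, _ in sparse_results[:k] + dense_results[:k]:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged

def combine_scores(sparse_results, dense_results, alpha=0.5):
    # Strategy 2: min-max normalize each score list, then interpolate linearly.
    def normalize(results):
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        return {d: (s - lo) / (hi - lo + 1e-9) for d, s in results}

    sparse = normalize(sparse_results)
    dense = normalize(dense_results)
    combined = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
                for d in set(sparse) | set(dense)}
    return sorted(combined, key=combined.get, reverse=True)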
@sbhttchryy There are a lot of things you can do to improve labeling quality. Elaborating on all of that would probably be more of a whole blog article / paper than a comment here, but I will try to share a few key points for extractive QA:
As mentioned, there's much more. Maybe we'll cover that in a future blog post ...
These resources might also be helpful for you:
Recently I tried applying the DPR retriever to the TechQA dataset (from IBM), and to my surprise the results were not great (MRR 0.02), far lower than the BM25 results.
@RamanRajarathinam One more thing that came to my mind: how long were the passages from TechQA that you indexed with DPR? DPR is trained on 100-word passages and will cut everything after max_seq_len. We saw that (not surprisingly) performance drops quite significantly for longer docs in some of our experiments.
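As a rough illustration (a simple word-based splitter, not Haystack's built-in preprocessing), long documents could be split into ~100-word passages before indexing so that nothing is silently cut off at max_seq_len:

# Split a long document into ~100-word passages and index each passage as its
# own document, matching the passage length DPR was trained on.

def split_into_passages(text, words_per_passage=100):
    words = text.split()
    return [" ".join(words[i:i + words_per_passage])
            for i in range(0, len(words), words_per_passage)]

long_document = "You wanted to install Web GUI 8.1 FP7, but your DASH version ..."  # example text
passages = split_into_passages(long_document)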
Hi @tholor, just FYI - The TechQA Dataset paper. The dataset contains technical questions posted by users and the discussions that happened on the forums.
In our experiment, we trimmed each paragraph to 1000 characters and passed it to the Hugging Face tokenizer (max_seq_length=512). We also tried sentence-transformers on the same dataset and observed that the results (accuracy and MRR) were higher than with DPR.
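For illustration, the preprocessing described above might look roughly like this (the DPR checkpoint name is an assumption, not necessarily the model used in the experiment):

from transformers import AutoTokenizer

# Rough sketch of the preprocessing described above.
tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

paragraph = "You wanted to install Web GUI 8.1 FP7, but your DASH version does not meet the required version. ..."
trimmed = paragraph[:1000]  # keep only the first 1000 characters of each paragraph
encoded = tokenizer(trimmed, truncation=True, max_length=512)  # tokens beyond 512 are cut off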
Attaching the rough metrics. FYI - we didn't fine-tune any model. Accuracy tells whether the gold passage is present in the retrieved top 10 or not, and the MRR is MRR@10.
{
"sentence_transformers": {
"accuracy": 0.187,
"mrr": 0.117
},
"dpr": {
"accuracy": 0.025,
"mrr": 0.007
},
"BM25": {
"accuracy": 0.731,
"mrr": 0.513
}
}
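For reference, a minimal sketch of the two metrics as defined above (hypothetical helper functions, not the evaluation code actually used here):

# accuracy@10: 1 if the gold passage appears in the top 10 retrieved passages, else 0.
# MRR@10: reciprocal rank of the gold passage, 0 if it is not in the top 10.

def accuracy_at_k(retrieved_ids, gold_id, k=10):
    return 1.0 if gold_id in retrieved_ids[:k] else 0.0

def mrr_at_k(retrieved_ids, gold_id, k=10):
    for rank, doc_id in enumerate(retrieved_ids[:k], start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0

def evaluate(queries, k=10):
    # queries is assumed to be a list of (retrieved_ids, gold_id) pairs.
    acc = sum(accuracy_at_k(r, g, k) for r, g in queries) / len(queries)
    mrr = sum(mrr_at_k(r, g, k) for r, g in queries) / len(queries)
    return {"accuracy": acc, "mrr": mrr}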
Thanks for sharing these metrics @RamanRajarathinam - super interesting! Two observations stand out: a) the huge gap between BM25 and the dense methods, and b) sentence-transformers performing significantly better than DPR.
I had a quick look at the dataset trying to understand potential reasons.
# Example Question from dataset
"QUESTION_ID": "DEV_Q000",
"QUESTION_TITLE": "Web GUI 8.1 FP7 requires DASH 3.1.2.1 or later",
"QUESTION_TEXT": "\n\nYou wanted to install Web GUI 8.1 FP7, but your DASH version does not meet the required version.\n\nIM 1.8 displayed the following:\n\n ERROR: The installed IBM Dashboard Application Services Hub version is 3.1.0.3, but requires version 3.1.2.1 or later.\n\n",
A few hypotheses from my side:
@RamanRajarathinam Would you be interested in fine-tuning & evaluating a DPR model on TechQA once we have this feature available in Haystack? We are currently also preparing a benchmark website (speed + accuracy), where such results could be interesting to include.
@tholor Sure! I would be happy to test this feature.
Great! I will close this issue for now as the original questions have been addressed, but we should investigate the transferability of DPR to other domains further in the future. I will ping you once we have DPR training implemented, and we'll see if we can share some experiences from our customers.
Hello developers, I have three questions on three different topics, which is why I thought it might be better to club them together in one issue.
For developing an FAQ-style QA system on our custom dataset, are there any specifications the annotators must abide by for better results?
Do you guys plan to make something suitable for a real-time QA system like https://covidask.korea.ac.kr/ ?
For BERTserini (https://arxiv.org/abs/1902.01718), the retriever is based on Anserini. Have you guys experimented with Elasterini? If yes, has there been any significant improvement?
Thank you very much.