Wow, those are very interesting tests you are doing there. Since you are using the same Reader I presume you are getting back very different documents from the different retrievers you tried.
Happy to discuss this further here, and in more depth if needed.
Perhaps the way we do the (text) data splitting (preprocessing) has a say in improving the models and predictions. For instance, when I used NLTK's punkt tokenizer on 9 PDF documents, each ~70 pages, I could get a very accurate answer to a difficult question with DPR and the Embedding Retriever. Later, when I used spaCy's tokenizer to split the text data, I could not get anywhere close to the same accuracy...
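To illustrate what I mean by the splitting difference, this is roughly how the two sentence splitters compare (a minimal sketch; the text is just a placeholder, not my actual data):

```python
# Minimal comparison of the two sentence splitters mentioned above
# (placeholder text instead of the real PDF content).
import nltk
import spacy

nltk.download("punkt")  # punkt sentence tokenizer models

text = "Echo is the reflection of sound. It arrives after the direct sound."

# 1) NLTK punkt
nltk_sentences = nltk.sent_tokenize(text)

# 2) spaCy sentence segmentation from a small English pipeline
nlp = spacy.load("en_core_web_sm")
spacy_sentences = [sent.text for sent in nlp(text).sents]

print(nltk_sentences)
print(spacy_sentences)
```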
- Do the different answers make sense in most cases?
- How long are your documents? If they are long, think about splitting them up before indexing.
- Did you increase the top_k_retriever parameter? With a higher value, the reader should get a more similar set of documents to look for the actual answer across the retrievers (see the sketch after this list).
- Did you try with other questions? E.g. longer questions that you then reformulate? There I would suspect the DPR performing most consistently.
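For example, increasing top_k_retriever is just a parameter on the query; a rough sketch, assuming the Haystack 0.x-era Finder API and with the retriever and reader already constructed:

```python
# Rough sketch (Haystack 0.x-era API); `retriever` and `reader` are assumed
# to be constructed already, e.g. DensePassageRetriever + FARMReader.
from haystack import Finder

finder = Finder(reader=reader, retriever=retriever)

# A larger top_k_retriever hands the reader more candidate documents, which
# usually makes results more comparable across different retrievers.
prediction = finder.get_answers(
    question="What is echo?",
    top_k_retriever=10,  # e.g. compare 5 vs. 10
    top_k_reader=5,
)

for answer in prediction["answers"]:
    # Exact result keys depend on the Haystack version, hence .get()
    print(answer["answer"], answer.get("score"), answer.get("document_id"))
```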
Thank you @Timoeller for your helpful response. I will respond to each of your points (quoted below) in order.
Since you are using the same Reader I presume you are getting back very different documents from the different retrievers you tried.
Yes, with the same reader I get different sets of documents from the different retrievers. For example, for the related questions 'What is echo?' and 'What is the definition of echo?', predicting answers with DPR and the Roberta2 FARM reader (top_k_retriever=5, top_k_reader=3), I get answers from documents A, B, C for Q = 'What is echo?' and from documents A, B, D for Q = 'What is meant by echo?'. So the two predictions can have documents in common (A, B), but document C appears in the first case and not in the second. The desired answer also has a different rank in each case.
Do the different answers make sense in most cases?
Not exactly. In some cases the desired answer is ranked low, but the same answer is ranked first for a different (though closely related) question, with the same DPR-FARM reader combination and the same top_k_retriever and top_k_reader.
How long are your documents? If they are long, think about splitting them up before indexing.
I have documents ranging from 10 to 100 pages. I will definitely implement your suggestion of splitting the documents before storing them in the Elasticsearch document store.
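Roughly, I plan to do something like the following (just a sketch, assuming the Haystack 0.x-era ElasticsearchDocumentStore and its dict format; load_document_text is a placeholder for my own PDF-to-text step and the paragraph splitting is deliberately simple):

```python
# Sketch: split long documents into paragraph chunks before indexing.
# (Haystack 0.x-era ElasticsearchDocumentStore; dict field names may differ
# in newer versions. load_document_text is a placeholder for my own
# PDF-to-text extraction.)
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", index="document")

def split_into_paragraphs(text):
    # Very simple splitting criterion: blank-line separated paragraphs.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

dicts = []
for doc_name, full_text in load_document_text():  # placeholder: yields (name, text)
    for i, paragraph in enumerate(split_into_paragraphs(full_text)):
        dicts.append({
            "text": paragraph,
            "meta": {"name": doc_name, "paragraph_id": i},
        })

document_store.write_documents(dicts)
```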
Did you increase the top_k_retriever parameter? Then the reader should get a more similar set of documents to look for the actual answer across the retrievers.
I tried increasing top_k_retriever to 10 for the DPR-FARM retriever-reader combination. It performs better than top_k_retriever = 5 for almost all questions, but it fails to predict the desired answer for one particular question.
Decent results (DPR + FARM reader, top_k_retriever = 10, top_k_reader = 5):
- Q) What is echo?
- Q) What is meant by echo?
- Q) What is the definition of echo?

Bad results (DPR + FARM reader, top_k_retriever = 10, top_k_reader = 5):
- Q) What means by Echo?
Did you try with other questions? E.g. longer questions that you then reformulate? There I would suspect the DPR performing most consistently.
Let me try your suggestion of a longer question and see the results with DPR.
Thanks for the detailed report. Judging by the answers, I think they all make sense, except for the answers to the "What means by echo?" question, which is a bit unspecific, to be honest.
Looking forward to seeing whether longer and more elaborate questions can help narrow down the answers. A suggestion would be to ask: "How is echo defined during a telephone call?", since this is the information you appear to be expecting.
Thank you @Timoeller for your response again. I need help regarding the size of the documents that I am storing in the Elasticsearch document store. Can you please suggest what the size or splitting criteria should be to get better predictions?
Regarding size: I cannot answer your question because it depends heavily on the input text + questions + model and parameters you are using. Regarding splitting criteria, you might want to split the text into chunks that share the same meaning, like paragraphs. Here we have a conversion function where you can insert custom preprocessing and split paragraphs (in a very simple way).
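Roughly, that conversion step looks like this (a sketch; the exact import paths have moved between Haystack versions, so adjust to your installed version):

```python
# Sketch of the simple paragraph-splitting conversion (Haystack 0.x-era helpers;
# import paths such as haystack.preprocessor.utils vary between releases).
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts

dicts = convert_files_to_dicts(
    dir_path="data/my_docs",      # folder with your text files
    clean_func=clean_wiki_text,   # or plug in your own custom cleaning function
    split_paragraphs=True,        # one dict per paragraph instead of per file
)

document_store.write_documents(dicts)  # document_store defined as before
```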
If you have a larger annotated test set, you could try different values and let the data decide. Is that an option? If you need help creating QA datasets, you might be interested in our annotation tool: https://annotate.deepset.ai/
Hey @zshnhaque, any progress on splitting the data files and getting more consistent results when asking longer questions?
Hello @Timoeller, thank you for checking in on my findings. With longer, more specific questions I am able to get better results with the Elasticsearch and TF-IDF retrievers + the FARM-based Roberta reader; with the DPR and Embedding retrievers + FARM Roberta, results are only satisfactory (the desired answers are ranked low).

Secondly, I found that splitting the documents is not strictly necessary; rather, cleaning my documents is helping me. By cleaning I mean removing unnecessary sentences or text. Currently I am looking to improve my results for questions where I am seeking a long answer.

Also, I had tried the deepset annotator earlier, and I found there is now an updated version with an 'annotation mode' option specifically for question answering. Once I complete my cleaning process, I will start annotating for fine-tuning. So as of now, long questions (provided they are specific) give better results with Elasticsearch and the TF-IDF retriever, and I hope I can replicate a similar scenario with DPR too. I will update here soon, once I see favorable results from cleaning the docs and implementing the annotator + fine-tuning. Once again, thank you for asking about my progress.
Great, thanks for the update.
Yes, in the new annotation mode there is the option to annotate long answers (in contrast to short SQuAD style answers). There is also a new feature to upload a CSV with questions that can be linked to specific documents.
Looking forward to further updates from your side.
Hello @Timoeller, just to give you an update on the long-answer issue above: cleaning the documents and proper annotation is indeed helping. For a set of 250 questions from 8 documents, I am achieving 79% top-1 accuracy and 47% top-1 Exact Match with the Elasticsearch retriever and a fine-tuned Roberta model, which is significantly higher than the pre-trained results. I also tried the same annotation file for fine-tuning a BERT-large-uncased pre-trained model; in this case, with the Dense Passage Retriever and the fine-tuned BERT-large-uncased model, I am achieving 78% top-1 accuracy, but this time the top-1 Exact Match score improved to 68% (a significant improvement over the previous case). However, this combination of the fine-tuned BERT-large-uncased model and DPR takes a lot of computation time. In a nutshell, cleaning the documents and annotation do help in improving the results; I still need to investigate the low EM score for the Roberta-based fine-tuned model.
Wow, that is impressive: 68% EM on out-of-domain data. Do you also have F1 scores?
So the total performance comes from two parts, the retriever and the reader:
- ES retriever + Roberta-base reader gives 47% EM
- DPR retriever + BERT-large reader gives 67% EM
Did you also try the smaller Roberta model with DPR, so you can narrow down where the performance difference is coming from? Apart from this, I would also suggest manually inspecting your 250 labels + predictions for both settings. Maybe the BERT-large reader's answers differ only slightly, but with a large effect on EM.
Yes, for DPR + BERT-large-uncased, the top-1 F1 score is 75.3% and the top-k F1 score is 89.4%.
The summary of the three Finder evaluation combinations is listed below.
Fine-tuned Model | Top-1 accuracy | Top-k accuracy | Top-1 EM | Top-k EM | Top-1 F1 | Top-k F1 |
---|---|---|---|---|---|---|
custom_farm_model_Bert_large_uncased | 77.9% | 94.8% | 68.3% | 75.9% | 75.3% | 89.4% |
custom_farm_model_Roberta | 78.7% | 96.4% | 47.0% | 53.4% | 70.8% | 82.5% |
custom_farm_model_distill | 59.4% | 87.1% | 41.4% | 54.2% | 52.8% | 73.0% |
Did you also try the smaller roberta model with DPR, so you can narrow down where the performance diff is coming from?
Yes, I tried DPR with the fine-tuned Roberta model. The only issue was total Finder time: DPR took more retrieval time than ES + the fine-tuned Roberta model, although the difference was only about 4 seconds (total retrieval time). Accuracy, EM, and F1 score were the same in both cases.
Apart from this I would also suggest manually inspecting your 250 labels + predictions for both settings. Maybe the BERT-large reader only has small changes with a large effect on EM.
Thank you, I will check the actual vs. predicted answers, especially for the long answers; that should give a clearer picture of the large difference in EM score.
Nice, thanks for sharing this with the community! I will also forward the analysis to our engineers.
About BERT-large vs. Roberta: the numbers are a bit strange, e.g. top-1 accuracy being higher for Roberta while EM is so much lower. I suspect both models are just returning slightly longer/shorter answers, and the BERT-large answer lengths seem to fit your use case better. Looking forward to the analysis of the actual predictions vs. labels.
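For background on why answer length hits EM so hard: EM and F1 are usually computed SQuAD-style, roughly like the sketch below (a simplified version, not the exact FARM evaluation code). A predicted span that is a single word longer than the label drops EM to zero while F1 stays high.

```python
# Simplified SQuAD-style EM and F1 (illustrative, not the exact FARM code).
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, label):
    return float(normalize(prediction) == normalize(label))

def f1(prediction, label):
    pred_tokens = normalize(prediction).split()
    label_tokens = normalize(label).split()
    common = Counter(pred_tokens) & Counter(label_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(label_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("delayed acoustic echo", "acoustic echo"))    # 0.0
print(round(f1("delayed acoustic echo", "acoustic echo"), 2))   # 0.8
```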
Hello @Timoeller, I evaluated the questions with long and short answers separately; the tables for both evaluations follow. Previously, I evaluated all 250 question-answer pairs using the fine-tuned Roberta and fine-tuned BERT-large models. In the set of 250 QA pairs, 108 questions have a short answer and 142 have a long answer.
For reference, here is the evaluation over all 250 questions again.
Fine-tuned Model + Retriever | Top-1 accuracy | Top-k accuracy | Top-1 EM | Top-k EM | Top-1 F1 | Top-k F1 |
---|---|---|---|---|---|---|
custom_farm_model_Bert_large_uncased + DPR | 77.9% | 94.8% | 68.3% | 75.9% | 75.3% | 89.4% |
custom_farm_model_Roberta + ElasticSearch | 78.7% | 96.4% | 47.0% | 53.4% | 70.8% | 82.5% |
Note: the fine-tuned BERT-large model with DPR missed answering 13 questions (13 by the reader, 0 by the retriever), and the fine-tuned Roberta model missed answering 18 questions (17 by the reader, 1 by the retriever).
Evaluation of questions with a short answer:
Fine-tuned Model + Retriever | Top-1 accuracy | Top-k accuracy | Top-1 EM | Top-k EM | Top-1 F1 | Top-k F1 |
---|---|---|---|---|---|---|
custom_farm_model_Bert_large_uncased + DPR | 95.3% | 100.0% | 87.9% | 92.5% | 93.0% | 97.7% |
custom_farm_model_Roberta+ElasticSearch | 90.7% | 98.1% | 64.5% | 69.2% | 85.1% | 91.4% |
Note: the fine-tuned Roberta model missed answering 2 questions, while the fine-tuned BERT-large model missed none. Neither retriever missed any question.
Evaluation of questions with a long answer:
Fine-tuned Model + Retriever | Top-1 accuracy | Top-k accuracy | Top-1 EM | Top-k EM | Top-1 F1 | Top-k F1 |
---|---|---|---|---|---|---|
custom_farm_model_Bert_large_uncased + DPR | 64.1% | 90.1% | 52.8% | 62.7% | 61.2% | 82.5% |
custom_farm_model_Roberta+ElasticSearch | 69.0% | 94.4% | 33.1% | 40.8% | 59.3% | 75.1% |
Note: the fine-tuned BERT-large model with DPR missed answering 14 questions (by the reader) and the fine-tuned Roberta model missed answering 8 questions (by the reader). Again, neither retriever missed any question.
To your previous question ("About bert-large vs roberta. The numbers are a bit strange, e.g. top-1 accuracy being higher for roberta but EM is so much smaller"): I will try to put down a summary below.
I am working on cleaning more documents and annotating them; this will increase the size of the training/validation dataset, after which I can see how these two models perform on a bigger training set.
Also, I would like to know what the options are for improving the Finder's computation time. Are there any hyperparameters I should consider, apart from top_k_retriever and top_k_reader, that would decrease the reading time in the Finder? Currently the custom Roberta model + Elasticsearch takes 7 seconds and the custom BERT-large-uncased + DPR takes 20+ seconds on a single-GPU Google Colab system.
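For reference, this is roughly how I construct the reader; the commented parameters are the ones I understand can influence reading time (a sketch with Haystack 0.x-era FARMReader argument names, which may differ slightly between versions; the values are only illustrative and the model name is a placeholder):

```python
# Sketch of the reader construction with parameters that can influence
# inference speed (Haystack 0.x-era FARMReader; argument names may differ
# slightly between versions, values are illustrative).
from haystack.reader.farm import FARMReader

reader = FARMReader(
    model_name_or_path="my_custom_roberta_model",  # placeholder for my fine-tuned model
    use_gpu=True,      # keep inference on the Colab GPU
    max_seq_len=256,   # shorter sequences mean less work per passage
    doc_stride=128,    # overlap between the sliding passage windows
    batch_size=50,     # passages scored per forward pass
)
```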
Thanks for a really stimulating call. I really liked your explanations of the use case. The approach and numbers are really great, so we might be able to write and publish some content about this work. Looking forward to seeing this happen.
I will close the issue for now since the original question seems resolved. Feel free to reopen or open other issues if needed.
Question
I am currently working with a custom text dataset of 45 documents for open-domain question answering. I am using the Dense Passage Retriever, Elasticsearch Retriever, Embedding Retriever, and TF-IDF Retriever with a FARM-based Roberta2 model as the Reader. With these 4 retriever-reader combinations (on a Colab GPU and an Elasticsearch document store), I tried a few questions such as:
- Q1) What is echo?
- Q2) What is the definition of echo?
- Q3) What is meant by echo?
- Q4) What means echo?
For each of these 4 different but closely related questions, I am getting different predicted answers (meaning they refer to different contexts).
Also, in all cases, top_k_reader = 5 and top_k_retriever = 3 were used.
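For reference, the setup looks roughly like this (a sketch assuming a Haystack 0.x-era API; import paths, model names, and the DPR encoder arguments may differ from my exact code, and the Embedding and TF-IDF retrievers are constructed analogously):

```python
# Sketch of the retriever/reader setup (Haystack 0.x-era API; import paths
# and constructor arguments vary between versions, so treat as illustrative).
from haystack import Finder
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.reader.farm import FARMReader
from haystack.retriever.dense import DensePassageRetriever
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.utils import print_answers

document_store = ElasticsearchDocumentStore(host="localhost", index="document")

retrievers = {
    "es": ElasticsearchRetriever(document_store=document_store),
    # DPR encoder argument names depend on the Haystack version:
    "dpr": DensePassageRetriever(
        document_store=document_store,
        query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
        passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    ),
    # EmbeddingRetriever and TfidfRetriever are set up in the same way.
}

# Placeholder model name; in my case this is the FARM-based Roberta2 reader.
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

questions = [
    "What is echo?",
    "What is the definition of echo?",
    "What is meant by echo?",
    "What means echo?",
]

for name, retriever in retrievers.items():
    finder = Finder(reader=reader, retriever=retriever)
    for question in questions:
        prediction = finder.get_answers(question=question, top_k_retriever=3, top_k_reader=5)
        print(name, "|", question)
        print_answers(prediction, details="minimum")
```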
It would be much appreciated if anyone could suggest recommendations or methods to deal with this issue.