cdqa-suite / cdQA

⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
https://cdqa-suite.github.io/cdQA-website/
Apache License 2.0

Retriever Vectorizer is the bottleneck : Improving over TF-IDF & BM25 vectorizers #344

Open raghavgurbaxani opened 4 years ago

raghavgurbaxani commented 4 years ago

Hi, thanks for the great work on this project. This is a very helpful library for closed-domain Q&A. That said, my experiments suggest that the retriever's performance is the bottleneck (reader performance is pretty good).

After investigating the code and studying the architecture, the retriever does seem to be the bottleneck.

The BERT model is only invoked after the initial candidates have been retrieved by TF-IDF. So if TF-IDF or BM25 misses the correct candidate paragraphs, the BERT model will miss the right answer as well. This suggests that the BERT model is completely dependent on the accuracy of the vectorizers.
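
To illustrate the dependency, here is a minimal sketch of a lexical top-k retriever in plain Python. It is not cdQA's actual code: the `tokenize`, `build_tfidf`, and `retrieve` helpers, the IDF smoothing, and the example paragraphs are all assumptions made up for this illustration. The point is only that a reader model would see nothing beyond the indices returned by `retrieve`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_tfidf(docs):
    """Return (doc_vectors, idf) using a smoothed IDF (an assumed variant)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(question, docs, top_k=1):
    """Return indices of the top_k paragraphs by TF-IDF cosine similarity."""
    vecs, idf = build_tfidf(docs)
    # Question terms that never occur in the corpus carry no weight at all.
    q = {t: tf * idf[t]
         for t, tf in Counter(tokenize(question)).items() if t in idf}
    scores = [cosine(q, v) for v in vecs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hypothetical corpus, invented for this example.
paragraphs = [
    "The firm was founded in 1998 by two engineers.",
    "Employees may purchase company shares at a discount.",
    "The cafeteria serves lunch from noon.",
]

lexical_hit = retrieve("When was the firm founded?", paragraphs, top_k=1)
paraphrase_miss = retrieve("Can staff buy stock cheaply?", paragraphs, top_k=1)
```

The second question is answered by paragraph 1, but it shares no tokens with it, so every score is zero and paragraph 1 never reaches the candidate list. Whatever reader runs afterwards cannot recover from that.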

Do you have any thoughts on how to improve retriever accuracy, for example by using deep-learning-based information retrieval (maybe sentence-similarity-based metrics)? Any suggestions for more advanced vectorizers?
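
To make the sentence-similarity idea concrete, here is a rough sketch of ranking paragraphs by embedding similarity with a pluggable encoder. Everything in it is hypothetical: `rank_by_similarity` is not part of cdQA, and `toy_embed` (character trigram counts) is only a stand-in so the block runs without extra dependencies. In practice the `embed` argument would be a trained sentence encoder.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Placeholder encoder: character trigram counts.

    NOT a real sentence encoder; swap in a trained model in practice.
    """
    s = f"  {text.lower()} "
    return dict(Counter(s[i:i + 3] for i in range(len(s) - 2)))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_by_similarity(
    question: str,
    paragraphs: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    top_k: int = 3,
) -> List[int]:
    """Return indices of the top_k paragraphs most similar to the question."""
    q = embed(question)
    scored = [(cosine(q, embed(p)), i) for i, p in enumerate(paragraphs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_k]]


# Hypothetical corpus, invented for this example.
paragraphs = [
    "The firm was founded in 1998 by two engineers.",
    "Employees may purchase company shares at a discount.",
    "The cafeteria serves lunch from noon.",
]

best = rank_by_similarity("purchasing shares", paragraphs, top_k=1)
```

Keeping the encoder as a parameter means the ranking logic stays the same whether the vectors come from TF-IDF, BM25-style weights, or a neural model. The same function could also be used to re-rank a larger candidate set from the existing retriever instead of replacing it.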

Thanks. :)