Retriever vectorizer is the bottleneck: improving over TF-IDF & BM25 vectorizers

Hi, thanks for the great work on this project. It is a very helpful library for closed-domain Q&A. That said, my experiments suggest that the retriever's performance is the bottleneck (reader performance is pretty good).

After investigating the code and studying the architecture, I found that the design seems to confirm that the retriever is the bottleneck.

The BERT model is only invoked after the initial candidates have been retrieved with TF-IDF or BM25. So if TF-IDF or BM25 misses the correct candidate paragraphs, the BERT model will miss the right answer as well. This seems to indicate that the BERT model is completely dependent on the accuracy of the vectorizers.
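To make the dependency concrete, here is a minimal sketch of what I mean. It is not cdQA's code: it is a toy TF-IDF retriever written from scratch, and all names (`retrieve`, `retrieval_recall_at_k`, the three example paragraphs) are made up for illustration. The point is only that the reader can never answer a question whose gold paragraph is not among the top-k candidates, so end-to-end accuracy is bounded above by retrieval recall@k.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only (toy tokenizer, an assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return (one sparse TF-IDF dict per document, idf table)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed idf; one of several common variants.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]
    return vecs, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, docs, top_k=1):
    """Indices of the top_k paragraphs by TF-IDF cosine similarity."""
    vecs, idf = tfidf_vectors(docs)
    # Query terms unseen in the corpus carry no weight at all.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items()
         if t in idf}
    scores = [cosine(q, v) for v in vecs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


def retrieval_recall_at_k(examples, docs, top_k):
    """Fraction of questions whose gold paragraph is among the candidates.

    A reader that only sees these candidates cannot exceed this number.
    """
    hits = sum(1 for query, gold in examples
               if gold in retrieve(query, docs, top_k))
    return hits / len(examples)


docs = [
    "The warranty covers repairs for two years after purchase.",
    "Customers can purchase items online or in the store.",
    "Refunds are issued within ten days of a return request.",
]
examples = [
    ("How long does the warranty cover repairs", 0),  # lexical overlap
    ("When will I get my money back", 2),             # paraphrase, no overlap
]

print(retrieval_recall_at_k(examples, docs, top_k=1))
```

The second question is a paraphrase of the refund paragraph with no shared terms, so a purely lexical vectorizer gives it a zero score and the reader never sees the right paragraph, no matter how good the reader is.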

Do you have any thoughts on how to improve retriever accuracy, for example by using deep-learning-based information retrieval (maybe sentence-similarity-based metrics)? Any suggestions on more advanced vectorizers?

Thanks. :)

cdqa-suite / cdQA
