cdqa-suite / cdQA

⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
https://cdqa-suite.github.io/cdQA-website/
Apache License 2.0

No module named 'cdqa.reader.reader_sklearn' #237

Closed catch-n-release closed 5 years ago

catch-n-release commented 5 years ago

When trying to use XLNet model in place of BERT I face the following error. Could someone please help me with this?

cdqa_pipeline = QAPipeline(reader='./models/xlnet_cased_vCPU.joblib', max_df=1.0)

fmikaelian commented 5 years ago

Hi @SuyashSrivastavaDel

You are facing this error because the XLNet implementation for cdQA is not ready yet. It is being developed in the sync-huggingface branch. You can follow our progress on this PR.

You can still use cdQA with BERT in the meantime.

catch-n-release commented 5 years ago

Hey, thanks for replying @fmikaelian. Could you help me with a few other questions?

  1. How many Q&A pairs per paragraph (custom dataset) would I need to train on top of the pre-trained BERT models?

  2. Would the hyperparameters change when retraining the mentioned model?

  3. Also, if I want to train on the HotpotQA dataset, would you recommend training the model from scratch or training on the given pre-trained BERT model?

andrelmfarias commented 5 years ago

Hi @SuyashSrivastavaDel,

I will answer based on my own knowledge and opinions. @fmikaelian, feel free to correct me or add anything.

  1. We don't have a fixed number for this, as we only trained the model on SQuAD 1.1 in our experiments. SQuAD 1.1 has 100k QA pairs and we were able to train the model on it, so I would say you will need at least around 1k-10k pairs in total. If your dataset is as small as 1k, I would recommend training on SQuAD first (as per our tutorials) and then doing a second training pass on your data. Regarding the quantity per paragraph, I would say 4+ questions if the paragraph is long and 2 or 3 if it is short.

  2. It depends on the hyperparameter. You can fine-tune some training hyperparameters, such as the learning rate, the number of epochs, and the batch size, as well as some conditions on the data, such as the maximum sequence lengths for the paragraph, the question, and the answer. The overall structure of the model won't change, however: you cannot tune the number of layers, for example.

  3. I would recommend using the pre-trained model.
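
The tunable-vs-fixed distinction in answer 2 can be summed up in a small sketch. The parameter names below are illustrative placeholders mirroring common BERT fine-tuning knobs, not necessarily cdQA's exact argument names:

```python
# Hypothetical fine-tuning configuration for a BERT-style reader.
# These names mirror common BERT fine-tuning knobs; they are illustrative,
# not guaranteed to match cdQA's actual arguments.
finetune_config = {
    "learning_rate": 3e-5,     # typical BERT fine-tuning range: 2e-5 to 5e-5
    "num_train_epochs": 2,     # a few epochs suffice when starting from SQuAD weights
    "train_batch_size": 12,
    "max_seq_length": 384,     # cap on paragraph + question tokens
    "max_query_length": 64,    # cap on question tokens
    "max_answer_length": 30,   # cap on the predicted answer span
}

# Architectural choices are fixed by the pre-trained checkpoint and are
# deliberately absent from the config, e.g. the number of layers:
fixed_by_checkpoint = ["num_hidden_layers", "hidden_size", "num_attention_heads"]
```

The point of the split is that everything in `finetune_config` can vary between training runs, while anything in `fixed_by_checkpoint` comes with the pre-trained weights and cannot be tuned.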

catch-n-release commented 5 years ago

Hey @andrelmfarias, thanks for replying. Could you help me a little further with this?

  1. How big of a document corpus can the TF-IDF retriever handle? A rough estimate would do.
  2. Is there a way to get top 5 or say top 10 closest answers along with the predicted answer? If not, what do you suggest I do to get them?
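
For context on question 1: the retrieval mechanism at the heart of a TF-IDF retriever can be sketched standalone with sklearn. This is a toy illustration, not cdQA's actual retriever code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the document paragraphs.
paragraphs = [
    "BERT is a transformer model pre-trained on large text corpora.",
    "TF-IDF weighs terms by frequency and inverse document frequency.",
    "The retriever selects the paragraphs most similar to the question.",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(paragraphs)  # sparse matrix, shape (n_docs, n_terms)

def retrieve(question, top_k=1):
    """Return indices of the top_k paragraphs most similar to the question."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, matrix).ravel()
    return scores.argsort()[::-1][:top_k].tolist()
```

Because the vectorizer produces a sparse matrix, the practical limit on corpus size is mainly the memory needed to hold it, which is why the retriever scales to fairly large document collections.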
andrelmfarias commented 5 years ago
  1. We did not try to test the limits of the TF-IDF retriever, but I suppose it can handle a very large corpus. We are using the TF-IDF vectorizer from sklearn, and in my experience it handles large corpora well. If you ever push it to its limits, could you let us know the results?

  2. Currently, it's not possible to do this in cdQA directly... You would have to modify the function write_predictions: https://github.com/cdqa-suite/cdQA/blob/cff2d4404953d58880a2b5a3fcca52941a3e53cb/cdqa/reader/bertqa_sklearn.py#L454

by returning final_predictions_sorted: https://github.com/cdqa-suite/cdQA/blob/cff2d4404953d58880a2b5a3fcca52941a3e53cb/cdqa/reader/bertqa_sklearn.py#L639

You would also have to modify the .predict() method of BertQA: https://github.com/cdqa-suite/cdQA/blob/cff2d4404953d58880a2b5a3fcca52941a3e53cb/cdqa/reader/bertqa_sklearn.py#L1242

If ever you are interested in implementing it as an option when using BertQA for predictions, feel free to do a PR 😃
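
The suggested change boils down to returning the sorted candidate list instead of only its head. A standalone sketch of that idea, with hypothetical data rather than the actual write_predictions code:

```python
from collections import OrderedDict

def keep_n_best(candidates, n_best_size=5):
    """Rank candidate (answer_text, score) pairs and keep the top n_best_size.

    Stands in for what write_predictions computes internally as
    final_predictions_sorted; the real code also handles de-tokenisation
    and null-answer logic.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return OrderedDict(ranked[:n_best_size])

# Toy candidate spans with confidence scores.
candidates = [("in 1998", 0.61), ("1998", 0.87), ("around 1998", 0.12)]
top2 = keep_n_best(candidates, n_best_size=2)

# The first key is the single answer that .predict() currently returns alone.
best_answer = next(iter(top2))
```

Exposing the whole ordered dict instead of just `best_answer` is essentially the feature being requested.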

catch-n-release commented 5 years ago

@andrelmfarias Thanks for replying, I am working on it. Will come back to you with more questions asap.

JimAva commented 5 years ago

Hi @SuyashSrivastavaDel and @andrelmfarias - any update on the multi-result feature?

Thank you.

catch-n-release commented 5 years ago

Hey @JimAva, my fork of the repo has a branch named feature/n_best_predictions. On that branch, cdqa_pipeline.predict(X=question) returns the final prediction together with an ordered dictionary of the top n predictions:

prediction, n_best_predictions = cdqa_pipeline.predict(X=question)

where n_best_predictions = {answer1: [title1, paragraph1], answer2: [title2, paragraph2], ...}, ranked from the most relevant answer to the least. Here n equals the n_best_size parameter passed in.

Hey @andrelmfarias, I am raising a PR for this as you suggested. :):) Please check it and let me know if any corrections are needed.
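
Assuming the fork behaves as described above (the shapes below copy the format stated in the comment; they are stand-ins, not a verified API), consuming the result might look like:

```python
from collections import OrderedDict

# Stand-ins for what the feature/n_best_predictions branch is described as
# returning from: prediction, n_best_predictions = cdqa_pipeline.predict(X=question)
prediction = "1998"
n_best_predictions = OrderedDict([
    ("1998", ["Company history", "The company was founded in 1998."]),
    ("in 1998", ["Company history", "The company was founded in 1998."]),
])

# Entries are ranked best-first; the top entry matches the single prediction.
lines = []
for rank, (answer, (title, paragraph)) in enumerate(n_best_predictions.items(), start=1):
    lines.append(f"{rank}. {answer} (from '{title}')")
```

An OrderedDict keyed by answer text preserves the best-first ranking while still giving direct lookup of each answer's source title and paragraph.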