cdqa-suite / cdQA

⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
https://cdqa-suite.github.io/cdQA-website/
Apache License 2.0
615 stars, 191 forks

IndexError while fine tuning model #340

Open raghavgurbaxani opened 4 years ago

raghavgurbaxani commented 4 years ago

Hi,

I am trying to fine-tune the BERT model on my custom dataset. I generated the JSON file from the PDF using df2squad, fed it to the cdQA annotator, added several question-answer pairs, and generated the new JSON file.

However, when I try to fine-tune using cdqa_pipeline.fit_reader('cdqa-v1.1.json'), I get the following error:

python3.6/site-packages/cdqa/reader/bertqa_sklearn.py", line 190, in read_squad_examples
    answer_offset + answer_length - 1
IndexError: list index out of range

I also tried the fit_transform('cdqa-v1.1.json') method using BertProcessor and still get the same error.

Any idea what the problem could be?

fmikaelian commented 4 years ago

It might be a problem with the length of your answers in the JSON dataset. For example, do you have empty answers?

raghavgurbaxani commented 4 years ago

Hi, here's part of the JSON file (generated by the annotator):

{"question":"how to install Ethernet connector","id":"a9c80a82-04d6-4ef4-9726-f292816f2bcf","answers":[{"answer_start":-1,"text":"Procedure Insert the metal plate"

The answer isn't empty. Is there anything wrong with the format?

I also tried shortening the answer further and generated another JSON file. Now I get this error:

Could not find answer: '' vs. 'Insert the metal plate'
Traceback (most recent call last):
  File "temp.py", line 28, in <module>
    cdqa_pipeline.fit_reader('2.json') #cdqa-v1.1.json
/lib/python3.6/site-packages/cdqa/reader/bertqa_sklearn.py", line 1291, in fit
    train_sampler = RandomSampler(train_data)
/python3.6/site-packages/torch/utils/data/sampler.py", line 94, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

:(

tianpaul01 commented 4 years ago

Hi. Maybe you should remove that one; I have run into this problem before. The annotator only works for datasets like SQuAD v1.1: each answer must be a span selected directly from the paragraph, not text typed into the answer box.
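
One way to find such entries before calling fit_reader is to scan the annotated file for answers that are not exact spans of their paragraph. The following is only a minimal sketch, assuming the file follows the standard SQuAD v1.1 layout (data, paragraphs, context, qas, answers). The helper names find_bad_answers and load_and_check are hypothetical and are not part of cdQA.

```python
import json


def find_bad_answers(squad):
    """Return (question_id, reason) pairs for answers that are not
    exact spans of their paragraph's context.

    Hypothetical helper, not part of cdQA. Assumes SQuAD v1.1 layout.
    """
    problems = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa.get("answers", []):
                    text = answer["text"]
                    start = answer["answer_start"]
                    if not text:
                        problems.append((qa["id"], "empty answer text"))
                    elif start < 0:
                        # Seen in this thread: "answer_start": -1
                        problems.append((qa["id"], "negative answer_start"))
                    elif context[start:start + len(text)] != text:
                        problems.append(
                            (qa["id"], "text does not match context at answer_start")
                        )
    return problems


def load_and_check(path):
    """Load a SQuAD-style JSON file and report suspicious answers."""
    with open(path, encoding="utf-8") as f:
        return find_bad_answers(json.load(f))


if __name__ == "__main__":
    # Tiny in-memory sample; the ids and context are made up for illustration.
    sample = {
        "data": [{
            "title": "demo",
            "paragraphs": [{
                "context": "Insert the metal plate into the slot.",
                "qas": [
                    {"id": "q1", "question": "What do I insert?",
                     "answers": [{"answer_start": 0,
                                  "text": "Insert the metal plate"}]},
                    {"id": "q2", "question": "How to install it?",
                     "answers": [{"answer_start": -1,
                                  "text": "Procedure Insert the metal plate"}]},
                ],
            }],
        }]
    }
    print(find_bad_answers(sample))
```

Any question id reported this way can then be removed or re-annotated by selecting the answer directly in the paragraph, so that answer_start points at a real position in the context.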