facebookresearch / DrQA

Reading Wikipedia to Answer Open-Domain Questions

reader training json key error #183

Closed bung87 closed 5 years ago

bung87 commented 5 years ago

--train-file expects a processed txt file formatted as JSON lines.

DrQA/drqa/reader/utils.py

load_data expects each JSON object to have an "answers" key when skip_no_answer = True

but scripts/convert/squad.py produces JSON objects with an "answer" key
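
For illustration, here is a minimal sketch of the mismatch. It is not DrQA's actual code: filter_answerable is a hypothetical stand-in for the filtering step, and both sample records are made up to mimic the rough shape of each file.

```python
import json

# Hypothetical one-line samples; the real files hold one JSON object per line.
converted_line = json.dumps(
    {"question": "Who wrote Hamlet?", "answer": ["Shakespeare"]}
)
processed_line = json.dumps({
    "question": ["Who", "wrote", "Hamlet", "?"],
    "document": ["Hamlet", "was", "written", "by", "Shakespeare", "."],
    "answers": [[4, 4]],  # assumed: token start/end offsets into "document"
})


def filter_answerable(lines, skip_no_answer=True):
    """Parse JSON lines and optionally drop examples with no resolved answer."""
    examples = [json.loads(line) for line in lines]
    if skip_no_answer:
        # Raises KeyError on records that only carry an "answer" key.
        examples = [ex for ex in examples if len(ex["answers"]) > 0]
    return examples


print(len(filter_answerable([processed_line])))  # 1

try:
    filter_answerable([converted_line])
except KeyError as err:
    print("KeyError:", err)  # KeyError: 'answers'
```

With skip_no_answer=False the sketch never touches the "answers" key, which is why the error only shows up on the training path.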

ajfisch commented 5 years ago

The scripts/convert/squad.py produces a different file type than what you're looking at. That script puts the SQuAD questions in the expected format for the open-domain setting.

load_data is a Document Reader utility function, and is expecting a dataset in a preprocessed SQuAD format, where the text answer has been converted into token start and end offsets.

Because the token boundaries produced by the tokenizer don't always match up to the exact text span, there are sometimes questions that the preprocessing script cannot resolve.
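
A rough sketch of why a character span may fail to resolve, assuming a plain whitespace tokenizer instead of the CoreNLP tokenizer used for the processed file. The helpers tokenize_with_offsets and find_token_span are hypothetical names for this example only.

```python
import re


def tokenize_with_offsets(text):
    """Split on whitespace and return (token, start_char, end_char) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def find_token_span(offsets, char_start, char_end):
    """Return (start_token, end_token) if both character boundaries coincide
    with token boundaries, otherwise None."""
    starts = [i for i, (_, s, _) in enumerate(offsets) if s == char_start]
    ends = [i for i, (_, _, e) in enumerate(offsets) if e == char_end]
    if len(starts) == 1 and len(ends) == 1:
        return starts[0], ends[0]
    return None


text = "The tower is 300m tall."
tokens = tokenize_with_offsets(text)

# "tower" lines up with a whole token, so it resolves to token index 1.
start = text.index("tower")
print(find_token_span(tokens, start, start + len("tower")))  # (1, 1)

# "300" ends inside the token "300m", so no token span matches.
start = text.index("300")
print(find_token_span(tokens, start, start + len("300")))  # None
```

An example whose answer comes back as None here would end up with an empty answer list, which is the case skip_no_answer is meant to filter out.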

During training this is important, because we need a specific token to use for the loss function. For evaluation, we don't care, because we will output a span anyway (which unfortunately will at best get only a partial F1 score) -- and this is handled by the evaluation scripts.

bung87 commented 5 years ago

So the train file is not the SQuAD-v1.1-train.txt obtained through download.sh? I see the default param is SQuAD-v1.1-train-processed-corenlp.txt

ajfisch commented 5 years ago

Yes -- see the instructions in the reader readme.

bung87 commented 5 years ago

Ah, I see, it needs preprocessing after converting. Thanks for your help!