Closed by bung87 5 years ago
The scripts/convert/squad.py script produces a different file type than what you're looking at. That script puts the SQuAD questions in the expected format for the open-domain setting.
load_data is a Document Reader utility function and expects a dataset in the preprocessed SQuAD format, where the text answer has been converted into token start and end offsets.
Because the token boundaries produced by the tokenizer don't always match up to the exact text span, there are sometimes questions that the preprocessing script cannot resolve.
During training this is important, because we need specific tokens to use for the loss function. For evaluation we don't care, because we will output a span anyway (which unfortunately will get at best only a partial F1 score), and this is handled by the evaluation scripts.
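To illustrate why some answers cannot be resolved, here is a minimal sketch of mapping a character-level answer span onto token indices. The regex tokenizer and the helper names (tokenize_with_offsets, find_token_span) are hypothetical stand-ins for illustration, not the tokenizer or code that the DrQA preprocessing script uses.

```python
import re


def tokenize_with_offsets(text):
    """Hypothetical stand-in tokenizer: returns (token, char_start, char_end) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def find_token_span(tokens, char_start, char_end):
    """Return (start_token, end_token) if the character span lines up with
    token boundaries, otherwise None (the answer cannot be resolved)."""
    starts = [i for i, (_, s, _) in enumerate(tokens) if s == char_start]
    ends = [i for i, (_, _, e) in enumerate(tokens) if e == char_end]
    if len(starts) == 1 and len(ends) == 1:
        return starts[0], ends[0]
    return None


context = "The tower is 324m tall."
tokens = tokenize_with_offsets(context)

# "tower" starts and ends exactly on token boundaries.
aligned = find_token_span(tokens, context.find("tower"), context.find("tower") + len("tower"))

# "324" ends in the middle of the token "324m", so no token span matches.
unaligned = find_token_span(tokens, context.find("324"), context.find("324") + len("324"))
```

In this sketch the unaligned case returns None, which corresponds to a question with no usable answer span; such examples would be the ones skipped during training.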
So the train file is not the SQuAD-v1.1-train.txt obtained through download.sh? I see the default param is SQuAD-v1.1-train-processed-corenlp.txt.
Ah, I see, it needs preprocessing after converting. Thanks for your help!
--train-file expects a processed txt file formatted as JSON lines.
In DrQA/drqa/reader/utils.py, load_data expects each JSON object to have an "answers" key when skip_no_answer = True, but scripts/convert/squad.py produces JSON objects with an "answer" key.
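The mismatch can be reproduced with a small sketch. This is a simplified, assumed version of the loading logic, not the actual load_data from DrQA: the function name load_examples and the record contents are hypothetical, and only the "answers" versus "answer" keys come from the report above.

```python
import json


def load_examples(lines, skip_no_answer=False):
    """Simplified sketch: parse JSON lines and optionally drop examples without answers."""
    examples = [json.loads(line) for line in lines if line.strip()]
    if skip_no_answer:
        # Assumes every record carries an "answers" key, as the preprocessed format does.
        examples = [ex for ex in examples if len(ex["answers"]) > 0]
    return examples


# Illustrative records in a preprocessed style: answers are token offset pairs.
processed = [
    '{"question": ["who", "?"], "answers": [[3, 4]]}',
    '{"question": ["what", "?"], "answers": []}',
]

# Illustrative record in the converted open-domain style: text answers under "answer".
converted = ['{"question": "who?", "answer": ["someone"]}']

kept = load_examples(processed, skip_no_answer=True)

try:
    load_examples(converted, skip_no_answer=True)
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]
```

Under these assumptions, the converted file fails with a KeyError on "answers", while the preprocessed file loads and keeps only the example that has an answer span.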