p-null opened this issue 5 years ago
You would need to save your dataset in the MultiQA format. This format is described in the dataset readme https://github.com/alontalmor/MultiQA/tree/master/datasets, and it also comes with a JSON-schema checker for the output you produce. I think the fastest approach is to copy the code for one of the datasets that is close to yours, say SQuAD1.1, make the changes you need, and build your dataset using:
python build_dataset.py --dataset_name MyDataset --split train --output_file path/to/output.jsonl.gz --n_processes 10
(as described in the main readme)
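For the final output step, here is a minimal Python sketch of writing a gzipped JSON-lines file; the record keys shown are placeholders only, and the real fields and nesting must follow the MultiQA schema in the datasets readme (and should pass its JSON-schema checker):
import gzip
import json

# Placeholder records -- the actual field names and nesting must match the
# MultiQA schema described in the datasets readme; these keys are assumptions.
records = [
    {"id": "q1", "question": "...", "context": "...", "answers": ["..."]},
]

# MultiQA datasets are stored as gzipped JSON-lines (*.jsonl.gz) files.
with gzip.open("path/to/output.jsonl.gz", "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")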
Hope this helps.
Thanks for the info. I followed the MultiQA format to build the dataset. It seems that predict.py also calls the evaluation function, while we usually don't have gold labels for the test dataset. I got the following error when running prediction on my own dataset; I think it is due to calling the evaluation function.
0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "predict.py", line 110, in <module>
    predict(args)
  File "predict.py", line 38, in predict
    curr_pred, full_predictions = predictor.predict_json(context)
  File "/content/MultiQA/models/multiqa_predictor.py", line 27, in predict_json
    min(offset+20, len(question_instances))])
  File "/usr/local/lib/python3.6/dist-packages/allennlp/predictors/predictor.py", line 213, in predict_batch_instance
    outputs = self._model.forward_on_instances(instances)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/models/model.py", line 153, in forward_on_instances
    outputs = self.decode(self(**model_input))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/MultiQA/models/multiqa_bert.py", line 195, in forward
    f1_score = squad_eval.metric_max_over_ground_truths(squad_eval.f1_score, best_span_string, gold_answer_texts)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/tools/squad_eval.py", line 52, in metric_max_over_ground_truths
    return max(scores_for_ground_truths)
ValueError: max() arg is an empty sequence
It's good to have an evaluation metric in the evaluate command, but usually we don't have gold labels in test data.
Because my data only has the passage, question id, and question, I filled fields like answers with "" and did not provide span information.
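One possible workaround, just a sketch assuming the variable names shown in the traceback around models/multiqa_bert.py line 195 (not the maintainers' fix), would be to guard the metric call when no gold answers exist:
# Skip the SQuAD metric when gold answers are missing (unlabeled test data),
# so metric_max_over_ground_truths never calls max() on an empty sequence.
if gold_answer_texts:
    f1_score = squad_eval.metric_max_over_ground_truths(
        squad_eval.f1_score, best_span_string, gold_answer_texts)
else:
    f1_score = 0.0  # no gold labels: skip evaluation, keep only the predicted span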
Suppose I have a document and a question, and I'd like to get the answer span and answer string.
What steps should I take to get what I want?
I tried to format my data in the MultiQA format, dumped it to test.gz, and ran prediction with:
python predict.py --model https://multiqa.s3.amazonaws.com/models/BERTBase/SQuAD1-1.tar.gz --dataset test.gz --dataset_name SQuAD --cuda_device 0