alontalmor / MultiQA


How can I predict on my own dataset? #16

Open p-null opened 5 years ago

p-null commented 5 years ago

Suppose I have a document and a question, I'd like to get the answer span and answer string.

What steps should I take to get what I want?

I tried to format it in the MultiQA format, like this:

js_obj = [{
    "id": "HotpotQA_5a85ea095542994775f606a8",
    "context": {
        "documents": [{"text": "passage_sentences"}]
    },
    "qas": ["question_sentence?"]
}]

I then dumped it to test.gz and ran prediction with:

python predict.py --model https://multiqa.s3.amazonaws.com/models/BERTBase/SQuAD1-1.tar.gz --dataset test.gz --dataset_name SQuAD --cuda_device 0
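
For completeness, this is roughly how I write that object to test.gz. I am assuming the expected file layout is gzip-compressed JSON lines with one context object per line (that layout is my guess; there may also be a header line):

import gzip
import json

# The example context from above.
js_obj = [{
    "id": "HotpotQA_5a85ea095542994775f606a8",
    "context": {"documents": [{"text": "passage_sentences"}]},
    "qas": ["question_sentence?"],
}]

# One JSON object per line, gzip-compressed (my guess at the .jsonl.gz
# layout that build_dataset.py produces).
with gzip.open("test.gz", "wt", encoding="utf-8") as f:
    for context in js_obj:
        f.write(json.dumps(context) + "\n")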

alontalmor commented 5 years ago

You would need to save your dataset in the MultiQA format. The format is described in the datasets readme (https://github.com/alontalmor/MultiQA/tree/master/datasets), which also comes with a JSON-schema checker for the output you produce. I think the fastest approach is to copy the code for one of the datasets that is close to yours, say SQuAD1.1, make the changes you need, and build your dataset with:

python build_dataset.py --dataset_name MyDataset --split train --output_file path/to/output.jsonl.gz --n_processes 10

(as described in the main readme)
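
If it helps, a quick way to sanity-check each context object you produce is to run it through a standard JSON-schema validator. The sketch below uses the jsonschema package; the schema file path is only a placeholder for whichever schema file sits in the datasets directory:

import gzip
import json

import jsonschema  # pip install jsonschema

# Placeholder path -- point this at the schema file in the datasets directory.
with open("datasets/multiqa_schema.json") as f:
    schema = json.load(f)

with gzip.open("path/to/output.jsonl.gz", "rt", encoding="utf-8") as f:
    # If the file starts with a header line, skip it before validating.
    for line in f:
        # Raises jsonschema.ValidationError on the first context object
        # that does not match the expected format.
        jsonschema.validate(instance=json.loads(line), schema=schema)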

Hope this helps.

p-null commented 5 years ago

Thanks for the info. I followed the MultiQA format to build the dataset. It seems that predict.py also calls the evaluation function, while we usually don't have gold labels for a test dataset. I got the following error when running prediction on my own dataset; I think it is caused by calling the evaluation function.

  0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "predict.py", line 110, in <module>
    predict(args)
  File "predict.py", line 38, in predict
    curr_pred, full_predictions = predictor.predict_json(context)
  File "/content/MultiQA/models/multiqa_predictor.py", line 27, in predict_json
    min(offset+20, len(question_instances))])
  File "/usr/local/lib/python3.6/dist-packages/allennlp/predictors/predictor.py", line 213, in predict_batch_instance
    outputs = self._model.forward_on_instances(instances)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/models/model.py", line 153, in forward_on_instances
    outputs = self.decode(self(**model_input))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/MultiQA/models/multiqa_bert.py", line 195, in forward
    f1_score = squad_eval.metric_max_over_ground_truths(squad_eval.f1_score, best_span_string, gold_answer_texts)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/tools/squad_eval.py", line 52, in metric_max_over_ground_truths
    return max(scores_for_ground_truths)
ValueError: max() arg is an empty sequence

It's good to have evaluation metrics in the evaluate command, but we usually don't have gold labels in test data. The data I have only contains the passage, the question id, and the question, so I filled fields like answers with "" and did not provide span information.
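
For what it's worth, a possible workaround (my sketch based on the traceback above, not the repo's actual code) would be to guard the metric call in models/multiqa_bert.py so scoring is skipped when an example has no gold answers:

# Around line 195 of models/multiqa_bert.py (per the traceback above).
# Only compute F1 when gold answers exist; unlabeled test data would
# otherwise hit max() on an empty sequence inside squad_eval.
if gold_answer_texts:
    f1_score = squad_eval.metric_max_over_ground_truths(
        squad_eval.f1_score, best_span_string, gold_answer_texts)
else:
    f1_score = 0.0  # placeholder value when no labels are available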