alontalmor / MultiQA


Evaluation is done on a small subset? #19

Closed danyaljj closed 4 years ago

danyaljj commented 4 years ago

I have trained my model on SQuAD1-1 and now I am trying to evaluate it with the following command:

> python multiqa.py evaluate --model model --datasets SQuAD1-1,NewsQA --cuda_device 0  --models_dir  /net/nfs.corp/aristo/danielk/MultiQA/models/SQuAD1-1/

Here is the output that I can see:

. 
. 
. 
2020-01-20 15:29:07,779 - INFO - allennlp.common.params - validation_dataset_reader.support_cannotanswer = False
2020-01-20 15:29:08,158 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /home/danielk/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
2020-01-20 15:29:08,189 - INFO - __main__ - Reading evaluation data from https://multiqa.s3.amazonaws.com/data/SQuAD1-1_dev.jsonl.gz
2020-01-20 15:29:08,189 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.iterators.data_iterator.DataIterator'> from params {'batch_size': 6, 'max_instances_in_memory': 5000, 'type': 'basic'} and extras set()
2020-01-20 15:29:08,189 - INFO - allennlp.common.params - validation_iterator.type = basic
2020-01-20 15:29:08,190 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.iterators.basic_iterator.BasicIterator'> from params {'batch_size': 6, 'max_instances_in_memory': 5000} and extras set()
2020-01-20 15:29:08,190 - INFO - allennlp.common.params - validation_iterator.batch_size = 6
2020-01-20 15:29:08,190 - INFO - allennlp.common.params - validation_iterator.instances_per_epoch = None
2020-01-20 15:29:08,190 - INFO - allennlp.common.params - validation_iterator.max_instances_in_memory = 5000
2020-01-20 15:29:08,190 - INFO - allennlp.common.params - validation_iterator.cache_instances = False
2020-01-20 15:29:08,190 - INFO - allennlp.common.params - validation_iterator.track_epoch = False
2020-01-20 15:29:08,190 - INFO - allennlp.common.params - validation_iterator.maximum_samples_per_batch = None
2020-01-20 15:29:08,190 - INFO - allennlp.training.util - Iterating over dataset
  0%|          | 0/1 [00:00<?, ?it/s]2020-01-20 15:29:08,624 - INFO - models.multiqa_reader - Total number of processed questions for SQuAD is 13
EM: 84.62, f1: 90.77, loss: 1.14 ||: : 3it [00:00,  5.36it/s]
2020-01-20 15:29:08,751 - INFO - __main__ - Finished evaluating SQuAD1-1
2020-01-20 15:29:08,751 - INFO - __main__ - Metrics:
2020-01-20 15:29:08,751 - INFO - __main__ - EM: 84.61538461538461
2020-01-20 15:29:08,751 - INFO - __main__ - f1: 90.76923076923077
2020-01-20 15:29:08,751 - INFO - __main__ - loss: 1.1399052043755848
2020-01-20 15:29:08,765 - INFO - __main__ - Reading evaluation data from https://multiqa.s3.amazonaws.com/data/NewsQA_dev.jsonl.gz
2020-01-20 15:29:08,765 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.iterators.data_iterator.DataIterator'> from params {'batch_size': 6, 'max_instances_in_memory': 5000, 'type': 'basic'} and extras set()
2020-01-20 15:29:08,765 - INFO - allennlp.common.params - validation_iterator.type = basic
2020-01-20 15:29:08,765 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.iterators.basic_iterator.BasicIterator'> from params {'batch_size': 6, 'max_instances_in_memory': 5000} and extras set()
2020-01-20 15:29:08,765 - INFO - allennlp.common.params - validation_iterator.batch_size = 6
2020-01-20 15:29:08,765 - INFO - allennlp.common.params - validation_iterator.instances_per_epoch = None
2020-01-20 15:29:08,765 - INFO - allennlp.common.params - validation_iterator.max_instances_in_memory = 5000
2020-01-20 15:29:08,765 - INFO - allennlp.common.params - validation_iterator.cache_instances = False
2020-01-20 15:29:08,765 - INFO - allennlp.common.params - validation_iterator.track_epoch = False
2020-01-20 15:29:08,765 - INFO - allennlp.common.params - validation_iterator.maximum_samples_per_batch = None
2020-01-20 15:29:08,766 - INFO - allennlp.training.util - Iterating over dataset
  0%|          | 0/1 [00:00<?, ?it/s]2020-01-20 15:29:09,087 - INFO - models.multiqa_reader - Total number of processed questions for NewsQA is 15
EM: 17.95, f1: 34.07, loss: 8.59 ||: : 7it [00:00,  8.55it/s]
2020-01-20 15:29:09,585 - INFO - __main__ - Finished evaluating NewsQA
2020-01-20 15:29:09,585 - INFO - __main__ - Metrics:
2020-01-20 15:29:09,585 - INFO - __main__ - EM: 17.94871794871795
2020-01-20 15:29:09,585 - INFO - __main__ - f1: 34.068154068154065
2020-01-20 15:29:09,585 - INFO - __main__ - loss: 8.591846244675773
2020-01-20 15:29:09,588 - INFO - allennlp.models.archival - removing temporary unarchived model dir at /tmp/tmp_j617_kh

Everything looks good, except that the evaluation seems to be done only on a small subset. In particular, the log lines that say:

2020-01-20 15:29:09,087 - INFO - models.multiqa_reader - Total number of processed questions for NewsQA is 15

and

2020-01-20 15:29:08,624 - INFO - models.multiqa_reader - Total number of processed questions for SQuAD is 13
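
As a quick sanity check (my own arithmetic and a throwaway script, not output from multiqa.py): 11/13 gives exactly the SQuAD EM logged above, which is consistent with only 13 questions being scored, while counting the questions in the downloaded dev file directly gives a far larger number. The snippet assumes the MultiQA jsonl format in which each line is a JSON record with a `qas` list; the exact schema may differ.

```python
import gzip
import json
import urllib.request

# 1. The logged SQuAD EM of 84.615... is exactly 11 correct answers out of 13.
print(100 * 11 / 13)  # ~84.615..., matching the EM reported above

# 2. Count how many questions the full dev file actually contains.
#    Assumes each jsonl line is a JSON object and question entries live in a
#    "qas" list (lines without one, e.g. a header, contribute zero).
URL = "https://multiqa.s3.amazonaws.com/data/SQuAD1-1_dev.jsonl.gz"
with urllib.request.urlopen(URL) as resp:
    raw = gzip.decompress(resp.read()).decode("utf-8")

total = sum(len(json.loads(line).get("qas", []))
            for line in raw.splitlines() if line.strip())
print(total)  # expected to be in the thousands, not 13
```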

Any thoughts on this?

alontalmor commented 4 years ago

You haven't specified a config file, so it used the default, models/MultiQA_BERTBase.jsonnet (you may want to change this). There was a bug in that config: it sampled 10 examples from the eval set instead of not sampling at all. I've fixed this and pushed it.
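
For anyone hitting the same symptom before updating, here is a schematic illustration of the behaviour being described. This is not the actual MultiQA reader code; the parameter name `sample_size` and its sentinel value are made up for illustration only.

```python
import random

def select_eval_examples(examples, sample_size):
    # Intended behaviour: a missing or negative sample_size means
    # "evaluate on everything"; the buggy config effectively left a small
    # positive sample size in place, so only a handful of questions were read.
    if sample_size is None or sample_size < 0:
        return examples
    return random.sample(examples, min(sample_size, len(examples)))

dev = list(range(10570))                                 # roughly the size of SQuAD1.1 dev
print(len(select_eval_examples(dev, sample_size=None)))  # 10570: full evaluation
print(len(select_eval_examples(dev, sample_size=10)))    # 10: the buggy behaviour
```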