allenai / document-qa

Discrepancy between EM/F1 scores logged during training and those output from squad_eval.py #14

Closed bdhingra closed 6 years ago

bdhingra commented 6 years ago

Hi,

Thanks for open-sourcing your code!

I noticed a discrepancy between the EM and F1 scores logged during training and those computed when evaluating the model separately with docqa/eval/squad_eval.py. The difference is large at the beginning of training but becomes small by the end. It would be very helpful if you could explain where the difference comes from and, more importantly, which scores are the "correct" ones.
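For reference, the EM and F1 numbers I am talking about are the standard SQuAD answer-string metrics. Here is a minimal sketch of how they are computed, mirroring the official SQuAD evaluation logic rather than this repo's code (the official script additionally takes the max over all gold answers for a question):

```python
# Sketch of SQuAD-style exact match (EM) and token-level F1 between a
# predicted answer string and a single gold answer string.
import re
import string
from collections import Counter

def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    return float(normalize(prediction) == normalize(ground_truth))

def f1(prediction, ground_truth):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```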

A couple of disclaimers before I describe in more detail:

  1. I am running Python 3.4, not >= 3.5.
  2. I am using the json package rather than the ujson package in read_data.py.

Unfortunately, my compute environment does not allow me to change either of these to check whether the problem persists with Python 3.5 and ujson. However, the code runs fine, and I believe the problem lies elsewhere. Please correct me if I am wrong, though.

So I am running the paragraph setting on SQuAD as follows:

python3.4 docqa/scripts/ablate_squad.py paragraph output/squad

And I am evaluating the output checkpoints as follows:

python3.4 docqa/eval/squad_eval.py -o output/squad-1205-140719/dev-output.json -c dev output/squad-1205-140719/ -s <checkpoint_number>

The output scores I see on Tensorboard and from the evaluation script are as follows:

| Update | Tensorboard Acc | Tensorboard F1 | Tensorboard text-EM | Tensorboard text-F1 | squad_eval.py Acc | squad_eval.py F1 | squad_eval.py text-EM | squad_eval.py text-F1 |
|---|---|---|---|---|---|---|---|---|
| 1200 | 0.4685 | 0.5886 | 0.4896 | 0.6080 | 0.1755 | 0.2726 | 0.1853 | 0.3044 |
| 2400 | 0.5479 | 0.6610 | 0.5700 | 0.6798 | 0.4572 | 0.5786 | 0.4781 | 0.5992 |
| 3600 | 0.5647 | 0.6790 | 0.5886 | 0.6966 | 0.5506 | 0.6688 | 0.5746 | 0.6861 |
| 4800 | 0.5951 | 0.7077 | 0.6192 | 0.7239 | 0.5842 | 0.6980 | 0.6094 | 0.7154 |
| 10800 | 0.6377 | 0.7501 | 0.6675 | 0.7667 | 0.6437 | 0.7508 | 0.6707 | 0.7662 |

As you can see, the squad_eval.py scores are much lower than the Tensorboard scores at first, but they catch up around update 5000 and later even become slightly better.

I guess my main questions are:

  1. Does this happen in your setup? If not, it probably has something to do with Python 3.4 vs. 3.5.
  2. If it does happen in your setup, can you point to why? And which performance is the correct one?

The reason I am interested in the initial performance is that I am running some experiments with only 10% of the SQuAD training set. In that setting there is a big difference between the performance logged during training and that reported by the evaluation script, similar to the top rows of the table above.

Thanks a lot for your time!

Bhuwan

chrisc36 commented 6 years ago

I think that is because we are using exponential moving average (EMA) weights. At test time we use the EMA of the weights, but the dev scores computed during the training cycle just use the current weights, hence the difference. If you run the eval script with --no_ema, you should see the same scores.
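For anyone else who hits this, here is a minimal sketch of the mechanism, assuming a TF 1.x-style tf.train.ExponentialMovingAverage; the optimizer, decay value, and function names below are illustrative placeholders, not the exact document-qa code:

```python
# Sketch: training keeps a shadow (EMA) copy of the weights on the side;
# evaluation restores those shadow values instead of the raw weights.
import tensorflow as tf

def build_train_op(loss, learning_rate=1.0, ema_decay=0.999):
    opt = tf.train.AdadeltaOptimizer(learning_rate)
    step_op = opt.apply_gradients(opt.compute_gradients(loss))

    # The dev scores logged to Tensorboard during training are computed with
    # the raw, current weights; the shadow copies are only updated on the side.
    ema = tf.train.ExponentialMovingAverage(decay=ema_decay)
    with tf.control_dependencies([step_op]):
        train_op = tf.group(ema.apply(tf.trainable_variables()))
    return train_op, ema

def build_eval_saver(ema):
    # The eval path restores each variable from its EMA shadow value instead
    # of its raw value (this is presumably what --no_ema turns off). Early in
    # training the shadow values still lag the raw weights by something on the
    # order of 1 / (1 - decay) updates, which is why the squad_eval.py numbers
    # start far below the Tensorboard curves and only catch up later.
    return tf.train.Saver(ema.variables_to_restore())
```

Once the average has converged, the EMA weights typically score the same or slightly better than the raw weights, which matches the last rows of the table above.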

bdhingra commented 6 years ago

Indeed, that is the case. There is still a small difference between the two (for example, 0.680580080475 vs. 0.68047), but it is close enough. Thanks for your help!