facebookresearch / DrQA

Reading Wikipedia to Answer Open-Domain Questions

Reproducing Exact Match accuracy for CuratedTrec and WikiMovies #146

Closed (kolk closed this issue 6 years ago)

kolk commented 6 years ago

Hi, I tried to reproduce the exact match accuracy for CuratedTrec (19.7%) and WikiMovies (24.5%) listed in Table 6 of the paper. With the single model trained on SQuAD, I get 5.04% on CuratedTrec and 6.37% on WikiMovies. The steps I followed are:

python scripts/pipeline/predict.py data/datasets/CuratedTrec-test.txt --out-dir out_pipeline/ --reader-model models/squad/20180528-9275f860.mdl --retriever-model data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --doc-db data/wikipedia/docs.db --embedding-file data/embeddings/glove.840B.300d.txt --tokenizer spacy --batch-size 3

python scripts/pipeline/eval.py data/datasets/CuratedTrec-test.txt out_pipeline/CuratedTrec-test-20180528-9275f860-pipeline.preds
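
For reference, the dataset files passed to these scripts are plain text with one JSON-encoded QA pair per line. A minimal sketch (standard library only; path as in the commands above) to peek at the first few pairs:

import json
from itertools import islice

# Each line of a DrQA dataset file is a JSON object with a "question"
# string and an "answer" list of acceptable answers.
with open('data/datasets/CuratedTrec-test.txt') as f:
    for line in islice(f, 3):
        qa = json.loads(line)
        print(qa['question'], '->', qa['answer'])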

The pipeline module with the default batch size of 128 did not fit in 62 GB of RAM, so I used a batch size of 3. Is there a mistake in my understanding of the steps needed to reproduce the reported results? Please help.

ajfisch commented 6 years ago

Hi,

A couple of things:

I trained a model on SQuAD using the spaCy tokenizer:

python scripts/reader/train.py --tune-partial 1000 --use-pos f --use-ner f --use-lemma f --train-file SQuAD-v1.1-train-processed-spacy.txt --dev-file SQuAD-v1.1-dev-processed-spacy.txt

This gets EM = 68.0 and F1 = 77.5 (trained with the CoreNLP tokenizer it gets 68.4/78.1).

Running on CuratedTREC:

python scripts/pipeline/predict.py data/datasets/CuratedTrec-test.txt --reader-model /tmp/drqa-models/20180529-d3caf05f.mdl --embedding-file data/embeddings/glove.840B.300d.txt --tokenizer spacy
python scripts/pipeline/eval.py data/datasets/CuratedTrec-test.txt /tmp/CuratedTrec-test-20180529-d3caf05f-pipeline.preds --regex

--------------------------------------------------
Dataset: data/datasets/CuratedTrec-test.txt
Predictions: /tmp/CuratedTrec-test-20180529-d3caf05f-pipeline.preds
{'exact_match': 20.605187319884728}
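
Note that CuratedTrec gold answers are regular expressions rather than literal strings, so plain string equality undercounts; --regex switches eval.py to pattern matching. A minimal sketch of the idea (my own illustration, not necessarily the exact implementation in eval.py):

import re

def regex_match(prediction, pattern):
    # Score a prediction against a gold regex pattern; malformed
    # patterns count as non-matches instead of crashing the eval.
    try:
        compiled = re.compile(pattern, flags=re.IGNORECASE | re.UNICODE)
    except re.error:
        return False
    return compiled.match(prediction) is not None

# e.g. regex_match('William Clinton', r'(Bill|William) Clinton') -> True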

Running on WikiMovies:

python scripts/pipeline/predict.py data/datasets/WikiMovies-test.txt --reader-model /tmp/drqa-models/20180529-d3caf05f.mdl --embedding-file data/embeddings/glove.840B.300d.txt --candidate-file data/datasets/WikiMovies-entities.txt --tokenizer spacy
python scripts/pipeline/eval.py data/datasets/WikiMovies-test.txt /tmp/WikiMovies-test-20180529-d3caf05f-pipeline.preds

--------------------------------------------------
Dataset: data/datasets/WikiMovies-test.txt
Predictions: /tmp/WikiMovies-test-20180529-d3caf05f-pipeline.preds
{'exact_match': 24.035369774919616}
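
Here --candidate-file restricts the reader to answer spans found in a closed candidate set (the WikiMovies entity list, one entity per line), which prunes spurious spans. A rough sketch of that filtering step, using a hypothetical normalize() helper (the matching inside DrQA may differ):

import unicodedata

def normalize(text):
    # Hypothetical normalizer: Unicode-normalize and lowercase so that
    # surface variants of the same entity compare equal.
    return unicodedata.normalize('NFD', text).lower()

# Load the closed set of allowed answers, one entity per line.
with open('data/datasets/WikiMovies-entities.txt') as f:
    candidates = {normalize(line.strip()) for line in f}

def is_valid_span(span):
    # Keep only predicted spans that occur in the candidate set.
    return normalize(span) in candidates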
kolk commented 6 years ago

Thank you for the quick reply. The --regex flag gave the expected results. With --candidate-file (for WikiMovies) and --regex (for CuratedTrec), the exact_match score is 19.3083573487032 for CuratedTrec and 24.3870578778135 for WikiMovies.