facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Reproducing Evaluation Score for Pretrained WMT'14 English to French Transformer #1411

Closed samsontmr closed 2 years ago

samsontmr commented 4 years ago

Hi! I tried running fairseq-generate to evaluate transformer.wmt14.en-fr on the WMT'14 test set, but was only able to get a BLEU score of 35.42. I ran prepare-wmt14en2fr.sh and fairseq-preprocess on the data beforehand as well. Could you share the command for evaluating the Transformer EN-FR WMT'14 model?

Here is what I'm using:

fairseq-generate data-bin/wmt14_en_fr/ \
    --path checkpoints/wmt14.en-fr.joined-dict.transformer/model.pt \
    --batch-size 256 --beam 4 --remove-bpe --lenpen 0.6

I tried a beam of 5 as well but it didn't give much better results.

I also got this message even though the file is tokenized:

WARNING:root:That's 100 lines that end in a tokenized period ('.')
WARNING:root:It looks like you forgot to detokenize your test data, which may hurt your score.
WARNING:root:If you insist your data is detokenized, or don't care, you can suppress this message with '--force'.

Thanks!

munael commented 4 years ago

There's no --force for fairseq-train. How to silence that warning should also be documented (although, to be honest, it ought to silence itself after the first alert anyway, IMHO...).
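
For anyone who mainly wants the warning gone: it comes from sacrebleu, which fairseq calls internally for scoring, and the sacrebleu CLI does accept --force (as the message itself says); fairseq just doesn't expose it. A minimal sketch of silencing it by scoring outside fairseq, assuming the hypotheses have already been extracted from the generate output into a hypothetical one-sentence-per-line file test.fr.hyp:

# --force tells sacrebleu to score the text as given and skip the detokenization check
cat test.fr.hyp | sacrebleu -t wmt14 -l en-fr --force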

abhishek0318 commented 3 years ago

I am also trying to reproduce the score of the transformer.wmt14.en-fr model, but I am not able to.

Here is the script I use.

#!/usr/bin/env bash

# download the data
data=wmt14_en_fr
mkdir -p $data
sacrebleu -t wmt14 -l en-fr --echo src > $data/test.raw.en
sacrebleu -t wmt14 -l en-fr --echo ref > $data/test.raw.fr

split=test
model=wmt14.en-fr.joined-dict.transformer
src=en
tgt=fr
maxtokens=16000

git clone https://github.com/moses-smt/mosesdecoder.git

set -e

SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl

# echo "normalising punctuations and tokenizing the data"
cat $data/$split.raw.$src | $NORM_PUNC $src | $REM_NON_PRINT_CHAR | $TOKENIZER -threads 8 -a -q -l $src > $data/$split.$src.tok

# echo "applying bpe"
subword-nmt apply-bpe -c $model/bpecodes < $data/$split.$src.tok > $data/$split.$src

# echo "converting into binary form"
fairseq-preprocess --${split}pref $data/$split --destdir data-bin/$data --srcdict $model/dict.$src.txt --tgtdict $model/dict.$tgt.txt --workers 8 -s $src -t $tgt --only-source

# echo "copying dictionary"
cp $model/dict.$tgt.txt data-bin/$data/

echo "generating hypothesis"
fairseq-generate data-bin/$data/ --path $model/model.pt --skip-invalid-size-inputs-valid-test \
--max-tokens $maxtokens --remove-bpe --gen-subset $split --beam 4 --lenpen 0.6 | tee $data/$split.$src.out

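# extract the hypotheses from the generate log, restore the original sentence order, then detokenize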
grep ^H $data/$split.$src.out | cut -c3- | sort -nk1 | cut -f3 | ./mosesdecoder/scripts/tokenizer/detokenizer.perl -q > $data/$split.$src.hyp

cat $data/$split.$src.hyp | sacrebleu -t wmt14 -l $src-$tgt
# BLEU+case.mixed+lang.en-fr+numrefs.1+smooth.exp+test.wmt14+tok.13a+version.1.5.1 = 35.6 62.7/41.7/29.4/20.9 (BP = 1.000 ratio = 1.047 hyp_len = 80924 ref_len = 77306)

As you can see, I get a BLEU score of 35.6, similar to what @samsontmr reported, but the paper reports 41.4 BLEU.
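
As a sanity check on the tokenized side, the hypotheses and references can also be pulled out of the same generate log and scored with fairseq-score. A rough sketch, reusing the file names from my script above; note this gives tokenized BLEU (essentially the number fairseq-generate prints at the end of the log), not detokenized sacrebleu:

# H-* lines carry the hypotheses and T-* lines the references, printed in matching order
grep ^H $data/$split.$src.out | cut -f3- > $data/$split.$src.sys
grep ^T $data/$split.$src.out | cut -f2- > $data/$split.$src.ref
fairseq-score --sys $data/$split.$src.sys --ref $data/$split.$src.ref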

@myleott, could you point out what I am doing wrong and how I can get closer to the reported score? Thanks!

abhishek0318 commented 3 years ago

cc @edunov @michaelauli

abhishek0318 commented 3 years ago

@samsontmr Were you able to reproduce the results?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!