XMUNLP / XMUNMT

An implementation of RNNsearch using TensorFlow
BSD 3-Clause "New" or "Revised" License

What's the exact command/dataset to reproduce a BLEU of 30.42 on test set? #6

Closed: soloice closed this issue 6 years ago

soloice commented 6 years ago

The instructions in readme.md describe how to train an English-to-German translation model and apply it to test data. But how did you evaluate the result?

This is what I did:

src=en
tgt=de

# Merge subwords
sed -r 's/(@@ )|(@@ ?$)//g' $nmt_output_dir/test.txt > $nmt_output_dir/test.merged-subwords.txt

# Detruecase NMT outputs
$moses_scripts/recaser/detruecase.perl < $nmt_output_dir/test.merged-subwords.txt > $nmt_output_dir/test.merged-bpe32k.detc

# Detokenize
$moses_scripts/tokenizer/detokenizer.perl -l $tgt < $nmt_output_dir/test.merged-bpe32k.detc > $nmt_output_dir/test.merged-bpe32k.txt

# Evaluation
# Method II: using mteval
# wrap up outputs with SGML format
$moses_scripts/ems/support/wrap-xml.perl $tgt $test_sgm_dir/newstest2017-$src$tgt-src.$src.sgm < $nmt_output_dir/test.merged-bpe32k.txt > $nmt_output_dir/test.merged-bpe32k.sgm

$moses_scripts/generic/mteval-v14.pl -r $test_sgm_dir/newstest2017-$src$tgt-ref.$tgt.sgm -s $test_sgm_dir/newstest2017-$src$tgt-src.$src.sgm -t $nmt_output_dir/test.merged-bpe32k.sgm > mteval-result.txt
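As a quick check that the subword-merge step behaves as intended, the same `sed` pattern can be applied to a toy BPE-segmented line (the example sentence below is made up for illustration; `-r` is GNU sed's extended-regex flag, BSD sed uses `-E`):

```shell
# "@@ " marks an internal subword boundary; a bare trailing "@@" can
# also appear at end of line. Both are removed by the merge rule.
echo 'das un@@ glaub@@ liche Ergeb@@ nis' \
    | sed -r 's/(@@ )|(@@ ?$)//g'
# -> das unglaubliche Ergebnis
```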

I used the default hyper-parameters to train the model (except for batch_size=80), and got a BLEU of only 22.47:

 Evaluation of any-to-de translation using:
    src set "newstest2017" (130 docs, 3004 segs)
    ref set "newstest2017" (1 refs)
    tst set "newstest2017" (1 systems)

length ratio: 1.01165010524255 (62001/61287), penalty (log): 0
NIST score = 6.6085  BLEU score = 0.2247 for system "Edinburgh"

# ------------------------------------------------------------------------

Individual N-gram scoring
        1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
        ------   ------   ------   ------   ------   ------   ------   ------   ------
 NIST:  5.0732   1.2981   0.2044   0.0291   0.0037   0.0007   0.0001   0.0000   0.0000  "Edinburgh"

 BLEU:  0.5593   0.2821   0.1633   0.0989   0.0616   0.0388   0.0250   0.0164   0.0108  "Edinburgh"

# ------------------------------------------------------------------------
Cumulative N-gram scoring
        1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
        ------   ------   ------   ------   ------   ------   ------   ------   ------
 NIST:  5.0732   6.3713   6.5757   6.6048   6.6085   6.6092   6.6094   6.6094   6.6094  "Edinburgh"

 BLEU:  0.5593   0.3972   0.2954   0.2247   0.1735   0.1352   0.1062   0.0841   0.0669  "Edinburgh"

What could be wrong? According to my experience in NMT, a BLEU score of 30 is rather high for an English-to-German system on newstest2017 data. For example, in this work the English->German system got a BLEU below 26, and the winner of WMT'17 only got a BLEU of 28.3; see http://matrix.statmt.org/.
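As a sanity check on the mteval output above: since the hypothesis is longer than the reference (length ratio 1.01, log penalty 0), the brevity penalty is 1, so the cumulative 4-gram BLEU should reduce to the geometric mean of the four individual n-gram precisions from the table. A quick awk check, using the table's rounded values:

```shell
# Geometric mean of the individual 1..4-gram precisions; the brevity
# penalty is 1 here because the length ratio (62001/61287) exceeds 1.
awk 'BEGIN {
    p = log(0.5593) + log(0.2821) + log(0.1633) + log(0.0989)
    printf "%.4f\n", exp(p / 4)
}'
# -> 0.2247, matching the reported cumulative 4-gram BLEU
```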

soloice commented 6 years ago

Update: another model, trained on 2 GTX 1080 Ti cards with batch_size 64, achieved a BLEU score of 23.09. I guess the result could be better if I trained it longer.

Playinf commented 6 years ago

The benchmark is trained in the De-En direction, which has a higher BLEU score. I think your result is reasonable. This basic RNNsearch model cannot match the results achieved by the WMT winners; their systems usually use additional techniques like back-translation, model ensembling, and deep models.

soloice commented 6 years ago

I see. I have quite a lot of experience training NMT models with Theano, and have tried many more modern techniques such as sequence-level knowledge distillation. I also implemented an encoder-decoder model with an attention mechanism in pure C++ and deployed it on cellphones. I'm just looking for an NMT framework in TensorFlow (because Theano is no longer maintained, lol~) and found this one. So I was shocked by the BLEU score reported in readme.md.

The instructions in readme.md are for training English->German models, but the benchmark result is for German->English, which is inconsistent. I suggest updating it to make it less confusing, e.g. by including my result on En->De translation.

For my experiments, the RNNsearch model trained for 75k steps on 2 GTX 1080 Ti cards with batch_size=64 achieved a BLEU score of 0.2309 and an NIST score of 6.7258. Continuing training for 75k more steps (150k steps in total) leads to a BLEU score of 0.2348 and an NIST score of 6.7717.

soloice commented 6 years ago

And thanks for your help these days. I'm considering reading through all the source code in this repo and doing research based on it. Hopefully I'll reproduce CSGAN-NMT and VNMT and send you a pull request to include these more advanced models in the project.