OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

How to reproduce the result on WMT14 DE-EN? #2003

Closed Yuran-Zhao closed 3 years ago

Yuran-Zhao commented 3 years ago

I've been trying to reproduce the result on WMT14 DE-EN. According to the paper "Attention is All You Need", the Transformer model should achieve 27.3 BLEU on newstest2014, but I only got 17.2 with sacrebleu.

I had a look at #637 and #1862. However, some commands and parameters were removed in OpenNMT-py 2.0, which left me a little confused.

The commands I used are as follows:

1. Prepare the data

I used the script here https://github.com/OpenNMT/OpenNMT-py/blob/master/examples/scripts/prepare_wmt_data.sh

#!/bin/bash

##################################################################################
# The default script downloads the commoncrawl, europarl and newstest2014 and
# newstest2017 datasets. Files that are not English or German are removed in
# this script for tidiness. You may switch datasets out depending on task.
# (Note that commoncrawl europarl-v7 are the same for all tasks).
# http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
# http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz
#
# WMT14 http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz
# WMT15 http://www.statmt.org/wmt15/training-parallel-nc-v10.tgz
# WMT16 http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz
# WMT17 http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz
# Note: there is very little difference, but each year added a few sentences
# new WMT17 http://data.statmt.org/wmt17/translation-task/rapid2016.tgz
#
# For WMT16 Rico Sennrich released some News back translation
# http://data.statmt.org/rsennrich/wmt16_backtranslations/en-de/
#
# Tests sets: http://data.statmt.org/wmt17/translation-task/test.tgz
##################################################################################

# provide script usage instructions
if [ $# -eq 0 ]
then
    echo "usage: $0 <data_dir>"
    exit 1
fi

# set relevant paths
SP_PATH=/usr/local/bin
DATA_PATH=$1
TEST_PATH=$DATA_PATH/test

CUR_DIR=$(pwd)

# set vocabulary size and source and target languages
vocab_size=32000
sl=de
tl=en

# Download the default datasets into the $DATA_PATH; mkdir if it doesn't exist
mkdir -p $DATA_PATH
cd $DATA_PATH

echo "Downloading and extracting Commoncrawl data (919 MB) for training..."
wget --trust-server-names http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
tar zxvf training-parallel-commoncrawl.tgz
ls | grep -v 'commoncrawl.de-en.[de,en]' | xargs rm

echo "Downloading and extracting Europarl data (658 MB) for training..."
wget --trust-server-names http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz
tar zxvf training-parallel-europarl-v7.tgz
cd training && ls | grep -v 'europarl-v7.de-en.[de,en]' | xargs rm
cd .. && mv training/europarl* . && rm -r training training-parallel-europarl-v7.tgz

echo "Downloading and extracting News Commentary data (76 MB) for training..."
wget --trust-server-names http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz
tar zxvf training-parallel-nc-v11.tgz
cd training-parallel-nc-v11 && ls | grep -v news-commentary-v11.de-en.[de,en] | xargs rm
cd .. && mv training-parallel-nc-v11/* . && rm -r training-parallel-nc-v11 training-parallel-nc-v11.tgz

# Validation and test data are put into the $DATA_PATH/test folder
echo "Downloading and extracting newstest2014 data (4 MB) for validation..."
wget --trust-server-names http://www.statmt.org/wmt14/test-filtered.tgz
echo "Downloading and extracting newstest2017 data (5 MB) for testing..."
wget --trust-server-names http://data.statmt.org/wmt17/translation-task/test.tgz
tar zxvf test-filtered.tgz && tar zxvf test.tgz
cd test && ls | grep -v '.*deen\|.*ende' | xargs rm
cd .. && rm test-filtered.tgz test.tgz && cd ..

# set training, validation, and test corpuses
corpus[1]=commoncrawl.de-en
corpus[2]=europarl-v7.de-en
corpus[3]=news-commentary-v11.de-en
#corpus[3]=news-commentary-v12.de-en
#corpus[4]=news.bt.en-de
#corpus[5]=rapid2016.de-en

validset=newstest2014-deen
testset=newstest2017-ende

cd $CUR_DIR

# retrieve file preparation script from the Moses repository
wget -nc \
        https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/ems/support/input-from-sgm.perl \
        -O $TEST_PATH/input-from-sgm.perl

##################################################################################
# Starting from here, original files are supposed to be in $DATA_PATH
# a data folder will be created in scripts/wmt
##################################################################################

export PATH=$SP_PATH:$PATH

# Data preparation using SentencePiece
# First we concat all the datasets to train the SP model
if true; then
 echo "$0: Training sentencepiece model"
 rm -f $DATA_PATH/train.txt
 for ((i=1; i<= ${#corpus[@]}; i++))
 do
  for f in $DATA_PATH/${corpus[$i]}.$sl $DATA_PATH/${corpus[$i]}.$tl
   do
    cat $f >> $DATA_PATH/train.txt
   done
 done
 spm_train --input=$DATA_PATH/train.txt --model_prefix=$DATA_PATH/wmt$sl$tl \
           --vocab_size=$vocab_size --character_coverage=1
 rm $DATA_PATH/train.txt
fi

# Second we use the trained model to tokenize all the files
# This is not necessary, as it can be done on the fly in OpenNMT-py 2.0
# if false; then
#  echo "$0: Tokenizing with sentencepiece model"
#  rm -f $DATA_PATH/train.txt
#  for ((i=1; i<= ${#corpus[@]}; i++))
#  do
#   for f in $DATA_PATH/${corpus[$i]}.$sl $DATA_PATH/${corpus[$i]}.$tl
#    do
#     file=$(basename $f)
#     spm_encode --model=$DATA_PATH/wmt$sl$tl.model < $f > $DATA_PATH/$file.sp
#    done
#  done
# fi

# We concat the training sets into two (src/tgt) tokenized files
# if false; then
#  cat $DATA_PATH/*.$sl.sp > $DATA_PATH/train.$sl
#  cat $DATA_PATH/*.$tl.sp > $DATA_PATH/train.$tl
# fi

#  We use the same tokenization method for a valid set (and test set)
# if true; then
#  perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$validset-src.$sl.sgm \
#     | spm_encode --model=$DATA_PATH/wmt$sl$tl.model > $DATA_PATH/valid.$sl.sp
#  perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$validset-ref.$tl.sgm \
#     | spm_encode --model=$DATA_PATH/wmt$sl$tl.model > $DATA_PATH/valid.$tl.sp
#  perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$testset-src.$sl.sgm \
#     | spm_encode --model=$DATA_PATH/wmt$sl$tl.model > $DATA_PATH/test.$sl.sp
#  perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$testset-ref.$tl.sgm \
#     | spm_encode --model=$DATA_PATH/wmt$sl$tl.model > $DATA_PATH/test.$tl.sp
# fi

# Parse the valid and test sets
if true; then
 perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$validset-src.$sl.sgm \
    > $DATA_PATH/valid.$sl
 perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$validset-ref.$tl.sgm \
    > $DATA_PATH/valid.$tl
 perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$testset-ref.$sl.sgm \
    > $DATA_PATH/test.$sl
 perl $TEST_PATH/input-from-sgm.perl < $TEST_PATH/$testset-src.$tl.sgm \
    > $DATA_PATH/test.$tl
fi

And the command is ./prepare_wmt_data.sh ../../onmt_data/wmt14-de-en.
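After the cleanup and extraction steps above, it can be worth sanity-checking that each corpus pair is still parallel (same number of lines on the .de and .en sides) before building the vocab. A minimal sketch; the paths and the `check_parallel` helper are illustrative, not part of the repo:

```python
from pathlib import Path


def check_parallel(src_path, tgt_path):
    """Return (src_lines, tgt_lines) and warn if the counts differ."""
    with open(src_path, encoding="utf-8") as f:
        n_src = sum(1 for _ in f)
    with open(tgt_path, encoding="utf-8") as f:
        n_tgt = sum(1 for _ in f)
    if n_src != n_tgt:
        print(f"MISMATCH: {src_path} has {n_src} lines, {tgt_path} has {n_tgt}")
    return n_src, n_tgt


if __name__ == "__main__":
    # hypothetical data dir matching the command above
    for stem in ["commoncrawl.de-en", "europarl-v7.de-en", "news-commentary-v11.de-en"]:
        src, tgt = Path(f"{stem}.de"), Path(f"{stem}.en")
        if src.exists() and tgt.exists():
            check_parallel(src, tgt)
```

A mismatch here would silently corrupt training pairs, so it is cheap insurance before the long training run.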

2. Build the vocabulary

onmt_build_vocab -config wmt14-de-en.yml -n_sample -1
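Here `-n_sample -1` tells onmt_build_vocab to count tokens over the full corpus rather than a sample. A quick way to confirm the resulting vocab actually covers the expected ~32k entries is to parse it directly; this assumes the common one `token<TAB>count` entry per line layout, and `load_vocab` is just an illustrative helper:

```python
import os


def load_vocab(path):
    """Parse a vocab file, assuming one 'token<TAB>count' entry per line."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, _, count = line.rstrip("\n").partition("\t")
            if token:
                vocab[token] = int(count) if count else 0
    return vocab


if __name__ == "__main__":
    # hypothetical path matching the config below
    path = "run/example.vocab.src"
    if os.path.exists(path):
        print(len(load_vocab(path)), "entries")
```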

3. Train the model

python train.py --config ./examples/scripts/wmt14-de-en.yml

The contents in the wmt14-de-en.yml are:

save_data: /OpenNMT-py/onmt_data/wmt14-de-en/run/example
# Where the vocab(s) will be written
src_vocab: /OpenNMT-py/onmt_data/wmt14-de-en/run/example.vocab.src
tgt_vocab: /OpenNMT-py/onmt_data/wmt14-de-en/run/example.vocab.src

# Corpus opts:
data:
    commoncrawl:
        path_src: /OpenNMT-py/onmt_data/wmt14-de-en/commoncrawl.de-en.de
        path_tgt: /OpenNMT-py/onmt_data/wmt14-de-en/commoncrawl.de-en.en
        transforms: [sentencepiece, filtertoolong]
        weight: 23
    europarl:
        path_src: /OpenNMT-py/onmt_data/wmt14-de-en/europarl-v7.de-en.de
        path_tgt: /OpenNMT-py/onmt_data/wmt14-de-en/europarl-v7.de-en.en
        transforms: [sentencepiece, filtertoolong]
        weight: 19
    news_commentary:
        path_src: /OpenNMT-py/onmt_data/wmt14-de-en/news-commentary-v11.de-en.de
        path_tgt: /OpenNMT-py/onmt_data/wmt14-de-en/news-commentary-v11.de-en.en
        transforms: [sentencepiece, filtertoolong]
        weight: 3
    valid:
        path_src: /OpenNMT-py/onmt_data/wmt14-de-en/valid.de
        path_tgt: /OpenNMT-py/onmt_data/wmt14-de-en/valid.en
        transforms: [sentencepiece]

### Transform related opts:
#### Subword
src_subword_model: /OpenNMT-py/onmt_data/wmt14-de-en/wmtdeen.model
tgt_subword_model: /OpenNMT-py/onmt_data/wmt14-de-en/wmtdeen.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0

#### Filter
src_seq_length: 100
tgt_seq_length: 100

# silently ignore empty lines in the data
skip_empty_level: silent

# # Vocab opts
# ### vocab:
src_vocab: /OpenNMT-py/onmt_data/wmt14-de-en/run/example.vocab.src
tgt_vocab: /OpenNMT-py/onmt_data/wmt14-de-en/run/example.vocab.src
src_vocab_size: 32000
tgt_vocab_size: 32000
vocab_size_multiple: 8
# src_words_min_frequency: 1
# tgt_words_min_frequency: 1
share_vocab: True

# # Model training parameters

# General opts
save_model: ./onmt_data/wmt14-de-en/run/model
keep_checkpoint: 50
save_checkpoint_steps: 10000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 200000
valid_steps: 10000

# Batching
queue_size: 10000
bucket_size: 32768
# pool_factor: 8192
world_size: 2
gpu_ranks: [0, 1]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2.0
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
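One thing worth checking against the paper is the effective batch size these options imply: with token-based batching, tokens per optimizer update is batch_size times the number of GPUs times the gradient-accumulation count. A quick calculation (plain arithmetic, no OpenNMT code):

```python
def effective_tokens_per_update(batch_size, world_size, accum_count):
    """Tokens consumed per optimizer update with token-based batching."""
    return batch_size * world_size * accum_count


# With the config above: 4096 tokens x 2 GPUs x accum_count 3
print(effective_tokens_per_update(4096, 2, 3))  # 24576
```

That is close to the ~25,000 source/target tokens per batch used in "Attention is All You Need", so the batching itself should not explain a 10-BLEU gap.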

4. Translate and evaluate

spm_encode --model=/OpenNMT-py/onmt_data/wmt14-de-en/wmtdeen.model \
        < /OpenNMT-py/onmt_data/wmt14-de-en/test.en \
        > /OpenNMT-py/onmt_data/wmt14-de-en/test.en.sp
spm_encode --model=/OpenNMT-py/onmt_data/wmt14-de-en/wmtdeen.model \
        < /OpenNMT-py/onmt_data/wmt14-de-en/test.de \
        > /OpenNMT-py/onmt_data/wmt14-de-en/test.de.sp

for checkpoint in /OpenNMT-py/onmt_data/wmt14-de-en/run/model_step*.pt; do
        echo "# Translating with checkpoint $checkpoint"
        base=$(basename $checkpoint)
        python ../../translate.py \
                -gpu 0 \
                -batch_size 16384 -batch_type tokens \
                -beam_size 5 \
                -model $checkpoint \
                -src /OpenNMT-py/onmt_data/wmt14-de-en/test.de.sp \
                -tgt /OpenNMT-py/onmt_data/wmt14-de-en/test.en.sp \
                -output /OpenNMT-py/onmt_data/wmt14-de-en/test.en.hyp_${base%.*}.sp
done

for checkpoint in /OpenNMT-py/onmt_data/wmt14-de-en/run/model_step*.pt; do
        base=$(basename $checkpoint)
        spm_decode \
                -model=/OpenNMT-py/onmt_data/wmt14-de-en/wmtdeen.model \
                -input_format=piece \
                < /OpenNMT-py/onmt_data/wmt14-de-en/test.en.hyp_${base%.*}.sp \
                > /OpenNMT-py/onmt_data/wmt14-de-en/test.en.hyp_${base%.*}
done
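As a sanity check on the detokenization step (a mismatch here directly deflates BLEU), the core of what spm_decode does for the default piece format can be sketched in a few lines, assuming the standard SentencePiece convention where U+2581 marks a word boundary:

```python
def decode_pieces(pieces):
    """Join SentencePiece pieces back into plain text.

    Assumes the standard convention: concatenate the pieces, then turn
    the U+2581 word-boundary marker into a space. Real spm_decode also
    handles special tokens, which this sketch ignores.
    """
    return "".join(pieces).replace("\u2581", " ").strip()


print(decode_pieces(["\u2581Hello", "\u2581wor", "ld", "!"]))  # Hello world!
```

If the hypothesis files still contain "\u2581" after decoding, the wrong model file or input format was used.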

for checkpoint in /OpenNMT-py/onmt_data/wmt14-de-en/run/model_step*.pt; do
        echo "$checkpoint"
        base=$(basename $checkpoint)
        sacrebleu /OpenNMT-py/onmt_data/wmt14-de-en/test.en < /OpenNMT-py/onmt_data/wmt14-de-en/test.en.hyp_${base%.*}
done

Is there something wrong?

patrick-nanys commented 3 years ago

Have there been any advancements on this? I'm having the same problem.

Yuran-Zhao commented 3 years ago

> Have there been any advancements on this? I'm having the same problem.

I used the OpenNMT-py 1.x version to reproduce the result successfully. If you are interested, you can find it in my repository, along with a tutorial for reproducing it.

chijianlei commented 3 years ago

I have run into the same problem. Have you solved it in the 2.0 version?

Yuran-Zhao commented 3 years ago

> I have run into the same problem. Have you solved it in the 2.0 version?

No... actually, I want to give up on the 2.0 version :D.

francoishernandez commented 3 years ago

FYI I just ran the 2.0 example from scratch with a fresh install (pip install OpenNMT-py==2.0.1), without touching anything. With the checkpoint at 75k steps, I get 25.7 BLEU on valid (newstest2014) and 27.0 on test (newstest2017). Not sure what's going on with your 17.2; maybe some tokenization issue/mismatch?

chijianlei commented 3 years ago

I have successfully run the 2.0 example now. One issue is that I needed to add the -share_vocab parameter, which the example does not include. I will report the result as soon as I complete the training.

francoishernandez commented 3 years ago

Yes, you're right; I forgot to mention that the vocab options may not be fully up to date in the example. It would be great if you could open a PR to update the example with the working adaptation.