facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Back translation on ENDE does not improve baseline #1051

Closed WZX1998 closed 4 years ago

WZX1998 commented 5 years ago

Hi, we tried to reproduce back-translation on EN->DE following Understanding Back-Translation at Scale, but the BLEU score did not improve much after adding monolingual data. What we did:

  1. Preprocessed the parallel data using the script at https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-wmt14en2de.sh with a 32K BPE vocabulary, resulting in 4.6M parallel sentence pairs.
  2. Trained models in both directions with the transformer-big architecture on 9 RTX 2080 Ti GPUs, using an update frequency of 14 to mimic the 128-GPU setting. Here is our script:

     python train.py data-bin/16-ende --update-freq 14 --fp16 --source-lang en --target-lang de \
         --arch transformer_wmt_en_de --share-all-embeddings \
         --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
         --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
         --lr 0.0005 --min-lr 1e-09 \
         --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
         --max-tokens 3200 --reset-optimizer --ddp-backend no_c10d

  3. The two baseline models, trained on parallel data only, reach a test BLEU of 28.9 (EN->DE) and 27.84 (DE->EN) on WMT14, and 22.87 (EN->DE) and 24.93 (DE->EN) on WMT12 (see the scoring sketch after this list).
  4. Translated the German News Crawl 2018 monolingual data with the DE->EN model in interactive mode with beam size 5. We removed sentences with more than 10% unk and ended up with 10M synthetic sentence pairs.
  5. Finally, we combined the bitext (upsampled by a factor of 2) with the 10M synthetic pairs and trained with the same settings as in step 2. The best we got after training is 28.63 (EN->DE) on WMT14 and 23.03 (EN->DE) on WMT12.
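
As a reference point, scoring a checkpoint with generate.py looks roughly like the sketch below; the checkpoint path and generation settings are placeholders rather than our exact command.

python generate.py data-bin/16-ende \
    --path checkpoints/transformer_ende/checkpoint_best.pt \
    --gen-subset test --max-tokens 4096 --beam 4 --lenpen 0.6 --remove-bpe
# generate.py prints a corpus-level BLEU score at the end of its output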

The performance on the other years' test sets is worse than the baseline models', which does not match the reported results. Any ideas? Thanks!

myleott commented 5 years ago
    Translated the German News Crawl 2018 monolingual data with the DE->EN model in interactive mode with beam size 5. We removed sentences with more than 10% unk and ended up with 10M synthetic sentence pairs.

From the original paper: "For the backtranslation experiments we use the German monolingual newscrawl data distributed with WMT’18 comprising 226M sentences after removing duplicates."

Why do you only use 10M sentences?

Please also refer to Figure 5 in the original paper, which used 5M bitext and 24M synthetic. We don't see very large gains from backtranslation with beam search in this setting. You should try generating the backtranslations using sampling (--sampling) instead of beam search.
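
For example, the back-translations could be generated roughly like this (paths are placeholders and exact flags can differ between fairseq versions, so treat this as a sketch):

python interactive.py data-bin/wmt18_en_de --buffer-size 16 \
    --input mono.de \
    --path checkpoints/de-en/checkpoint_best.pt \
    --source-lang de --target-lang en \
    --sampling --beam 1 --nbest 1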

rationalisa commented 5 years ago

Thanks for your reply!

    Please also refer to Figure 5 in the original paper, which used 5M bitext and 24M synthetic. We don't see very large gains from backtranslation with beam search in this setting. You should try generating the backtranslations using sampling (--sampling) instead of beam search.

Thanks for emphasizing the effectiveness of the sampling method. We will definitely try sampling instead of beam search, together with a larger amount of monolingual data, to see whether it improves further.

However, according to Figure 1 in Understanding Back-Translation at Scale, the improvements at 11M and 17M total training sentences are about 0.8 and 0.7 BLEU respectively. Our experiment with 4.6M parallel pairs and 10M synthetic pairs (about 15M in total) only yields a 0.16 improvement on WMT12.

It seems like a mismatch. Is my understanding correct?

edunov commented 5 years ago

Hi @WZX1998

Yes, 10M back-translated sentences should definitely improve translation quality. There might be some issue in your setup; can you please check a few details of your pipeline?

rationalisa commented 5 years ago

Hi @edunov, I appreciate your time and help!

  1. We used 9 GPUs with an update frequency of 14 to simulate the 128-GPU setting, and trained for roughly 100K updates.
  2. We trained the back-translation model from scratch.
  3. Yes, we applied the BPE codes learned on the bitext to the back-translated data. We also noticed the influence of unk tokens, so we removed sentences with more than 10% unk; the remaining fraction of unk tokens is below 1%.
  4. Our upsampling method might be the problem: we simply concatenated two copies of the 4.6M parallel data with the 10M back-translated data to form the new training set (see the sketch under step 5 below). Is there any problem with that approach?

Here is our detailed script:

0 Preprocessed parallel data using scripts from https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-wmt14en2de.sh

1 Train models in both directions using the transformer-big architecture on 9 RTX 2080 Ti GPUs

python train.py data-bin/16-ende --update-freq 14 --fp16 --source-lang en --target-lang de \
    --arch transformer_wmt_en_de --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 \
    --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3200 --reset-optimizer --ddp-backend no_c10d

2 Use the learned BPE to tokenize the monolingual data

SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
BPEROOT=subword-nmt
CORPORA=( "news2018" )
src=de
tgt=en
lang=de-en
prep=mono2018bpe
tmp=$prep/tmp
orig=mono2018

mkdir -p $orig $tmp $prep

echo "pre-processing train data..." for l in $src; do for f in "${CORPORA[@]}"; do cat $orig/$f.$l | \ perl $NORM_PUNC $l | \ perl $REM_NON_PRINT_CHAR | \ perl $TOKENIZER -threads 8 -l $l >> $tmp/train.$l done done

BPE_CODE=code

for L in $src; do
    for f in train.$L; do
        echo "apply_bpe.py to ${f}..."
        python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f
    done
done

3 Apply the previously learned dictionary to binarize the monolingual data

TEXT=data/mono2018
python3 preprocess.py --source-lang de --only-source --workers 24 \
    --trainpref $TEXT/train \
    --destdir data-bin/mono18 --srcdict $TEXT/dict.de.txt

4 Translate DE to EN using the DE->EN model

python3 interactive.py data-bin/16-ende --buffer-size 16 \
    --input data/mono_de/train.de.txt \
    --path checkpoints/transformer_deen/checkpoint_best.pt \
    --beam 5 --source-lang de --target-lang en
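
The EN hypotheses then have to be extracted from the interactive.py output to pair them with the German input. A rough sketch, assuming the output of the command above is redirected to a file, here called mono.out (fairseq prints tab-separated H- lines for the hypotheses):

grep ^H mono.out | sed 's/^H-//' | sort -n -k1,1 | cut -f3 > synthetic.en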

5 Concatenate the parallel data and the synthetic data with an upsample rate of 2:1
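
A minimal sketch of the 2:1 concatenation itself; all file names are placeholders (bitext.{en,de} for the BPEed parallel data, synthetic.{en,de} for the back-translated pairs), and trainmix matches the --trainpref used in the binarization step below.

for l in en de; do
    cat bitext.$l bitext.$l synthetic.$l > data/mono2018/trainmix.$l
done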

Then binarize again to create the final dataset:

TEXT=data/mono2018
CUDA_VISIBLE_DEVICES=0 python3.6 preprocess.py --source-lang en --target-lang de --workers 24 \
    --trainpref $TEXT/trainmix \
    --destdir data-bin/mono18/mix \
    --srcdict data-bin/16-ende/dict.de.txt --tgtdict data-bin/16-ende/dict.de.txt

6 Repeat step 1 with the updated dataset

Result on WMT12

Baseline: 22.87 (EN->DE) and 24.93 (DE->EN). Back-translation: 23.03 (EN->DE) for one trial run earlier, and 22.79 (EN->DE) for another trial run yesterday.

edunov commented 5 years ago

@rationalisa

The only obvious difference from our setup is "we removed sentences with more than 10% of unk"; we didn't have this step. How many such sentences did you have? Are they in some other language?

Also, can you share your training log for the back-translation run?

rationalisa commented 5 years ago

@edunov

  1. We downloaded News Crawl 2018: 38,647,678 sentences -> remove sentences with more than 10% unk -> 38,638,664 -> remove sentences longer than 250 tokens -> 38,459,550. We followed section 4.1 of Improving Neural Machine Translation Models with Monolingual Data. The unk step did not filter out many sentences, and we sampled 10M from the resulting ~38M sentences (a rough sketch of this filtering is at the end of this comment).

  2. Regarding the dataset, your paper mentions:

    we use the German monolingual newscrawl data distributed with WMT’18 comprising 226M sentences after removing duplicates.

    Did you mean that you used all years of News Crawl data and then removed duplicate sentences, without removing unk? Were the different amounts of training data in Figure 1 randomly sub-sampled from those 226M sentences?

  3. We lost the log, but here is a recovered record of the first few epochs: backlog.txt
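
For reference, the length filter and the 10M subsampling from item 1 look roughly like this (file names are placeholders; the unk filter additionally requires the model dictionary and is omitted here):

awk 'NF <= 250' news2018.de > news2018.len250.de
shuf -n 10000000 news2018.len250.de > train.de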

Jack000 commented 5 years ago

I was also getting poor translation quality, and after a sanity check I realized that certain named-entity phrases were being translated to gibberish. After spending a lot of time on it, I found that the cause of my problem was that I had trained the dictionary and BPE codes on the monolingual dataset, which was much larger than the parallel dataset. Because of this, certain words in the dictionary were never seen during back-translation training. When generating synthetic data, the model became confused because it encountered tokens it had never seen during training.

I ultimately just used separate dict/BPE codes for back-translation and forward translation, and that solved my problem. Anyway, this may or may not be the cause of your issue, but it could be worth looking into.

rationalisa commented 5 years ago

Hi @Jack000, our issue is different: we didn't train the dictionary and BPE codes on the monolingual dataset.

    I ultimately just used separate dict/BPE codes for back-translation and forward translation, and that solved my problem.

Since the source and target sentences share the dictionary and BPE during standard EN-DE training, we believe the back-translation model and the forward translation model can use the same dictionary/BPE. I am confused.

edunov commented 5 years ago

@rationalisa From your train log it seems that your training loss goes down really well, but the validation loss doesn't. I'd suspect overfitting, but the dropout is already 0.3, so that is unlikely.

Probably you have some train/validation data mismatch, e.g. they are processed differently, BPE is applied differently, or the training data is too easy for some other reason. It is very hard to say what exactly is happening without access to the data; could you upload a text version somewhere?
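
One quick sanity check (just a sketch, with placeholder file names) is to compare the most frequent BPE symbols in the train and validation data; a large difference would point to a processing mismatch:

tr ' ' '\n' < bpe.train.de | sort | uniq -c | sort -rn | head -20
tr ' ' '\n' < bpe.valid.de | sort | uniq -c | sort -rn | head -20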

And yes, we didn't remove unks. Where do you see the unks: on the synthetic side or on the monolingual side?

Jack000 commented 5 years ago

    Since the source and target sentences share the dictionary and BPE during standard EN-DE training, we believe the back-translation model and the forward translation model can use the same dictionary/BPE. I am confused.

Since the monolingual data is much larger than the parallel data, there may be words/subwords that occur frequently in the monolingual data but not in the parallel data. Just as an example, if the data is scraped from the web there may be many &amp; tokens in the mono data but none in the parallel data. When you create the dict/BPE codes from the combined data, the &amp; token is included. When you train the back-translation model, the model never sees &amp; because it's not in the parallel data. But when generating synthetic data for forward translation, the model will encounter an &amp; token and produce random output.
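
A quick way to check whether this is happening (just a sketch; the file names are placeholders for the BPEed monolingual and parallel training data):

tr ' ' '\n' < bpe.mono.de | sort -u > mono.vocab
tr ' ' '\n' < bpe.bitext.de | sort -u > bitext.vocab
# symbols seen in the monolingual data but never in the bitext
comm -23 mono.vocab bitext.vocab | head -20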

This was an issue for me, but my task isn't EN-DE translation, so it's entirely possible that your issue is a different one altogether.

rationalisa commented 5 years ago

@Jack000 Probably a different issue, but thanks! @edunov We see unk in the monolingual data, and here is our dataset: https://drive.google.com/drive/folders/17zILFRFmGFApT9-YS48QZmeXAhiE8pSA?usp=sharing

SunbowLiu commented 5 years ago

You are using the transformer base model (--arch transformer_wmt_en_de) instead of the big model (--arch transformer_vaswani_wmt_en_de_big). To run the big model on 2080 Ti cards you would need a much smaller batch size and a much larger update frequency.
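
For reference, a big-model run on 2080 Ti cards might look roughly like this; the --max-tokens and --update-freq values are only illustrative and need to be tuned to the available memory:

python train.py data-bin/16-ende --fp16 --source-lang en --target-lang de \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0005 \
    --dropout 0.3 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 1536 --update-freq 64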

rationalisa commented 5 years ago

@SunbowLiu Thanks for pointing out the possible mistake!

  1. For the architecture, I apologize for the typo. I double-checked the script I used to run the back-translation, and I did use --arch transformer_vaswani_wmt_en_de_big.
  2. We did use an update frequency of 14 with 9 GPUs to reproduce the 128-GPU setting of the original paper.
myleott commented 4 years ago

I've added a more detailed README for reproducing the BT results from the paper: https://github.com/pytorch/fairseq/tree/master/examples/backtranslation#training-your-own-model-wmt18-english-german