Closed: WZX1998 closed this issue 4 years ago.

Hi, we tried to reproduce back-translation on EN->DE following Understanding Back-Translation at Scale, but the BLEU score did not improve much after adding monolingual data. What we did:

- Translate the DE NewsCrawl 2018 monolingual data with the DE->EN model in interactive mode, using beam size 5. We removed sentences with more than 10% unk tokens and ended up with 10M synthetic sentences.

The performance on the other test years is worse than the baseline models, which does not match the reported results. Any ideas? Thanks!
From the original paper: "For the backtranslation experiments we use the German monolingual newscrawl data distributed with WMT’18 comprising 226M sentences after removing duplicates."
Why do you only use 10M sentences?
Please also refer to Figure 5 in the original paper, which used 5M bitext and 24M synthetic. We don't see very large gains from backtranslation with beam search in this setting. You should try generating the backtranslations using sampling (`--sampling`) instead of beam search.
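As a sketch of what that looks like with the interactive.py pipeline used later in this thread (data and checkpoint paths are copied from there; with `--beam 1` the model draws a single unrestricted sample per sentence):

```bash
python3 interactive.py data-bin/16-ende --buffer-size 16 \
    --input data/mono_de/train.de.txt \
    --path checkpoints/transformer_deen/checkpoint_best.pt \
    --sampling --beam 1 --nbest 1 \
    --source-lang de --target-lang en
```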
Thanks for your reply!
> Please also refer to Figure 5 in the original paper, which used 5M bitext and 24M synthetic. We don't see very large gains from backtranslation with beam search in this setting. You should try generating the backtranslations using sampling (`--sampling`) instead of beam search.
Thanks for emphasizing the effectiveness of the sampling method. We will definitely try sampling instead of beam search, as well as a larger amount of monolingual data, to see more improvement.
However, according to Figure 1 in Understanding Back-Translation at Scale, the improvements at 11M and 17M total training sentences are about 0.8 and 0.7 BLEU respectively. Our experiment with 4.6M parallel sentences and 10M synthetic sentences (15M in total) only yields a 0.16 improvement on WMT12.
This seems like a mismatch. Is my understanding correct?
Hi @WZX1998
Yes, 10M back-translated sentences should definitely improve translation quality. There might be some issue in your setup; can you please check:
Hi @edunov, I appreciate your time and help! Here is our setup:
Training command:

```bash
python train.py data-bin/16-ende --update-freq 14 --fp16 \
    --source-lang en --target-lang de \
    --arch transformer_wmt_en_de --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 3200 --reset-optimizer --ddp-backend no_c10d
```
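A side note on this command: the effective batch size works out to `max-tokens * update-freq * num_GPUs`; the GPU count isn't stated in the thread, so it is left as a variable below.

```bash
# 3200 max-tokens x 14 update-freq = 44800 tokens per GPU per update
echo "$((3200 * 14)) tokens per GPU per update (multiply by the number of GPUs)"
```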
Preprocessing of the monolingual data (normalization, tokenization, BPE, binarization):

```bash
SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
BPEROOT=subword-nmt
CORPORA=( "news2018" )

src=de
tgt=en
lang=de-en
prep=mono2018bpe
tmp=$prep/tmp
orig=mono2018

mkdir -p $orig $tmp $prep

echo "pre-processing train data..."
for l in $src; do
    for f in "${CORPORA[@]}"; do
        cat $orig/$f.$l | \
            perl $NORM_PUNC $l | \
            perl $REM_NON_PRINT_CHAR | \
            perl $TOKENIZER -threads 8 -l $l >> $tmp/train.$l
    done
done

BPE_CODE=code
for L in $src; do
    for f in train.$L; do
        echo "apply_bpe.py to ${f}..."
        python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f
    done
done

TEXT=data/mono2018
python3 preprocess.py --source-lang de --only-source --workers 24 \
    --trainpref $TEXT/train \
    --destdir data-bin/mono18 --srcdict $TEXT/dict.de.txt
```
Generating the synthetic data with the DE->EN model (beam search, beam 5):

```bash
python3 interactive.py data-bin/16-ende --buffer-size 16 \
    --input data/mono_de/train.de.txt \
    --path checkpoints/transformer_deen/checkpoint_best.pt \
    --beam 5 --source-lang de --target-lang en
```
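The target side of the synthetic data can then be extracted from the interactive.py output by keeping the hypothesis lines (a sketch; it assumes the output above was redirected to a file, here called `bt_out.txt`, and the output path is a placeholder):

```bash
# H-lines are tab-separated: sentence id, score, hypothesis
grep '^H-' bt_out.txt | cut -f3 > train.bt.en
```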
Binarizing the mixed (parallel + synthetic) training data:

```bash
TEXT=data/mono2018
CUDA_VISIBLE_DEVICES=0 python3.6 preprocess.py --source-lang en --target-lang de --workers 24 \
    --trainpref $TEXT/trainmix \
    --destdir data-bin/mono18/mix \
    --srcdict data-bin/16-ende/dict.de.txt --tgtdict data-bin/16-ende/dict.de.txt
```
Results: baseline 22.87 (EN->DE) and 24.93 (DE->EN); back-translation 23.03 (EN->DE) for one trial we ran earlier, and 22.79 (EN->DE) for another trial we ran yesterday.
@rationalisa
The only obvious difference with our setup is "we removed sentences with more than 10% of unk". We didn't have this step. How many sentences like this did you have? Are these sentences in some other language?
Also, can you share your training log for backtranslation run?
@edunov
We downloaded NewsCrawl 2018: 38,647,678 sentences -> remove sentences with more than 10% unk -> 38,638,664 -> remove sentences longer than 250 tokens -> 38,459,550. We did so following Section 4.1 of Improving Neural Machine Translation Models with Monolingual Data. The unk step did not filter out many sentences, and we sampled 10M from the resulting 38M sentences.
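For context, a minimal sketch of this kind of filtering (for illustration only, not the script from this thread; it assumes the text has already been mapped to the training vocabulary so that out-of-vocabulary tokens appear as the literal string `<unk>`, and the file names are placeholders):

```bash
# Keep sentences with at most 250 tokens and at most 10% <unk> tokens.
awk '{
    unk = 0
    for (i = 1; i <= NF; i++) if ($i == "<unk>") unk++
    if (NF <= 250 && unk <= 0.1 * NF) print
}' mono.tok.de > mono.filtered.de
```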
Regarding the dataset, your paper mentions:

> we use the German monolingual newscrawl data distributed with WMT’18 comprising 226M sentences after removing duplicates.

Did you mean that you used all years of NewsCrawl data and then removed duplicate sentences, without removing unk? Were the different amounts of training data in Figure 1 randomly sub-sampled from those 226M sentences?
We lost the log, but here is a recovered record of the first few epochs: backlog.txt
I was also getting poor translation quality, and after a sanity check I realized that certain named-entity phrases were being translated to gibberish. After spending a lot of time on it, I found that the cause of my problem was that I had trained the dictionary and BPE codes on the monolingual dataset, which was much larger than the parallel dataset. Because of this, certain words in the dictionary were never used during back-translation training. When generating synthetic data, the model became confused because it encountered tokens it had never seen during training.
I ultimately just used separate dict/BPE codes for back-translation and forward translation, and that solved my problem. It may or may not be the cause of your issue, but it could be worth looking into.
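One way to check for this kind of mismatch is to diff the token inventories of the two corpora (a sketch; `train.bpe.de` and `mono.bpe.de` are placeholder names for the BPE'd parallel and monolingual source files):

```bash
# List tokens that appear in the monolingual data but never in the parallel data.
tr ' ' '\n' < train.bpe.de | sort -u > parallel_vocab.txt
tr ' ' '\n' < mono.bpe.de  | sort -u > mono_vocab.txt
comm -13 parallel_vocab.txt mono_vocab.txt | head -n 20
```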
Hi @Jack000 Our issue is different. We didn't train the dictionary and BPE codes on the monolingual dataset.

> I ultimately just used separate dict/BPE codes for back-translation and forward translation, and that solved my problem.

Since the source and target sentences share the dict/BPE codes during standard EN-DE training, we believe the back-translation model and the forward translation model can use the same dict/BPE codes. I am confused.
@rationalisa from your train log it seems that your training loss goes down really well, but validation doesn't. I'd suspect overfitting, but the dropout is already 0.3 so it is unlikely.
Probably you have some train/validation data mismatch, e.g. they are processed differently, BPE'd differently, or the training data is too easy for some other reason. It is very hard to say what exactly is happening without access to the data; I wonder if you can upload a text version somewhere?
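One quick way to spot that kind of preprocessing mismatch (a sketch; the file names are placeholders for the BPE'd train and validation files) is to compare the fraction of subword tokens carrying the `@@` continuation marker, which should be roughly similar in both:

```bash
for f in train.bpe.de valid.bpe.de; do
    printf '%s: ' "$f"
    awk '{ total += NF; for (i = 1; i <= NF; i++) if ($i ~ /@@$/) bpe++ }
         END { printf "%.3f of tokens end with @@\n", bpe / total }' "$f"
done
```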
And yes, we didn't remove unks. Where do you see unks: on the synthetic side or on the monolingual side?
> Since the source and target sentences share the dict/BPE codes during standard EN-DE training, we believe the back-translation model and the forward translation model can use the same dict/BPE codes. I am confused.
Since the monolingual data is much larger than the parallel data, there may be words/subwords that occur frequently in the monolingual data but not in the parallel data. Just as an example, if the data is scraped from the web there may be many `&` tokens in the mono data but none in the parallel data. When you create the dict/BPE codes from the combined data, the `&` token is included. When you train the back-translation model, the model never sees `&` because it's not in the parallel data. But when generating synthetic data for forward translation, the model will encounter an `&` token and produce random output.
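A quick check along those lines (a sketch; `&` stands for whatever suspicious token you find, and the file names are placeholders):

```bash
# Per-file counts of lines containing the token; a large count in the mono file
# and zero in the parallel file would point to exactly this problem.
grep -c -F '&' mono.tok.de train.tok.de
```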
This was an issue for me, but my task isn't en-de translation. Also entirely possible that it's a different issue altogether.
@Jack000 Probably a different issue, but thanks! @edunov We saw unk in the monolingual data, and here is our dataset: https://drive.google.com/drive/folders/17zILFRFmGFApT9-YS48QZmeXAhiE8pSA?usp=sharing
You are using the transformer base model (`--arch transformer_wmt_en_de`) instead of the big model (`--arch transformer_vaswani_wmt_en_de_big`). You need to switch to a tiny batch size and a huge update frequency (`--update-freq`) to run the big model on 2080 Ti cards.
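For illustration, a sketch of how the training command from earlier in the thread might change (the `--max-tokens` and `--update-freq` values here are guesses for an 11 GB card, not values from the paper or this thread):

```bash
python train.py data-bin/16-ende --fp16 --source-lang en --target-lang de \
    --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 1536 --update-freq 32 --ddp-backend no_c10d
```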
@SunbowLiu Thanks for pointing out the possible mistake! We will try `--arch transformer_vaswani_wmt_en_de_big`.
I've added a more detailed README for reproducing the BT results from the paper: https://github.com/pytorch/fairseq/tree/master/examples/backtranslation#training-your-own-model-wmt18-english-german