How to train a simple, vanilla transformers translation model from scratch with Fairseq

moyid commented 5 years ago

I have been familiarizing myself with the fairseq library recently, and have tried a couple of pretrained models. I thought that a good way to teach myself would be to train a plain vanilla transformers model with the data I have, and then I can modify and maybe add bells and whistles like pre-training from there. The fairseq documentation has an example of this with fconv architecture, and I basically would like to do the same with transformers.

Below is the code I tried:

In data preparation, I cleaned the data with moses script, tokenized words, and then applied BPE using subword-nmt, where I set number of BPE tokens to 15000.

For preprocessing:

fairseq-preprocess --source-lang zh --target-lang en \
    --trainpref data/train --validpref data/valid --testpref data/test \
    --joined-dictionary \
    --destdir data-bin \
    --workers 20

For training:

CUDA_VISIBLE_DEVICES=0,1,2,3

fairseq-train data-bin \
    --clip-norm 0.1 --dropout 0.2 --max-tokens 2048 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt \
    --criterion label_smoothed_cross_entropy \
    --lazy-load \
    --update-freq 4 \
    --keep-interval-updates 100 --save-interval-updates 3000  --log-interval 50 \
    --arch transformer --save-dir checkpoints/transformer

I trained this on a data set of ~19M samples, on 4 NVIDIA P100 GPUs, for about 8 hours -- at that point I had completed 1 epoch and a bit more. I tested this against my checkpoints -- for the first checkpoint at update 3000, the prediction was all "the the the"s -- but that might be ok because it was just the first checkpoint. However, I then tested this against the last checkpoint, and the prediction was the same sentence for all test samples!! -- The prediction was "committee on the peaceful uses of outer space" for everything, and the BLEU score was 0. My test set is not at all about outer space.

So after this extremely disappointing result, I realized that I should ask for some pointers on creating a basic transformers model:

First of all, is my result actually within expectation? The paper on which transformer.py is based, Jointly Learning to Align and Translate, stated that state of the art results are achieved on 64 Volta GPUs for 30k updates (!!!) -- my set up was much smaller, so maybe the result was expected? However, I have achieved better results in less time with less data, so I doubt that. Is it just that the learning rate was not set right so that it got stuck in some weird local minima? Or are there more things wrong with my setup above?
When would the above model stop? max_epoch and max_update are not required parameters and are set to math.inf when not given. From train.py, it looks like training goes on until learning rate gets to below args.min_lr, however I can't find where min_lr is set, and it is not a parameter in the documentation, so what is min_lr? Is it 0?
What is the best architecture to use for the "vanilla" transformer model that I'm looking for?

Thank you!

edunov commented 5 years ago

It will be easier to help you if you provide output logs for both training and preprocessing, there are many things that could have gone wrong. Your setup seems correct at a first glance, but the results seem too weak, so either model didn't train correctly or there were some issue with data processing or something else... Normally you'd see performance just a few BLEU points worse than final performance after 1 epoch on such a large dataset.

Did you by any chance train en-zh model instead of zh-en? Can you try scoring it in reverse direction?

Things that I'd suggest to try first:

remove --update-freq 4 you really only need it to push the batch size up to hopefully squeeze some bits of performance, but it slows down training a lot.
start with smaller models, e.g. --arch transformer_iwslt_de_en that will train a lot faster, and you will get a sense of how training should work
since you're doing Chinese, tokenization is very important, we used to use this tokenizer: https://github.com/fxsjy/jieba to get decent results. Are you using moses?

moyid commented 5 years ago

Thank you @edunov ! First, it is at least good to know that my set up seems correct.

remove --update-freq 4 you really only need it to push the batch size up to hopefully squeeze some bits of performance, but it slows down training a lot. I added this because I read that it speeds up training here ("Delayed updates can also improve training speed by reducing inter-GPU communication costs and by saving idle time caused by variance in workload across GPUs."), and I think that it did improve speed. I wasn't concerned with squeezing out more performance. I'm not concerned with that just yet, I just wanted to validate running a simple model.
- start with smaller models, e.g. --arch transformer_iwslt_de_en that will train a lot faster, and you will get a sense of how training should work I'll try this. I actually stayed away from that model because it seemed like a model customized to German > English translation, as the name implies, and I wanted a generic model. If it works for Chinese, that's great.
- since you're doing Chinese, tokenization is very important, we used to use this tokenizer: https://github.com/fxsjy/jieba to get decent results. Are you using moses? I also used Jieba for Chinese, and I used moses for English. After that I applied BPE.

Speaking of direction, I'm not sure how to generate in the opposite direction -- my generate script did not specify, and from the docs (here), I don't see parameters for source and target? I also did not specify source and target in the training script, likewise because I didn't see such an option. The option only existed in preprocessing. Should I specify source and target in my training script?

My generating script, btw, is here:

fairseq-generate data-bin \
    --gen-subset test \
    --path $1 \
    --beam 5 \
    --remove-bpe

where $1= path to checkpoint.

I'm re-running my script with logging so that I can upload that.

moyid commented 5 years ago

Here are the outputs that write to my screen while training. (I cannot find another log.) logs.txt

edunov commented 5 years ago

"Delayed updates can also improve training speed by reducing inter-GPU communication costs and by saving idle time caused by variance in workload across GPUs." - yes, but there are different notions of speed, with delayed updates you increase number of words per second, while what I suggested is to get 30k updates as quickly as possible (words per second speed will drop).

"it seemed like a model customized to German > English translation" - no, it is not customized for specific language direction, rather it is customized to certain dataset size.

Re direction: you can use --source-lang and --target-lang for both fairseq-generate and fairseq-train. So, can you try both ways for a model you generated?

Also, the log you've attached is only the beginning, can you please add entire training log? And also the output of fairseq-preprocess

moyid commented 5 years ago

Attached is the output of preprocessing preprocess-log.txt

I don't have more of the log. However, I trained a new model, this specifying source and target:

CUDA_VISIBLE_DEVICES=0,1,2,3

fairseq-train data-bin \
    --source-lang zh --target-lang en \
    --clip-norm 0.1 --dropout 0.2 --max-tokens 2048 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt \
    --criterion label_smoothed_cross_entropy \
    --lazy-load \
    --update-freq 4 \
    --log-format json \
    --keep-interval-updates 100 --save-interval-updates 3000  --log-interval 50 \
    --arch transformer --save-dir checkpoints/transformer

I stopped it after the first checkpoint, and used generate -- the hypotheses are all "the" repeated 200 times, for every test sentence. Attached is the output of generate:

first_checkpoint.txt

mozharovsky commented 4 years ago

@moyid, based on your preprocessing log I'd suggest trying to use two separate vocabs for en and zh pairs. It also makes sense to play around with numbers of words to retain in both source and target languages. Last but not least, you may wish to filter out long sentences if ones are present in your dataset.

As for the training process, choose a smaller portion of your dataset to optimize training hyper-parameters (especially, warmups and learning rates) of a transformer based model. Your results likely imply that the perplexity is too high which in its turn points out an optimization problem.

freddy5566 commented 4 years ago

Hi @moyid :

do you figure out why you got 0 on BLEU score, I follow the example and also got 0 I am wondering can you share your pre-processing script? Or give me some advice.

Thank you

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

cgr71ii commented 3 years ago

I've been dealing with the same problem, but for other language pair (et-en). I solved my problem adding the following flags to fairseq-train:

--lr-scheduler inverse_sqrt --warmup-updates 8000 --warmup-init-lr 1e-7

After these flags were added, the training started to work perfectly. Check out https://arxiv.org/pdf/1706.03762.pdf#optimizer and https://www.borealisai.com/en/blog/tutorial-17-transformers-iii-training/ --> "Learning rate warm-up"

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

nxpeng9235 commented 2 years ago

I've been dealing with the same problem, but for other language pair (et-en). I solved my problem adding the following flags to fairseq-train:
--lr-scheduler inverse_sqrt --warmup-updates 8000 --warmup-init-lr 1e-7
After these flags were added, the training started to work perfectly. Check out https://arxiv.org/pdf/1706.03762.pdf#optimizer and https://www.borealisai.com/en/blog/tutorial-17-transformers-iii-training/ --> "Learning rate warm-up"

Hi @cgr71ii

Did u encounter the issue with the same prediction in evaluation and getting 0 BLEU during training with fairseq? And is there any other trick to solve this issue other than increasing the warmup-updates steps and warmup-init-lr?

Thank you!

cgr71ii commented 2 years ago

Hi, @TheodorePeng,

Yes, the BLEU value was close to 0 in training and evaluation. The problem was related to the fact that I was not using a LR scheduler. As I said, you can check out the transformers paper and the other link. In the paper, the LR scheduler is mandatory, but they only mentioned the step and didn't give the necessary importance it has. In the article of the other link, you can see that a reason about why the LR scheduler is necessary is tried to be found. When you train a transformers, a LR scheduler is needed, I'm not sure why. So I solved the issue thanks, mainly, to --lr-scheduler inverse_sqrt, and the other flags are values you can try to optimize to get better results.

nxpeng9235 commented 2 years ago

Hi @cgr71ii

Thanks a lot but unfortunately it is not the case for me. My code failed because I built a new encoder with customized name and the SequenceGenerator was trying to call the forward() function of the original one. The LR scheduler and the warmup process, from my understanding, is needed because the training process is easily to get collapsed in the beginning stage if we set the LR too large.

facebookresearch / fairseq

How to train a simple, vanilla transformers translation model from scratch with Fairseq #1239