dropreg / R-Drop

Training configuration for the WMT14 EnDe dataset? #19

Closed: frankang closed this issue 2 years ago

frankang commented 2 years ago

Hi, I was trying to reproduce the results on the WMT14 EnDe dataset, but was unable to get the BLEU increase reported in the paper. Could you share the training script for that? Thanks!

dropreg commented 2 years ago

@frankang Hi, thanks for your attention. To reproduce our results, please make sure that:

  1. Process the dataset according to the official fairseq issue: https://github.com/pytorch/fairseq/issues/202
  2. Use this training script (a sketch of what the reg_label_smoothed_cross_entropy criterion computes follows this list):

     --arch transformer_vaswani_wmt_en_de_big \
     --share-all-embeddings \
     --optimizer adam --lr 0.001 -s $src -t $tgt \
     --label-smoothing 0.1 --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
     --max-tokens 4096 \
     --update-freq 16 \
     --lr-scheduler inverse_sqrt --weight-decay 0.0 \
     --criterion reg_label_smoothed_cross_entropy \
     --reg-alpha 5 \
     --fp16 \
     --max-update 300000 --warmup-updates 6000 --warmup-init-lr 1e-07 \
     --adam-betas '(0.9,0.98)'
  3. Use the compound_split_bleu.sh script with the hyper-parameters --beam 4 --lenpen 0.6 for inference.
  4. There may be slight performance differences from ours when using different machines.
  5. Trying other hyper-parameters (e.g., --dropout 0.2 --reg-alpha 3) may give better results. If you still have problems, feel free to contact us.
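
As referenced in item 2, here is a minimal sketch of the objective behind the reg_label_smoothed_cross_entropy criterion, i.e. the R-Drop loss from the paper: two forward passes with independent dropout masks, cross-entropy on both, plus a symmetric KL term weighted by --reg-alpha. The snippet below is an illustrative simplification (a plain classifier with unsmoothed cross-entropy; the actual criterion works on padded token sequences with label smoothing), not the repo's code:

```python
import torch.nn.functional as F

def r_drop_loss(model, x, target, reg_alpha=5.0):
    # Two stochastic forward passes: `model` applies dropout internally,
    # so each call sees a different dropout mask.
    logits1 = model(x)
    logits2 = model(x)

    # Cross-entropy on both passes (label-smoothed in the real criterion,
    # per --label-smoothing 0.1).
    nll = F.cross_entropy(logits1, target) + F.cross_entropy(logits2, target)

    # Symmetric KL between the two predictive distributions.
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(logp1, logp2, log_target=True, reduction="batchmean")
                + F.kl_div(logp2, logp1, log_target=True, reduction="batchmean"))

    # --reg-alpha weights the KL regularizer (5 in the script above).
    return nll + reg_alpha * kl
```

Note the two forward passes per update: this doubles per-step compute relative to the baseline, which is relevant to the question raised at the end of this thread.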
frankang commented 2 years ago

@dropreg Thanks! BTW, how many GPUs did you use?

truebluejason commented 2 years ago

I would also like to know this, @dropreg. Am I right to assume you used the default fairseq value for --distributed-world-size, which is 1? What did you end up doing, @frankang?

frankang commented 2 years ago

I tested it on 4 GPUs and changed --update-freq to 4. I got around a 1 BLEU increase compared to the original transformer-big arch.
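
(Assuming the reference run used a single worker, per the fairseq --distributed-world-size default discussed above, this keeps the effective tokens per update unchanged: 1 GPU × 4096 max-tokens × update-freq 16 = 4 GPUs × 4096 × 4 = 65,536 tokens per update.)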

vince62s commented 8 months ago

@frankang when comparing the two, did you train both configs for the same number of steps? De facto, the R-Drop config runs twice the number of model forward passes per update.