Closed frankang closed 2 years ago
@frankang Hi, thanks for your attention. To reproduce our results, make sure you use the following training arguments:
```
--arch transformer_vaswani_wmt_en_de_big \
--share-all-embeddings \
--optimizer adam --lr 0.001 -s $src -t $tgt \
--label-smoothing 0.1 --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
--max-tokens 4096 \
--update-freq 16 \
--lr-scheduler inverse_sqrt --weight-decay 0.0 \
--criterion reg_label_smoothed_cross_entropy \
--reg-alpha 5 \
--fp16 \
--max-update 300000 --warmup-updates 6000 --warmup-init-lr 1e-07 --adam-betas '(0.9,0.98)' \
```
@dropreg Thanks! BTW, how many GPUs do you use?
I would also like to know this, @dropreg. Am I right to assume you used the default fairseq value for --distributed-world-size, which is 1? What did you end up doing, @frankang?
I tested it on a 4-GPU setup and changed --update-freq to 4. I got around a 1 BLEU increase compared to the original transformer-big arch.
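For what it's worth, scaling --update-freq down when adding GPUs keeps the effective batch size unchanged; a quick sanity check, assuming fairseq's tokens-per-update is max-tokens per GPU × number of GPUs × update-freq:

```python
# Effective tokens per optimizer update in fairseq:
# max_tokens (per GPU) x num_gpus x update_freq
def effective_tokens(max_tokens, num_gpus, update_freq):
    return max_tokens * num_gpus * update_freq

single_gpu = effective_tokens(4096, 1, 16)  # settings from the script above, 1 GPU
four_gpu = effective_tokens(4096, 4, 4)     # 4 GPUs with update-freq lowered to 4
print(single_gpu, four_gpu)                 # both give 65536 tokens per update
```

So the 4-GPU run with update-freq 4 matches the single-GPU run with update-freq 16.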
@frankang when comparing the two, did you train both configs for the same number of steps? I mean, de facto the drop-reg config performs twice the number of model forward passes.
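For context on the "twice the forwards" point: the reg_label_smoothed_cross_entropy criterion implements an R-Drop-style objective, which feeds each batch through the model twice (two different dropout masks) and adds a symmetric KL term between the two output distributions, scaled by --reg-alpha. A minimal pure-Python sketch on toy logit vectors, not fairseq's actual implementation (function names and the plain cross entropy without label smoothing are illustrative):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a flat list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, target_idx):
    return -math.log(probs[target_idx])

def kl(p, q):
    # KL(p || q) for two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def r_drop_loss(logits1, logits2, target_idx, alpha=5.0):
    # logits1/logits2 come from two forward passes with different dropout masks.
    p, q = softmax(logits1), softmax(logits2)
    ce = cross_entropy(p, target_idx) + cross_entropy(q, target_idx)
    reg = 0.5 * (kl(p, q) + kl(q, p))  # symmetric KL between the two passes
    return ce + alpha * reg
```

If the two passes produce identical logits, the KL term vanishes and the loss reduces to twice the usual cross entropy, which is why per-step compute roughly doubles while the optimizer step count stays the same.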
Hi, I was trying to reproduce the result on the WMT14 En-De dataset, but was unable to get the BLEU increase shown in the paper. Could you share the training script for that? Thanks!