MorinoseiMorizo / jparacrawl-finetune

An example usage of JParaCrawl pre-trained Neural Machine Translation (NMT) models.
http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/

Parameters for fine-tuning big model #8

Open leminhyen2 opened 2 years ago

leminhyen2 commented 2 years ago

Can you help me check whether these fairseq parameters are the same as the ones you used when you fine-tuned the big model with JESC?

Compared with your setting, I changed arch to transformer_vaswani_wmt_en_de_big, max-tokens to 2000, and update-freq to 40. This is based on this excerpt from the paper: "For the big settings, we set the mini-batch size to 2,000 tokens and accumulated 160 mini-batches for updates."

Is this correct, or have I missed some details?

!python3 "/content/fairseq/train.py" "/content/training_process/preprocessed_data" \ --restore-file "$DIVE_LOCATION/models/ja_en/$MODEL_NAME_TO_CONTINUE_TRAINING/weights.pt" \ --arch transformer_vaswani_wmt_en_de_big \ --optimizer adam \ --adam-betas '(0.9, 0.98)' \ --clip-norm 1.0 \ --lr-scheduler inverse_sqrt \ --warmup-init-lr 1e-07 \ --warmup-updates 4000 \ --lr 0.001 \ --min-lr 1e-09 \ --dropout 0.3 \ --weight-decay 0.0 \ --criterion label_smoothed_cross_entropy \ --label-smoothing 0.1 \ --max-tokens 2000 \ --max-update 28000 \ --save-dir "/content/training_process/model" \ --no-epoch-checkpoints \ --save-interval 10000000000 \ --validate-interval 1000000000 \ --save-interval-updates 100 \ --keep-interval-updates 8 \ --log-format simple \ --log-interval 5 \ --ddp-backend no_c10d \ --update-freq 40 \ --fp16 \ --seed 42

MorinoseiMorizo commented 2 years ago

I think your parameters are OK, but I have one thing to check: how many GPUs will you use for training? If you want to accumulate 160 mini-batches per update, you might need to change update-freq. The number of accumulated mini-batches will be update-freq × the number of GPUs. Thus, if you use 4 GPUs, update-freq 40 is right (40 × 4 = 160). But if the number of GPUs is different, please change the value accordingly.
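
As a rough shell sketch of that arithmetic (assuming the 160-mini-batch target from the paper; NUM_GPUS is a placeholder):

# Accumulated mini-batches per update = --update-freq * (number of GPUs),
# so solve for --update-freq given a target of 160.
TARGET_ACCUM=160
NUM_GPUS=4          # placeholder; e.g. 4 GPUs as in the example above
echo "--update-freq $((TARGET_ACCUM / NUM_GPUS))"   # 40 for 4 GPUs, 160 for 1 GPU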

leminhyen2 commented 2 years ago

Thanks, I will be using only 1 GPU on Google Colab (it's often a P100, but if I'm lucky I get a V100).

What should I change update-freq to? Should I change --max-tokens too? And does mini-batch accumulation affect accuracy, or does it only affect training speed?