leminhyen2 opened this issue 2 years ago
I think your parameters are OK, but I have one thing to check.
How many GPUs will you use for training?
If you want to accumulate 160 mini-batches for an update, you might need to change update-freq.
The number of accumulated mini-batches will be update-freq × number of GPUs.
Thus, if you use 4 GPUs, update-freq 40 is right (40 × 4 = 160).
But if the number of GPUs is different, please adjust update-freq accordingly.
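For what it's worth, here is a minimal sketch of that arithmetic (the helper is just illustrative, not part of fairseq):

```python
# Minimal sketch: accumulated mini-batches per optimizer update = update_freq * num_gpus,
# so the required --update-freq for a target accumulation is the target divided by GPU count.
# (Hypothetical helper for illustration only, not a fairseq API.)
def update_freq_for(target_accumulated_batches: int, num_gpus: int) -> int:
    return target_accumulated_batches // num_gpus

print(update_freq_for(160, 4))  # 40  -> matches --update-freq 40 on 4 GPUs
print(update_freq_for(160, 1))  # 160 -> what a single-GPU run would need
```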
Thanks, I will be using only 1 GPU on Google Colab (it's often a P100, but if I'm lucky I get a V100).
What update-freq number should I change it to? Should I change --max-tokens too? And does mini-batch accumulation affect accuracy, or does it only affect training speed?
Can you help me check whether these fairseq parameters are the same as the ones you used when fine-tuning the big model with JESC?
In comparison with your setting, I changed arch to transformer_vaswani_wmt_en_de_big, max-tokens to 2000, and update-freq to 40. This is based on this excerpt from the paper: "For the big settings, we set the mini-batch size to 2,000 tokens and accumulated 160 mini-batches for updates".
Is this correct, or have I missed some details?
!python3 "/content/fairseq/train.py" "/content/training_process/preprocessed_data" \ --restore-file "$DIVE_LOCATION/models/ja_en/$MODEL_NAME_TO_CONTINUE_TRAINING/weights.pt" \ --arch transformer_vaswani_wmt_en_de_big \ --optimizer adam \ --adam-betas '(0.9, 0.98)' \ --clip-norm 1.0 \ --lr-scheduler inverse_sqrt \ --warmup-init-lr 1e-07 \ --warmup-updates 4000 \ --lr 0.001 \ --min-lr 1e-09 \ --dropout 0.3 \ --weight-decay 0.0 \ --criterion label_smoothed_cross_entropy \ --label-smoothing 0.1 \ --max-tokens 2000 \ --max-update 28000 \ --save-dir "/content/training_process/model" \ --no-epoch-checkpoints \ --save-interval 10000000000 \ --validate-interval 1000000000 \ --save-interval-updates 100 \ --keep-interval-updates 8 \ --log-format simple \ --log-interval 5 \ --ddp-backend no_c10d \ --update-freq 40 \ --fp16 \ --seed 42