AI4Bharat / indicTrans

indicTranslate v1 - Machine Translation for 11 Indic languages. For latest v2, check: https://github.com/AI4Bharat/IndicTrans2
https://ai4bharat.iitm.ac.in/indic-trans
MIT License

Params for Training en-indic model #23

Closed. TarunTater closed this issue 3 years ago.

TarunTater commented 3 years ago

We are trying to replicate the results from the Samanantar IndicTrans paper. We are training the model for en-hi translation only, currently with the following params, following the paper:

fairseq-train ../en_hi_4x/final_bin \
--max-source-positions=210 \
--max-target-positions=210 \
--save-interval-updates=10000 \
--arch=transformer_4x \
--criterion=label_smoothed_cross_entropy \
--source-lang=SRC \
--lr-scheduler=inverse_sqrt \
--target-lang=TGT \
--label-smoothing=0.1 \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 1.0 \
--warmup-init-lr 1e-07 \
--lr 0.0005 \
--warmup-updates 4000 \
--dropout 0.2 \
--save-dir ../en_hi_4x/model \
--keep-last-epochs 5 \
--patience 5 \
--skip-invalid-size-inputs-valid-test \
--fp16 \
--user-dir model_configs \
--wandb-project 'train_1' \
--max-tokens 300

Can you please share the params you used for training the en-indic model, or, in particular, whether you have tried en-hi separately?

gowtham1997 commented 3 years ago

Hello,

We use the following command for en-indic training.

fairseq-train <exp_dir folder>/final_bin \
--max-source-positions=210 \
--max-target-positions=210 \
--max-update=1000000 \
--save-interval=1 \
--arch=transformer_4x \
--criterion=label_smoothed_cross_entropy \
--source-lang=SRC \
--lr-scheduler=inverse_sqrt \
--target-lang=TGT \
--label-smoothing=0.1 \
--optimizer adam \
--adam-betas "(0.9, 0.98)" \
--clip-norm 1.0 \
--warmup-init-lr 1e-07 \
--lr 0.0005 \
--warmup-updates 4000 \
--dropout 0.2 \
--tensorboard-logdir <exp_dir folder>/tensorboard-wandb \
--save-dir <exp_dir folder>/model \
--keep-last-epochs 5 \
--patience 5 \
--skip-invalid-size-inputs-valid-test \
--fp16 \
--user-dir model_configs \
--wandb-project <project name> \
--update-freq=1 \
--distributed-world-size 4 \
--max-tokens 16384

^ For the results in our paper, we ensured the effective batch size (max_tokens * distributed_world_size * update_freq) was ~64K tokens. We haven't tried training the 4x model only for en-hi.
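As a quick sanity check of the arithmetic described above (a sketch, not part of the thread), the effective batch size and the --update-freq needed to hit roughly 64K tokens on a given number of GPUs can be computed as follows. The 16384 / 4 GPU numbers come from the command above; the helper names and the single-GPU scenario are illustrative assumptions.

# Sketch: effective batch size arithmetic for the fairseq command above.
# Assumes the relationship stated in the thread:
#   effective_batch_tokens = max_tokens * distributed_world_size * update_freq
import math

def effective_batch(max_tokens: int, world_size: int, update_freq: int) -> int:
    # Tokens contributed to each optimizer step.
    return max_tokens * world_size * update_freq

def update_freq_for(target_tokens: int, max_tokens: int, world_size: int) -> int:
    # Gradient-accumulation steps needed to reach at least target_tokens per step.
    return max(1, math.ceil(target_tokens / (max_tokens * world_size)))

# Setup from the reply: 16384 tokens/GPU, 4 GPUs, update_freq 1 -> ~64K tokens/step.
print(effective_batch(16384, 4, 1))      # 65536

# Hypothetical single-GPU run that still targets ~64K tokens per step.
print(update_freq_for(65536, 16384, 1))  # 4 -> pass --update-freq=4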

TarunTater commented 3 years ago

@gowtham1997 - thanks for sharing the params. Any specific reason why you ensured max_tokens * distributed_world_size * update_freq = ~64K? Is it due to memory constraints?

gowtham1997 commented 3 years ago

Sorry, I missed replying to this yesterday.

We observed that larger effective batch sizes utilized the GPUs fully and also gave better results in our initial experiments, hence we chose ~64K. Effective batch sizes > 64K would likely also help, but with time constraints in mind we chose to use ~64K for our paper.

TarunTater commented 3 years ago

Okay, got it. Thank you for the info.