harshyadav17 closed this issue 3 months ago
The code for the distillation procedure is available here.
Yes, it uses a custom fork of fairseq VarunGumma/fairseq as outlined in the Distillation branch.
The `share_decoder_input_output_embed` flag enables weight tying: as the name suggests, it ties the weights of the decoder's input embedding to the weights of its output projection. This is a common technique that reduces the number of parameters in the model.
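To illustrate the idea (this is a minimal numpy sketch of weight tying in general, not the fairseq implementation), the embedding table and the output projection are literally the same matrix, so the parameters are stored once:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 8, 4

# One shared matrix serves as both the input embedding table
# and the output projection (logits = hidden @ E.T).
E = rng.normal(size=(vocab, dim))

def embed(token_ids):
    return E[token_ids]      # input embedding lookup

def project(hidden):
    return hidden @ E.T      # output projection reuses the same E

h = embed(np.array([1, 3]))  # shape (2, dim)
logits = project(h)          # shape (2, vocab)

# Because E is used in both places, the model stores vocab * dim
# parameters once instead of twice.
assert logits.shape == (2, vocab)
```

In a real transformer decoder, gradients from both the embedding lookup and the output projection flow into the single shared parameter.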
Hey @prajdabre @PranjalChitale
I was going through the paper's description of the model distillation, but couldn't find the relevant source code in this repo. Are we directly using the following repo: https://github.com/VarunGumma/fairseq/tree/main?tab=readme-ov-file with the following arguments:
--teacher-checkpoint-path $teacher_ckpt --task translation_with_kd --criterion label_smoothed_cross_entropy_with_kd --kd-args '{"strategy": "word_level"}'
The paper mentions that KL divergence is used for the teacher-student training. Also, could you please comment further on share_decoder_input_output_embed?
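For concreteness, here is a minimal numpy sketch of what word-level KD with KL divergence computes; the function name and the optional temperature are illustrative assumptions, not the fork's exact code:

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def word_level_kd_loss(student_logits, teacher_logits, T=1.0):
    """KL(teacher || student) per target position, averaged over positions."""
    t_logp = log_softmax(teacher_logits / T)
    s_logp = log_softmax(student_logits / T)
    t_p = np.exp(t_logp)
    # KL divergence for each token position, summed over the vocabulary.
    kl = (t_p * (t_logp - s_logp)).sum(axis=-1)
    return kl.mean()

rng = np.random.default_rng(0)
student = rng.normal(size=(5, 16))  # (target positions, vocab size)
teacher = rng.normal(size=(5, 16))

loss = word_level_kd_loss(student, teacher)
assert loss >= 0.0                  # KL divergence is non-negative
# When the student matches the teacher exactly, the KD loss is zero.
assert np.isclose(word_level_kd_loss(teacher, teacher), 0.0)
```

The student is pushed to match the teacher's full output distribution at every target token, rather than only the one-hot reference label.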
It would be really helpful if you could share the script/syntax for training and for obtaining the correct model architecture with weight initialisation.
Thanks!