AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Can we use the same repository and README instructions to train the model on different Indic languages (English to Indic)? #95

Closed Sab8605 closed 3 days ago

Sab8605 commented 3 weeks ago

First, thank you for the excellent work on this project; it has been invaluable for our tasks. Can we use the same repository and architecture to train the model for different Indic languages (English to Indic)? If yes:

  1. If I want to train the model for English to another Indic language using the same architecture, could you advise on the hardware requirements, particularly considering the size of my dataset?
  2. Any guidance on the recommended hardware specifications (e.g., GPU) for this training process would be greatly appreciated.

Thank you for your time and support!

PranjalChitale commented 3 weeks ago

Of course, you can use this repository.

For best performance, I recommend fine-tuning rather than training from scratch, especially if the languages you plan to work with share a script with the currently supported set.

As for hardware specifications, they will vary based on the scale of your data, the model architecture you choose, and whether you are fine-tuning or starting from scratch.

singhakr commented 3 weeks ago

So, if I want to train, or preferably fine-tune, on a language pair that is not supported, all I have to do is specify the ISO codes for src_lang and tgt_lang as options to the finetune.sh script? This seems to be the case to me. Could you please confirm?

PranjalChitale commented 3 weeks ago

If the language has sufficient coverage in the current vocabulary (say, Devanagari-script languages like Awadhi, Bhojpuri, Magahi, etc.), then there is no need for any vocabulary extension. The easiest approach is to edit the fairseq dictionary (source side) and replace an unused token (you may find some junk tokens such as Chinese characters, which are optimal candidates for replacement) with the language tag you intend to use. You would also have to add this language code to FLORES_CODE_MAP_INDIC.

After this, you can preprocess the data by adding the same tag and fine-tune using the scripts provided in the repository.
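
For concreteness, a minimal sketch of that dictionary edit; the dictionary path, the junk token, and the awa_Deva tag are placeholders you would adapt to your own setup:

```bash
# Replace one unused/junk token in the source-side fairseq dictionary with the
# new language tag, keeping the original count so the vocabulary size is unchanged.
DICT=path/to/dict.SRC.txt      # hypothetical path to the src-side fairseq dictionary
JUNK_TOKEN="你"                 # an unused token you identified in the dictionary
NEW_TAG="awa_Deva"             # the language tag you intend to use

sed -i "s/^${JUNK_TOKEN} /${NEW_TAG} /" "$DICT"
```

The corresponding entry for the new language in FLORES_CODE_MAP_INDIC should follow the format of the existing entries in that mapping.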

singhakr commented 3 weeks ago

Though I have been able to run example.py using the HF2 installation, I am not able to get 'source install.sh' to work in IndicTrans2 itself.

It may be an issue related to the Python version, the SLURM settings, or perhaps the Conda setup.

Is there a way to easily finetune on a different language from the HF2 installation?

Sab8605 commented 1 week ago

Thank you for inquiring about the details of the dataset and architecture.

For translation to a new language, I have a dataset containing 100 to 110 million sentence pairs from English to the target language. The sentences have an average length of 35 words, with a minimum of 7 words and a maximum of 55 to 65 words. Could you please guide me on the hardware requirements for training a model from scratch on this dataset for this language pair, using the same model architecture as IndicTrans2?

PranjalChitale commented 1 week ago

If you're only working with a single language pair, a 1 billion parameter model might be excessive and wasteful for your use case.

Instead, I recommend using our distilled model architecture (transformer_base18L), which could be more appropriate.

On 4 × A100 40 GB GPUs, it should achieve convergence in about 3-4 days.
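
For reference, a rough sketch of what a training invocation with that architecture could look like; every path and hyperparameter below is illustrative, and the actual values should come from the training/fine-tuning scripts in this repository:

```bash
# Illustrative fairseq-train call for the distilled 18-layer base architecture.
# Assumes the binarized data lives in en_xx_binarized and that the repo's custom
# architecture definitions are importable (e.g. via --user-dir in the real scripts).
fairseq-train en_xx_binarized \
    --arch transformer_base18L \
    --task translation \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --adam-betas "(0.9, 0.98)" \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 8192 --update-freq 4 \
    --dropout 0.2 \
    --fp16 \
    --save-dir checkpoints/en_xx_base18L
```

fairseq uses all visible GPUs by default, so the same command covers the 4 × A100 setup mentioned above.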