Closed Sab8605 closed 3 days ago
Of course, you can use this repository.
For best performance, I recommend fine-tuning rather than training from scratch, especially if the languages you plan to work with share a script with the currently supported set.
As for hardware specifications, they will vary based on the scale of your data, the model architecture you choose, and whether you are fine-tuning or starting from scratch.
So, if I want to train, or preferably fine-tune, on a language pair that is not supported, all I have to do is specify the ISO codes for src_lang and tgt_lang as options to the finetune.sh script? This seems to be the case to me. Could you please confirm?
If the language has sufficient coverage in the current vocabulary (say, Devanagari languages like Awadhi, Bhojpuri, Magahi, etc.), then no vocabulary extension is needed. The easiest approach is to edit the fairseq dictionary (source side) and replace an unused token with the language tag you intend to use (you may find some junk tokens, such as stray Chinese characters, which are ideal candidates for replacement). You would also have to add this language code to FLORES_CODE_MAP_INDIC.
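The dictionary edit described above can be sketched as follows. This is a minimal illustration, assuming the fairseq dictionary is a plain-text file with one `<token> <count>` entry per line; the tag `awa_Deva` and the stray CJK character are purely hypothetical examples, not values from the repository:

```python
def replace_token(dict_lines, junk_token, new_tag):
    """Swap one unused token in a fairseq dictionary for a new language tag.

    Each fairseq dict line has the form '<token> <count>'; we keep the
    count and replace only the token itself, so the vocabulary size and
    embedding indices stay unchanged.
    """
    out = []
    for line in dict_lines:
        token, _, count = line.partition(" ")
        out.append(f"{new_tag} {count}" if token == junk_token else line)
    return out

# Hypothetical usage: sacrifice a junk CJK token for the new tag.
dict_lines = ["▁the 9876", "中 3", "▁of 8765"]
edited = replace_token(dict_lines, "中", "awa_Deva")
```

Keeping the count field intact matters: fairseq maps dictionary lines to embedding rows by position, so replacing a token in place reuses its existing embedding slot.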
After this, you can preprocess the data by adding the same tag, and fine-tune using the scripts provided in the repository.
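The tagging step mentioned above can be sketched like this. It is a minimal illustration under the assumption that preprocessing prepends the source and target language tags to each source-side sentence; verify the exact tag placement against the repository's own preprocessing scripts:

```python
def tag_sentences(sentences, src_tag, tgt_tag):
    """Prepend '<src_tag> <tgt_tag>' to each source-side sentence."""
    return [f"{src_tag} {tgt_tag} {sent}" for sent in sentences]

# Hypothetical tags for an English-to-Awadhi pair.
tagged = tag_sentences(["this is a test ."], "eng_Latn", "awa_Deva")
```

The tag here must be byte-for-byte identical to the token you placed in the fairseq dictionary, or it will be mapped to `<unk>` at binarization time.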
Though I have been able to run example.py using the HF2 installation, I am not able to get `source install.sh` working in IndicTrans2 itself.
It may be an issue related to the Python version, the SLURM settings, or perhaps the Conda setup.
Is there a way to easily fine-tune on a different language from the HF2 installation?
Thank you for inquiring about the details of the dataset and architecture.
For a new language translation task, I have a dataset of 100 to 110 million sentence pairs for English to the target language. The sentences average 35 words, with a minimum of 7 and a maximum of 55 to 65 words. Could you please guide me on the hardware requirements for fine-tuning a model on this dataset for this language pair, or for training from scratch with the same model architecture as IndicTrans2?
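For a rough sense of scale, the figures above work out as follows. This is only a back-of-envelope word count (taking the midpoint of the quoted range); actual subword token counts after SentencePiece/BPE segmentation will be noticeably higher:

```python
pairs = 105_000_000   # midpoint of the 100-110M sentence pairs quoted above
avg_words = 35        # average words per sentence
words_per_side = pairs * avg_words
# Roughly 3.7 billion words on each side of the corpus.
```

A corpus of this size is comparable to what large multilingual MT models are trained on, which is why the hardware answer depends so heavily on architecture choice and batch size rather than on the data alone.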
If you're only working with a single language pair, a 1 billion parameter model might be excessive and wasteful for your use case.
Instead, I recommend using our distilled model architecture (`transformer_base18L`), which could be more appropriate.
On 4 × A100 40 GB GPUs, it should converge in about 3 to 4 days.
First, thank you for the excellent work on this project; it has been invaluable for our tasks. Can we use the same repository and architecture to train models for other Indic languages (English to Indic)? If yes, then:
Thank you for your time and support!