AI4Bharat / IndicVoices-R

A Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
Creative Commons Attribution 4.0 International

Training VoiceCraft with IndicVoices-R #5

Closed: meets2tarun closed this issue 1 month ago

meets2tarun commented 2 months ago

Hi,

Thanks for this outstanding work and for achieving this significant milestone in supporting Indian languages.

I would like to train VoiceCraft on all the languages in IndicVoices-R. I plan to start with the Hindi subset and the voice-cloning functionality.

So far, to understand the pipeline, I have tried reproducing the VoiceCraft training on the GigaSpeech XL dataset.

It would be great if I could get some help from @AshwinSankar17.

Thanks,
Tarun

AshwinSankar17 commented 2 months ago

The first step is to extend the vocab with your grapheme set if you do not want to use phonemes. You can use this script to do that. Alternatively, if you want to keep using phonemes, you can use the espeak tokenizer or the tokenizer from XPhoneBERT.
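As a rough illustration of the grapheme route, the sketch below appends unseen characters to an existing token-to-id mapping while leaving the old ids untouched. It is not the script linked above. The function name `extend_vocab`, the toy vocab, and the choice to treat each Unicode code point as one token and to skip whitespace are all assumptions made for the example.

```python
from typing import Dict, Iterable


def extend_vocab(old_vocab: Dict[str, int], transcripts: Iterable[str]) -> Dict[str, int]:
    """Return a copy of old_vocab with unseen characters appended at new ids.

    Assumption: one Unicode code point = one token, whitespace is skipped.
    """
    vocab = dict(old_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    # Collect every character that the old vocab does not already cover.
    new_symbols = sorted(
        {ch for text in transcripts for ch in text
         if not ch.isspace() and ch not in vocab}
    )
    # Old ids stay fixed so pretrained embedding rows remain aligned.
    for ch in new_symbols:
        vocab[ch] = next_id
        next_id += 1
    return vocab


# Toy example: a tiny stand-in for a pretrained vocab plus two Hindi transcripts.
old_vocab = {"<pad>": 0, "a": 1, "b": 2}
hindi = ["नमस्ते", "भारत"]
new_vocab = extend_vocab(old_vocab, hindi)
```

Keeping the original ids unchanged matters for the later embedding step, since the pretrained rows can then be copied over by position.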

Next, you need to change phonemize_encodec_encode_hf.py so that it loads the Hugging Face dataset from JSON instead of pulling it from the web. Depending on what you chose in the previous step, you may also need to change the tokenization approach here.

Next, in steps/trainer.py, you need to extend the embedding table, initializing it from the old embedding table so that the model still benefits somewhat from the pretraining. This is what it would look like.
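The general idea of extending an embedding table can be sketched in PyTorch as follows. This is a generic illustration, not the code referred to above and not the actual contents of steps/trainer.py. The helper name `extend_embedding` and the sizes are assumptions, and it presumes the new vocab keeps the old token ids in the same positions.

```python
import torch
import torch.nn as nn


def extend_embedding(old: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Build a larger embedding whose first rows are copied from `old`."""
    old_size, dim = old.weight.shape
    if new_vocab_size < old_size:
        raise ValueError("new vocabulary must not be smaller than the old one")
    new = nn.Embedding(new_vocab_size, dim)
    with torch.no_grad():
        # Rows for the original tokens keep their pretrained values;
        # the remaining rows keep nn.Embedding's default random init.
        new.weight[:old_size].copy_(old.weight)
    return new


# Stand-in for a pretrained text embedding (sizes are illustrative only).
pretrained = nn.Embedding(100, 16)
extended = extend_embedding(pretrained, 180)
```

If the pretrained table reserves specific rows for special tokens such as padding, those rows may need to be remapped as well. That case is not handled in this sketch.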

AshwinSankar17 commented 1 month ago

Closing this if there are no further questions.