CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Fine-tuning for hindi #525

Closed hetpandya closed 3 years ago

hetpandya commented 4 years ago

Hi @blue-fish , I am trying to fine-tune the model to clone voices of Hindi speakers. I wanted to know the steps to follow and also the amount of data I'd need for the model to work well.

Edit - I shall use google colab for fine-tuning

ghost commented 4 years ago

Hi @thehetpandya , please start by reading this: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/431#issuecomment-673555684

It is not possible to fine-tune the English model to another language. A new model needs to be trained from scratch. This is because the model relates the letters of the alphabet to their associated sounds, so what the model knows about English does not transfer over to Hindi. At a minimum, you will need a total of 50 hours of transcribed speech from at least 100 speakers. For a better model, get 10 times this amount.

This is what you need to do. Good luck and have fun!

  1. Replicate the training of the English synthesizer to learn how to use the data processing and training scripts.
    • I have no idea how to do this with Google colab, but it should be possible.
  2. Assemble and preprocess your dataset (a layout sketch follows this list)
  3. Train a synthesizer model
  4. Troubleshoot problems with the model.
  5. Repeat steps 3 and 4 until satisfied
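
To make step 2 concrete, here is a minimal sketch of arranging your own (audio, transcript) pairs into the LibriSpeech-style tree that the repo's synthesizer_preprocess_audio.py consumes. The CSV format, the speaker/book IDs, and the helper name are placeholders of mine; verify the exact layout the script expects before relying on this.

```python
# Sketch: arrange (wav, transcript) pairs into a LibriSpeech-like tree for
# synthesizer_preprocess_audio.py. pairs_csv rows: path_to_wav,transcript
import csv
import os
import soundfile as sf

def build_librispeech_tree(pairs_csv, datasets_root, speaker="0001"):
    book = "000001"
    out_dir = os.path.join(datasets_root, "LibriSpeech", "train-clean-100",
                           speaker, book)
    os.makedirs(out_dir, exist_ok=True)
    trans_path = os.path.join(out_dir, f"{speaker}-{book}.trans.txt")
    with open(pairs_csv, newline="", encoding="utf-8") as f_in, \
         open(trans_path, "w", encoding="utf-8") as f_out:
        for i, (wav_path, text) in enumerate(csv.reader(f_in)):
            utt_id = f"{speaker}-{book}-{i:04d}"
            data, sr = sf.read(wav_path)
            sf.write(os.path.join(out_dir, utt_id + ".flac"), data, sr)
            # LibriSpeech transcripts: one "UTT_ID TEXT" line per utterance
            f_out.write(f"{utt_id} {text.strip()}\n")
```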
hetpandya commented 4 years ago

@blue-fish Thanks a lot for the response! Yes, I have begun exploring the issues to get a better understanding of the workflow before beginning the training process.

I also read in #492 (comment) that training the synthesizer first would be a good start, and that one should proceed to training/fine-tuning the encoder only if it doesn't seem to give proper results. Does the same apply to a totally different language too, like in my case, i.e. Hindi?

ghost commented 4 years ago

I agree with that suggestion. Encoder training requires a lot of data, time and effort. You can see #126 and #458 to get an idea. If your results are good enough without it, best to avoid that hassle.

lawrence124 commented 4 years ago

@thehetpandya

I'm working on a forked version of sv2tts to train a local dialect of Chinese. Using the dataset from Common Voice (about 22k utterances), I couldn't get the model to converge. But if I add the local dialect on top of a model pre-trained on the main dialect of Chinese, the result is actually quite good. FYI, the local dialect and the main dialect have different but similar alphabet romanization systems (for example, the main dialect has 4 tones, but the local dialect has 8).

Using Common Voice data only: [image]

Using the pre-trained model and then adding the local dataset: [image]

@blue-fish not sure if I'm abusing the model, but at least it works :)
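
For what it's worth, the warm-start idea above can be sketched generically: load the checkpoint trained on the main dialect, keep every weight whose shape still matches, and continue training on the local-dialect data. This assumes a PyTorch model and a checkpoint key of my own naming; the fork's actual code may differ.

```python
# Generic warm-start sketch: reuse compatible weights from a pre-trained
# checkpoint, then continue training on the new dialect's data.
import torch

def warm_start(model: torch.nn.Module, checkpoint_path: str) -> torch.nn.Module:
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    state = ckpt.get("model_state", ckpt)  # "model_state" key is an assumption
    model_state = model.state_dict()
    # Keep only tensors whose name and shape match; e.g. skip a character
    # embedding that grew to cover the local dialect's extra tone symbols.
    compatible = {k: v for k, v in state.items()
                  if k in model_state and v.shape == model_state[k].shape}
    model_state.update(compatible)
    model.load_state_dict(model_state)
    return model
```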

ghost commented 4 years ago

@lawrence124 Interesting, thanks for sharing that result! Occasionally the model fails to learn attention; you might try restarting the training from scratch with a different random seed. It might also help to trim the starting and ending silences. If your data is at 16 kHz then webrtcvad can do that for you (see the trim_long_silences function in encoder/audio.py).
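
For reference, here is a minimal sketch of that kind of edge-silence trimming with webrtcvad. It is in the spirit of trim_long_silences in encoder/audio.py, not a copy of it; the frame size and aggressiveness are assumptions to tune.

```python
# Sketch: trim leading/trailing silences with webrtcvad (16 kHz mono audio).
import struct
import numpy as np
import webrtcvad

SAMPLE_RATE = 16000                    # webrtcvad supports 8/16/32/48 kHz
FRAME_LEN = SAMPLE_RATE * 30 // 1000   # frames must be 10, 20 or 30 ms

def trim_edge_silences(wav: np.ndarray, aggressiveness: int = 3) -> np.ndarray:
    """`wav` is a float array in [-1, 1]. Returns it without edge silences."""
    vad = webrtcvad.Vad(aggressiveness)
    wav = wav[:len(wav) // FRAME_LEN * FRAME_LEN]       # whole frames only
    pcm = struct.pack(f"{len(wav)}h",
                      *np.round(wav * 32767).astype(np.int16))
    flags = [vad.is_speech(pcm[f * FRAME_LEN * 2:(f + 1) * FRAME_LEN * 2],
                           SAMPLE_RATE)                 # 2 bytes per sample
             for f in range(len(wav) // FRAME_LEN)]
    if not any(flags):
        return wav
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    return wav[first * FRAME_LEN:(last + 1) * FRAME_LEN]
```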

hetpandya commented 4 years ago

Thanks @blue-fish I went through the issues you mentioned. You gave me a good amount of resources for a start. Much appreciated!

hetpandya commented 4 years ago

@lawrence124 Glad to see your results! Did you have to train the encoder from scratch, or did using the pre-trained decoder/synthesizer work for you?

lawrence124 commented 4 years ago

I'm using the pretrained encoder from Kuangdd, but judging by the file size and date, it seems to be the same as the pretrained encoder from here.

hetpandya commented 4 years ago

Okay, thanks @lawrence124! Seems like using the pretrained encoder is good to go for now.

lawrence124 commented 4 years ago

btw, I modified a script from adueck a bit. The script converts video/audio with an SRT file into audio clips with transcripts for training. I'm not quite sure about the exact format sv2tts expects, but you may find it useful if you are trying to get more data to train on.

https://github.com/adueck/split-video-by-srt

srt-split.zip
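
In the same vein, here is a minimal sketch of slicing an audio file into utterance clips using its .srt timestamps. This is my own sketch in the spirit of adueck/split-video-by-srt, not the linked script; it assumes pysrt and pydub are installed (pydub also needs ffmpeg for decoding).

```python
# Sketch: cut utterance clips out of an audio file using its .srt subtitles,
# writing a "filename|transcript" line per clip for later preprocessing.
import os
import pysrt
from pydub import AudioSegment

def split_by_srt(audio_path: str, srt_path: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    audio = AudioSegment.from_file(audio_path)
    subs = pysrt.open(srt_path)
    with open(os.path.join(out_dir, "transcripts.txt"), "w",
              encoding="utf-8") as f:
        for i, sub in enumerate(subs):
            # .ordinal is the timestamp in milliseconds; pydub slices by ms
            clip = audio[sub.start.ordinal:sub.end.ordinal]
            name = f"utterance_{i:05d}.wav"
            clip.export(os.path.join(out_dir, name), format="wav")
            f.write(f"{name}|{sub.text.strip()}\n")
```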

lawrence124 commented 4 years ago

@blue-fish

would like to ask a rather random question... have you tried the demo TTS from https://www.readspeaker.com/ ?

From my point of view, the result in Chinese/Cantonese is pretty good and I would like to discuss: is their proprietary algorithm simply superior, or do they simply have the resources to build a better dataset to train on?

Based on the job description, what they are doing is not too different from tacotron / sv2tts.

https://www.isca-speech.org/iscapad/iscapad.php?module=article&id=17363&back=p,250

ghost commented 4 years ago

@lawrence124 That website demo uses a different algorithm that probably does not involve machine learning. It sounds like a concatenative method of synthesis, where prerecorded sounds are joined together. Listening closely, it is unnatural and obviously computer-generated. To their credit, they do use high-quality audio samples to build the output.

Here's a wav of the demo text synthesized by zhrtvc, using Griffin-Lim as the vocoder. Tacotron speech flows a lot more smoothly than their demo. zhrtvc could sound better than the demo TTS if 1) it is trained on higher quality audio, and 2) a properly configured vocoder is available.
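
For context, Griffin-Lim is a phase-estimation algorithm rather than a learned vocoder, so it can be run with a few lines of librosa. The sketch below is generic, not zhrtvc's exact pipeline, and the mel parameters are assumptions that must match whatever the synthesizer produced.

```python
# Sketch: invert a (power) mel spectrogram back to audio via Griffin-Lim.
import librosa
import numpy as np
import soundfile as sf

def mel_to_wav(mel: np.ndarray, sr: int = 22050, n_fft: int = 1024,
               hop_length: int = 256) -> np.ndarray:
    # mel_to_audio estimates the missing phase with Griffin-Lim internally
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length, n_iter=60)

# Usage: sf.write("demo.wav", mel_to_wav(mel), 22050)
```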

lawrence124 commented 4 years ago

@blue-fish yea, as with other data analysis, getting a good/clean dataset is always difficult (the preliminary result of adding YouTube clips is not good).

20200915-204053_melgan_10240ms.zip

This is an example of using "mandarin + cantonese" as the synthesizer, along with the MelGAN vocoder. I don't know if it is just my ear, but I don't really like the Griffin-Lim from zhrtvc; it has a "robotic" noise in the background.

btw, seems like you are updating the synthesizer of sv2tts? Is the backbone still tacotron?
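
As an aside for anyone reading along, a pre-trained MelGAN can be tried in a few lines. The sketch below assumes seungwonpark's MelGAN, which exposes a torch.hub entry point; it may not be the exact vocoder used in the attachment above, and the mel input must match that repo's expected format.

```python
# Sketch: vocode a mel spectrogram with a pre-trained MelGAN via torch.hub.
# Assumes seungwonpark/melgan; mel shape is [1, 80, T] in its log-mel scaling.
import torch

vocoder = torch.hub.load("seungwonpark/melgan", "melgan")
vocoder.eval()

mel = torch.randn(1, 80, 234)  # placeholder; use your synthesizer's output
with torch.no_grad():
    audio = vocoder.inference(mel)  # 1-D waveform tensor at 22.05 kHz
```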

hetpandya commented 4 years ago

> btw, I modified a script from adueck a bit. The script converts video/audio with an SRT file into audio clips with transcripts for training. I'm not quite sure about the exact format sv2tts expects, but you may find it useful if you are trying to get more data to train on.
>
> https://github.com/adueck/split-video-by-srt
>
> srt-split.zip

@lawrence124 thanks, I shall take a look at it since I might need more data if I cannot find any public dataset.

GauriDhande commented 4 years ago

@thehetpandya were you able to generate the model for cloning Hindi sentences?

hetpandya commented 4 years ago

@GauriDhande I'm still looking for a good Hindi speech dataset. Do you have any sources?

GauriDhande commented 4 years ago

I was going to ask the same thing. I haven't found an open Hindi speech dataset on the internet yet.

ghost commented 4 years ago

You might be able to combine the two sources below. First train a single-speaker model on source 1, then tune the voice cloning aspect on source 2. Some effort and experimentation will be required.

Source 1 (24 hours, single speaker): https://cvit.iiit.ac.in/research/projects/cvit-projects/text-to-speech-dataset-for-indian-languages
Source 2 (100 voices, 6 utterances each, untranscribed): https://github.com/shivam-shukla/Speech-Dataset-in-Hindi-Language

hetpandya commented 4 years ago

Thanks @blue-fish, I've already applied for Source 1. Will also check out the second one. Your efforts on this project are much appreciated!

ghost commented 4 years ago

Hi @thehetpandya , have you made any progress on this recently?

hetpandya commented 4 years ago

Hi @blue-fish , no, I couldn't make progress on this one. I tried fine-tuning https://github.com/Kyubyong/dc_tts instead, which gave clearer pronunciation of Hindi words. Edit - I fine-tuned https://github.com/Kyubyong/dc_tts on Source 1, i.e. https://cvit.iiit.ac.in/research/projects/cvit-projects/text-to-speech-dataset-for-indian-languages

ghost commented 3 years ago

Thanks for trying @thehetpandya . If you decide to work on this later please reopen the issue and I'll try to help.

amrahsmaytas commented 3 years ago

Greetings @thehetpandya

Were you able to do real-time voice cloning of given English text, in an Indian accent, with your experiment?

Could you please help/guide me with cloning English text in my voice with an Indian accent?

Thanks

hetpandya commented 3 years ago

Hi @amrahsmaytas , no, I couldn't get good results, and then I had to shift to another task. Still, I'd be glad if I could be of any help.

amrahsmaytas commented 3 years ago

> Hi @amrahsmaytas , no, I couldn't get good results, and then I had to shift to another task. Still, I'd be glad if I could be of any help.

Thanks for the reply, Het! I need your help with training. Could you please check your mail (sent from greetsatyamsharma@gmail.com) and connect with me there for further discussions?

Thanks ✌, awaiting your response. Satyam.

rajuc110 commented 3 years ago

@GauriDhande and @thehetpandya were you able to generate the model for cloning Hindi sentences? Please reply.

Thanks.

hetpandya commented 3 years ago

Hi @rajuc110 , sorry for the delayed response. No, I couldn't reproduce the results in Hindi and had to shift to another task in the meantime.

SayaliNagwkar17 commented 2 years ago

> Hi @blue-fish , no, I couldn't make progress on this one. I tried fine-tuning https://github.com/Kyubyong/dc_tts instead, which gave clearer pronunciation of Hindi words. Edit - I fine-tuned https://github.com/Kyubyong/dc_tts on Source 1, i.e. https://cvit.iiit.ac.in/research/projects/cvit-projects/text-to-speech-dataset-for-indian-languages

Can you share your work?

SohumKaliaCoder commented 1 year ago

I am also facing this issue. Does anyone have an update on it?

divyendrajadoun commented 1 year ago

Hey guys, has anyone found a solution for Hindi voice cloning? Thanks

Harsh-Holy9 commented 5 months ago

Has anybody already trained a model for the Hindi language?

Chetan-5ehgal commented 5 months ago

Any progress on training real-time voice cloning on a Hindi dataset?