Closed: hetpandya closed this issue 3 years ago.
Hi @thehetpandya , please start by reading this: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/431#issuecomment-673555684
It is not possible to finetune the English model to another language. A new model needs to be trained from scratch. This is because the model relates the letters of the alphabet to their associated sounds, so what the model knows about English does not transfer over to Hindi. At a minimum, you will need a total of 50 hours of transcribed speech from at least 100 speakers. For a better model get 10 times this number.
This is what you need to do. Good luck and have fun!
@blue-fish Thanks a lot for the response! Yes, I have begun exploring the issues for now for a better understanding of the workflow before beginning with the training process.
I also read in #492 (comment) that training the synthesizer first is a good starting point, and that one should only proceed to training/fine-tuning the encoder if the results don't seem right. Does the same apply to a totally different language, like Hindi in my case?
I agree with that suggestion. Encoder training requires a lot of data, time and effort. You can see #126 and #458 to get an idea. If your results are good enough without it, best to avoid that hassle.
@thehetpandya
I'm working on a forked version of sv2tts to train a local dialect of Chinese. Using the dataset from Common Voice (about 22k utterances), I couldn't get the training to converge. But if I add the local dialect on top of a pre-trained model (the main dialect of Chinese), the result is actually quite good. FYI, the local dialect and the main dialect have different but similar alphabet romanization systems (for example, the main dialect has 4 tones, but the local dialect has 8).
using common voice data only:
using pre-trained and then add local dataset:
@blue-fish not sure if i'm abusing the model, but at least it works :)
@lawrence124 Interesting, thanks for sharing that result! Occasionally the model fails to learn attention; you might try restarting the training from scratch with a different random seed. It might also help to trim the starting and ending silences. If your data is at 16 kHz, then webrtcvad can do that for you (see the trim_long_silences function in encoder/audio.py).
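For anyone curious what edge trimming does, here is a minimal sketch of the idea using a plain RMS-energy threshold. This is only an illustration, not the repo's actual method (trim_long_silences uses webrtcvad, which is far more robust), and the frame size and threshold values are assumptions:

```python
import numpy as np

def trim_edge_silence(wav: np.ndarray, sample_rate: int = 16000,
                      frame_ms: int = 30, threshold: float = 0.01) -> np.ndarray:
    """Drop leading/trailing frames whose RMS energy falls below `threshold`.

    A crude stand-in for VAD-based trimming: real VAD is more robust,
    but the principle (remove low-energy edge frames) is the same.
    """
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(wav) // frame_len
    if n_frames == 0:
        return wav
    frames = wav[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    voiced = np.flatnonzero(rms >= threshold)
    if voiced.size == 0:
        return wav[:0]  # the whole clip is below threshold
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return wav[start:end]
```

Unlike a proper VAD, a fixed energy threshold will misfire on noisy recordings, which is exactly why the repo relies on webrtcvad instead.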
Thanks @blue-fish I went through the issues you mentioned. You gave me a good amount of resources for a start. Much appreciated!
@lawrence124 Glad to see your results! Did you have to train the encoder from scratch? Or using the pre-trained decoder/synthesizer worked for you?
I'm using the pretrained encoder from Kuangdd, but judging by the file size and date, it seems to be the same as the pretrained encoder from here.
Okay, thanks @lawrence124 ! Seems like using the pretrained encoder is good to go for now.
btw, I modified a script from adueck a bit. This script converts video/audio with an SRT file into audio clips with transcripts for training. I'm not quite sure about the format sv2tts expects, but I think you may find it useful if you're trying to get more data to train on.
@blue-fish
Would like to ask a rather random question: have you tried the demo TTS from https://www.readspeaker.com/ ?
From my point of view, the result in Chinese/Cantonese is pretty good and I'd like to discuss: is their proprietary algorithm simply superior, or do they simply have the resources to build a better dataset to train on?
based on the job description, what they are doing is not too different from tacotron / sv2tts
https://www.isca-speech.org/iscapad/iscapad.php?module=article&id=17363&back=p,250
@lawrence124 That website demo uses a different algorithm that probably does not involve machine learning. It sounds like a concatenative method of synthesis, where prerecorded sounds are joined together. Listening closely, it is unnatural and obviously computer-generated. To their credit, they do use high-quality audio samples to build the output.
Here's a wav of the demo text synthesized by zhrtvc, using Griffin-Lim as the vocoder. Tacotron speech flows a lot more smoothly than their demo. zhrtvc could sound better than the demo TTS if 1) it is trained on higher quality audio, and 2) a properly configured vocoder is available.
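For reference, the Griffin-Lim algorithm mentioned above is simple to sketch: start from the target STFT magnitude with random phase, then alternate inverse-STFT and STFT, keeping the target magnitude and adopting the rebuilt phase each round. The version below is only an illustration using scipy; the STFT parameters are arbitrary choices, not zhrtvc's actual settings:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=32, nperseg=512, noverlap=384, seed=0):
    """Recover a waveform from an STFT magnitude (Griffin & Lim, 1984)
    by iterating: synthesize, re-analyze, keep target magnitude,
    adopt rebuilt phase."""
    rng = np.random.default_rng(seed)
    # start from random phase
    spec = magnitude * np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        _, wav = istft(spec, nperseg=nperseg, noverlap=noverlap)
        _, _, rebuilt = stft(wav, nperseg=nperseg, noverlap=noverlap)
        # frame counts can differ by one after the round trip; crop to match
        frames = min(rebuilt.shape[1], magnitude.shape[1])
        magnitude = magnitude[:, :frames]
        spec = magnitude * np.exp(1j * np.angle(rebuilt[:, :frames]))
    _, wav = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return wav
```

Because the phase is only ever estimated, Griffin-Lim output tends to have the metallic, "robotic" quality discussed in this thread, which is why neural vocoders (WaveRNN, MelGAN) usually sound better.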
@blue-fish yea, as with other data analysis, getting a good/clean dataset is always the difficult part. (The preliminary result of adding YouTube clips is not good.)
20200915-204053_melgan_10240ms.zip
This is an example of using "mandarin + cantonese" as the synthesizer, along with the MelGAN vocoder. I don't know if it's just my ears, but I don't really like the Griffin-Lim output from zhrtvc; it has a "robotic" noise in the background.
btw, it seems like you are updating the synthesizer of sv2tts? Is the backbone still Tacotron?
@lawrence124 thanks I shall take a look at it since I might need more data if I cannot find any public dataset
@thehetpandya were you able to generate the model for cloning hindi sentences?
@GauriDhande I'm still looking for a good hindi speech dataset. Do you have any sources?
Was going to ask the same thing. I haven't found an open Hindi speech dataset on the internet yet.
You might be able to combine the two sources below. First train a single-speaker model on source 1, then tune the voice cloning aspect on source 2. Some effort and experimentation will be required.
Source 1 (24 hours, single speaker): https://cvit.iiit.ac.in/research/projects/cvit-projects/text-to-speech-dataset-for-indian-languages
Source 2 (100 voices, 6 utterances each, untranscribed): https://github.com/shivam-shukla/Speech-Dataset-in-Hindi-Language
Thanks @blue-fish, I've already applied for Source 1. Will also check out the second one. Your efforts on this project are much appreciated!
Hi @thehetpandya , have you made any progress on this recently?
Hi @blue-fish , no, I couldn't make progress on this one. I tried fine-tuning https://github.com/Kyubyong/dc_tts instead, which gave clearer pronunciation of Hindi words. Edit: I fine-tuned https://github.com/Kyubyong/dc_tts on Source 1, i.e. https://cvit.iiit.ac.in/research/projects/cvit-projects/text-to-speech-dataset-for-indian-languages
Thanks for trying @thehetpandya . If you decide to work on this later please reopen the issue and I'll try to help.
Greetings @thehetpandya
Were you able to do real-time voice cloning for given English text with an Indian accent in your experiments?
Could you please help/guide me with voice cloning of English text in my voice, with an Indian accent?
Thanks
Hi @amrahsmaytas , no, I couldn't get good results, and then I had to shift to another task. Still, I'd be glad if I could be of any help.
Thanks for the reply, Het! I need your help with training. Could you please check your mail (sent from greetsatyamsharma@gmail.com) and connect with me there for further discussion?
Thanks ✌, awaiting your response. Satyam.
@GauriDhande and @thehetpandya were you guys able to generate the model for cloning Hindi sentences? Please reply.
Thanks.
Hi @rajuc110 , sorry for the delayed response. No, I couldn't reproduce the results in hindi and had to shift to another task meanwhile.
Can you share your work?
I am also facing this issue. Does anyone have an update on it?
Hey guys, has anyone found a solution for Hindi voice cloning? Thanks
Has anybody already trained a model for the Hindi language?
Has any progress been made on training Real-Time-Voice-Cloning on a Hindi dataset?
Hi @blue-fish , I am trying to fine-tune the model to clone voices of hindi speakers. I wanted to know the steps to follow for the same and also the amount of data I'd need for the model to work well.
Edit - I shall use google colab for fine-tuning