CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training Voice Cloning model for another language #492

Closed rlutsyshyn closed 4 years ago

rlutsyshyn commented 4 years ago

Hi! I already know how to train the synthesizer and vocoder, and how to create a relevant dataset. But if I want to train a voice cloning model for another language, e.g. Ukrainian, what else should I do?

ghost commented 4 years ago

Update synthesizer/utils/symbols.py to contain all valid characters in your text transcripts (the characters you want to train on). This is an example for Swedish: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/3eb96df1c6b4b3e46c28c6e75e699bffc6dd43be

However, be careful: in order for someone to run the model you've created they will also need to make the same changes to the file. I spent hours learning this the hard way trying to use the model in #257 because the creator was unavailable to help.
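A quick way to see which characters need adding is to compare the transcripts against the current symbol set. The sketch below is only illustrative: `missing_characters` is a hypothetical helper, and `english_symbols` is an assumed stand-in for whatever list synthesizer/utils/symbols.py actually defines, not a copy of it.

```python
def missing_characters(transcripts, symbols):
    """Return the sorted characters used in transcripts but absent from symbols."""
    known = set(symbols)
    used = set()
    for text in transcripts:
        used.update(text)
    return sorted(used - known)


# Assumed English-only symbol set (illustrative, not copied from the repo).
english_symbols = list(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),-.:;? "
)

# Two made-up Ukrainian transcript lines.
ukrainian_lines = ["Привіт, світе!", "Добрий день."]

# Every Cyrillic letter printed here would have to be added to the symbol list.
print(missing_characters(ukrainian_lines, english_symbols))
```

Running the same check on the machine that will do inference is one way to catch the mismatch described above before loading someone else's model.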

rlutsyshyn commented 4 years ago

Thank you very much! Will try :)

rlutsyshyn commented 4 years ago

Can you also tell me: can I somehow fine-tune a pretrained model on some new voice samples without full retraining?

ghost commented 4 years ago

Yes, you can resume training on a pretrained model using a different dataset. The main use for this is single-speaker finetuning (process and examples in #437) but you could also finetune multi-speaker using the same process.

One more thing to add, the speaker encoder is trained on English and may not work well for other languages. If you have a large number of voice samples in your target language, you may wish to train a new encoder or at least finetune an existing one. (Data preprocessing for encoder is not a smooth process so set your expectations accordingly).

There are some very good speaker encoders shared in #126 but the model size of 768 is too big to be practical for cloning. You can use this process to import the relevant weights from the model and finetune to a more useful dimension: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-673341585

Ananas120 commented 4 years ago

Hello, I will also try to train a voice cloning model in another language (French for me), and I have some tips for you, if they can help:

Good luck with training!

ghost commented 4 years ago

"the encoder is trained in English so don't know if it is portable for other languages"

The English encoder works all right for Swedish. There's info on setting it up and samples in #257 . Since encoder training is very intensive, you should just try it (either jump straight to synth preprocess and training, or do some speaker verification with Ukrainian utterances to see how well it performs).
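A rough speaker-verification check could compare cosine similarity between embeddings of the same speaker and of different speakers. This is a minimal sketch under stated assumptions: the embeddings are taken as already computed by the encoder and passed in as plain lists of floats, and `cosine_similarity` and `verification_margin` are hypothetical helper names, not functions from the repo.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verification_margin(same_speaker_pairs, different_speaker_pairs):
    """Mean same-speaker similarity minus mean different-speaker similarity."""
    same = [cosine_similarity(a, b) for a, b in same_speaker_pairs]
    diff = [cosine_similarity(a, b) for a, b in different_speaker_pairs]
    return sum(same) / len(same) - sum(diff) / len(diff)


# Toy 3-dimensional vectors standing in for real encoder embeddings.
speaker_a = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
speaker_b = [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]

margin = verification_margin(
    same_speaker_pairs=[(speaker_a[0], speaker_a[1]), (speaker_b[0], speaker_b[1])],
    different_speaker_pairs=[(speaker_a[0], speaker_b[0]), (speaker_a[1], speaker_b[1])],
)
print(margin)
```

A clearly positive margin on utterances in the target language would suggest the encoder separates those speakers reasonably well; a margin near zero would be a reason to consider finetuning it.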

rlutsyshyn commented 4 years ago

Thanks guys! Will try :)

ghost commented 4 years ago

@rlutsyshyn How is progress on your synthesizer model?

rlutsyshyn commented 4 years ago

@blue-fish Just collecting a lot of data :)

afantasialiberal commented 4 years ago

Hello, I speak Spanish. Is there a tutorial for training it on my language? Sorry, I am a total noob with this, but it is a very fun project.

ghost commented 4 years ago

@afantasialiberal Please see https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/431#issuecomment-673555684 for a general outline of the process. There is no tutorial available at this time.

rlutsyshyn commented 4 years ago

Hey, I have a new issue when trying to run vocoder_preprocess. Preprocessing "starts" but runs 0 iterations (without any error). I have datasets/SV2TTS/vocoder/mels_gta but it is empty, and datasets/SV2TTS/vocoder/synthesized.txt is also empty... Maybe I missed something? I just fine-tuned the pretrained model on my own data (with the synthesizer there were no problems).

ghost commented 4 years ago

@rlutsyshyn Do you still have that issue with vocoder preprocess?

rlutsyshyn commented 4 years ago

@blue-fish I have an issue with the synthesizer now :) When I use 48 kHz audio and calculate the parameters in synthesizer/hparams.py, after fine-tuning my voice sounds like Alvin and the Chipmunks (very, very fast)... Maybe you have some advice for this case? What are the main parameters to configure to get a normal voice in the output?

Ananas120 commented 4 years ago

Just to be sure: if you train the synthesizer to create 48 kHz mel spectrograms, you should also train the vocoder to generate 48 kHz audio (because it's trained on 16 kHz audio). Also, you should check that the parameters for the audio player etc. are modified to match your 48 kHz rate.

Good luck!

rlutsyshyn commented 4 years ago

@Ananas120 For the synthesizer, in hparams.py I can modify win_size, hop, etc., but in vocoder/hparams.py I don't see anything like that, so what should I modify to fine-tune my vocoder for 48 kHz data? Thanks :)

Ananas120 commented 4 years ago

Honestly, I don't know; I think blue-fish can help you better with this. If the audio only seems too fast but otherwise sounds good, it may just be a problem with the audio player's sample rate, regardless of the rate of the spectrogram fed to the vocoder (I don't know whether it changes anything for the vocoder if the spectrogram is 16 kHz or 48 kHz). So you could search for where the toolbox calls something like sounddevice.play (sd.play). You could also take the audio the vocoder generates and play it yourself with a 48 kHz rate (with IPython.display.Audio, for example, if you use a Jupyter notebook).
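One way to reason about a sample rate mismatch is to compare how long the same waveform lasts at two playback rates. The sketch below is only an illustration of that arithmetic. The frame count is made up, the helper names are hypothetical, and whether this is the actual cause of the "too fast" output in this thread is an assumption.

```python
def playback_duration(num_samples, playback_rate):
    """Seconds of audio heard when num_samples are played at playback_rate Hz."""
    return num_samples / playback_rate


def speed_factor(generation_rate, playback_rate):
    """How many times faster than intended the audio sounds."""
    return playback_rate / generation_rate


# Hypothetical case: a vocoder producing 200 samples per mel frame
# (the 16 kHz default hop) generates audio for 400 frames.
num_frames = 400
num_samples = num_frames * 200

print(playback_duration(num_samples, 16000))  # intended duration in seconds
print(playback_duration(num_samples, 48000))  # duration if played at 48 kHz
print(speed_factor(16000, 48000))             # resulting speed-up
```

If the measured speed-up matches the ratio of the two rates, the player (or the hparams it reads its rate from) is the first place to look.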

ghost commented 4 years ago

@rlutsyshyn You need to train a vocoder from scratch, the good news is that it trains relatively fast and you should only need to do it once. Most people choose sampling rates of 22.05 or 24 kHz for faster inference but that's your call.

In synthesizer hparams, you should modify hop_length to be 0.0125 * sample_rate, and win_length and n_fft to be 4 times that number. The vocoder automatically picks up those hparams from the synthesizer. You'll also need to edit the upsampling factors in this line of code, to match your new hop length. For example, 5 * 5 * 8 = 200 (the default hop length for 16 kHz).

https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/8f71d678d2457dffc4d07b52e75be11433313e15/vocoder/hparams.py#L26
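The arithmetic above can be sketched as follows. `derive_frame_params` and `upsample_factors_match` are hypothetical helpers written for this illustration. Rounding to the nearest integer is an assumption for rates such as 22.05 kHz, where 0.0125 * sample_rate is not a whole number.

```python
from math import prod


def derive_frame_params(sample_rate):
    """Return (hop_length, win_length, n_fft) following the 12.5 ms hop rule."""
    hop_length = round(0.0125 * sample_rate)
    win_length = 4 * hop_length
    n_fft = win_length
    return hop_length, win_length, n_fft


def upsample_factors_match(factors, hop_length):
    """True when the product of the upsampling factors equals the hop length."""
    return prod(factors) == hop_length


for rate in (16000, 24000, 48000):
    hop, win, n_fft = derive_frame_params(rate)
    ok = upsample_factors_match((5, 5, 8), hop)
    print(rate, hop, win, n_fft, "default (5, 5, 8) factors fit:", ok)
```

Only the 16 kHz row passes with the default factors; for any other hop length the tuple has to be changed so that its product matches.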

When preprocessing data, the fmax can be adjusted. You can go as high as 0.5 * sample_rate (the Nyquist frequency). Higher is not necessarily better, because we only have 80 mel channels and each channel needs to represent a wider range of frequencies. If you don't want to experiment, it is safe to leave fmax untouched at 7600 Hz.
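To get a feel for the trade-off, the sketch below computes how much of the mel scale each of the 80 channels has to cover for two choices of fmax. It assumes the HTK-style mel formula; the repo's preprocessing may use a different mel variant, so treat the numbers as indicative only. All function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style Hz to mel conversion (an assumed variant of the mel scale)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def nyquist(sample_rate):
    """Highest frequency representable at the given sample rate."""
    return sample_rate / 2


def mels_per_channel(fmin, fmax, num_mels=80):
    """Width of the mel range that each channel has to represent."""
    return (hz_to_mel(fmax) - hz_to_mel(fmin)) / num_mels


default_width = mels_per_channel(55, 7600)
full_band_width = mels_per_channel(55, nyquist(48000))
print(default_width, full_band_width)
```

The second value is larger, which is the "each channel needs to represent a wider range" effect: raising fmax spreads the same 80 channels more thinly.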

rlutsyshyn commented 4 years ago

Thanks for your response @blue-fish, but to train the vocoder I need a lot of 48 kHz data, which could be a problem. By the way:

  1. If I want to fine-tune the voice cloner at 22 or 24 kHz, do I also need to retrain the vocoder from scratch?
  2. 5 * 5 * 8 = 200, but how can I split it for 600? 5 * 5 * 24? Or something else?

ghost commented 4 years ago

Hi @rlutsyshyn, you don't need to use the same datasets for synth and vocoder. You can preprocess a different 48khz dataset (even English) and it should generalize to Ukrainian if it has enough voices (several hundred or more). Use synthesizer_preprocess_audio.py , then copy SV2TTS/synthesizer/mels to SV2TTS/vocoder/mels_gta and SV2TTS/synthesizer/train.txt to SV2TTS/vocoder/synthesized.txt.
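The copy step described above could be scripted roughly like this. `reuse_synth_mels_for_vocoder` is a hypothetical helper, and the directory layout is taken from the paths named in the comment. `dirs_exist_ok=True` needs Python 3.8 or newer and is used here on the assumption that an empty mels_gta folder may already exist.

```python
import shutil
from pathlib import Path


def reuse_synth_mels_for_vocoder(sv2tts_root):
    """Copy synthesizer mels and train.txt into the vocoder's expected locations."""
    root = Path(sv2tts_root)
    synth_dir = root / "synthesizer"
    voc_dir = root / "vocoder"
    voc_dir.mkdir(parents=True, exist_ok=True)

    # SV2TTS/synthesizer/mels -> SV2TTS/vocoder/mels_gta
    shutil.copytree(synth_dir / "mels", voc_dir / "mels_gta", dirs_exist_ok=True)

    # SV2TTS/synthesizer/train.txt -> SV2TTS/vocoder/synthesized.txt
    shutil.copyfile(synth_dir / "train.txt", voc_dir / "synthesized.txt")
    return voc_dir
```

It would be called once after synthesizer preprocessing, for example as `reuse_synth_mels_for_vocoder("datasets/SV2TTS")`.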

The downside to this approach is your trained vocoder will not compensate for any deficiencies of your synthesizer model. It is a missed opportunity to make the final output better.

  1. You can continue to use Corentin's pretrained vocoder if your synthesizer hparams satisfy the following conditions.
    • num_mels = 80
    • (hop_size / sample_rate) = 0.0125
    • win_size = hop_size * 4
    • fmin = 55 and fmax = 7600

For proper vocoder inference, you either need to edit synthesizer/hparams.py or vocoder/hparams.py to set hop_size, win_size, and sample_rate to the old values (200, 800, 16khz). I don't know if it matters but you may also want to set n_fft=800. The toolbox uses the synthesizer's sampling rate, so easier to edit that hparams file (otherwise you need to resample the wav after getting it back from the vocoder).

The reason this works is because the vocoder just sees a 2d array of shape (num_mels, frames) as input. There is no sample rate information contained in the mel spectrogram. You can even go the other direction, and take a synthesizer trained at 16khz and use the mels on a vocoder trained at 24 khz :)
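The four conditions listed above can be checked mechanically before deciding whether to retrain. This is a sketch only: the hparams are passed as a plain dict rather than the repo's own hparams object, and `can_reuse_pretrained_vocoder` is a hypothetical name.

```python
import math


def can_reuse_pretrained_vocoder(hp):
    """Check synthesizer hparams against the conditions listed in the thread."""
    return (
        hp["num_mels"] == 80
        and math.isclose(hp["hop_size"] / hp["sample_rate"], 0.0125)
        and hp["win_size"] == hp["hop_size"] * 4
        and hp["fmin"] == 55
        and hp["fmax"] == 7600
    )


# Illustrative settings for a 24 kHz synthesizer that keeps the 12.5 ms hop.
synth_24k = {
    "num_mels": 80,
    "sample_rate": 24000,
    "hop_size": 300,
    "win_size": 1200,
    "fmin": 55,
    "fmax": 7600,
}

# Same settings but with a hop of 256, which breaks the hop/sample_rate ratio.
synth_hop_256 = dict(synth_24k, hop_size=256, win_size=1024)

print(can_reuse_pretrained_vocoder(synth_24k))
print(can_reuse_pretrained_vocoder(synth_hop_256))
```

Passing the check only says the mel frames line up with what the pretrained vocoder expects; the hparams still need to be set back to the old values at inference time, as described above.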

  2. I've personally tried (4, 4, 4, 4) for 256, and (5, 6, 10) for 300 and the results were good. I have not read the WaveRNN paper, so I don't know how to select the upsampling factors. Maybe try (4, 5, 5, 6) for 600? An extra element does not add that many trainable parameters, or affect inference speed significantly.

rlutsyshyn commented 4 years ago

@blue-fish Thanks for your fast response, will try this :)

rlutsyshyn commented 4 years ago

@blue-fish Hey! Can you give me some advice? When I fine-tuned only the synthesizer on my data (a 16 kHz English speaker), testing gave a similar voice, but the words sound like "bla bla bla ... bla bla bla". Is that a problem with the synthesizer, or do I have to train (fine-tune) the vocoder for that voice? Thanks.

ghost commented 4 years ago

@rlutsyshyn Are you taking the pretrained synthesizer (English) and finetuning on your Ukrainian data? That's not going to work because the mapping of letters to sounds will not match. You need to start the synthesizer training from scratch when working with a new language.

Ananas120 commented 4 years ago

For the classic Tacotron-2 model, transfer training from English to another language works (French for me), but English and French sounds are not that far apart, so I suppose the mapping differs only slightly. For this model it doesn't work, but I think the fault lies not with the pretrained weights but with my encoder, my dataset, or my preprocessing.

rlutsyshyn commented 4 years ago

@blue-fish @Ananas120 I used the English synthesizer and tried to fine-tune it on English data that I recorded myself. I collected 400 utterance samples and tried to fine-tune the synthesizer on them, but got "bla bla bla".

ghost commented 4 years ago

When finetuning, use the same embedding for all of your samples for faster convergence. I take the embedding of the first audio file and use it to overwrite all the others. For inference, make sure you load the same audio file used to generate your embeds for finetuning.

If it still doesn't work, check your preprocessing and also make sure the transcripts in train.txt match what is spoken in the audio files.

rlutsyshyn commented 4 years ago

@blue-fish Can you explain this same-embedding approach in more detail, please?

Ananas120 commented 4 years ago

At the moment I use a « speaker embedding » (the mean of all utterance embeddings). Is that more interesting, or is it better to use one single « real » utterance embedding for all?

ghost commented 4 years ago

@rlutsyshyn You have 400 wav files in your training set for finetuning. When you run synthesizer_preprocess_embeds.py it will make embed-file1.npy, ... , embed-file400.npy, in SV2TTS/synthesizer/embeds. Copy the contents of file 1 to files 2-400, so that they are all the same.
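The overwrite step can be done with a plain file copy. The sketch below assumes the embed-*.npy naming shown above; `share_one_embedding` is a hypothetical helper, and keeping a backup of the embeds folder first is advisable because the originals are replaced.

```python
import shutil
from pathlib import Path


def share_one_embedding(embeds_dir, reference_file):
    """Overwrite every embed-*.npy in embeds_dir with the reference file's contents."""
    embeds_dir = Path(embeds_dir)
    reference = embeds_dir / reference_file
    overwritten = []
    for path in sorted(embeds_dir.glob("embed-*.npy")):
        if path.name == reference.name:
            continue  # leave the reference embedding itself untouched
        shutil.copyfile(reference, path)
        overwritten.append(path.name)
    return overwritten
```

For example, `share_one_embedding("datasets/SV2TTS/synthesizer/embeds", "embed-file1.npy")` would return the names of the files that were replaced.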

@Ananas120 I use the embedding of a real utterance so I can load the audio file in the toolbox to get the desired embedding. The mean or L2-norm is technically better but with a good encoder model it shouldn't make much of a difference.
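For comparison, the averaged alternative mentioned above might look like the sketch below: take the element-wise mean of the utterance embeddings, then rescale it to unit L2 norm. The embeddings are assumed to be plain lists of floats, and `mean_embedding` is a hypothetical helper rather than a function from the repo.

```python
import math


def mean_embedding(embeddings):
    """Element-wise mean of the embeddings, rescaled to unit L2 norm."""
    dim = len(embeddings[0])
    count = len(embeddings)
    mean = [sum(e[i] for e in embeddings) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


# Toy 2-dimensional vectors standing in for real utterance embeddings.
utterances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speaker = mean_embedding(utterances)
print(speaker)
```

The drawback noted in the reply still applies: an averaged vector does not correspond to any single audio file that can be loaded in the toolbox.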

rlutsyshyn commented 4 years ago

@blue-fish

"For inference, make sure you load the same audio file used to generate your embeds for finetuning."

What did you mean?

ghost commented 4 years ago

@rlutsyshyn After https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/492#issuecomment-695072250 , your entire dataset is using the embedding from file 1. The embedding corresponds to a specific audio file, let's call it file1.wav. When you test your new synthesizer in the toolbox (or demo_cli.py), you must remember to load file1.wav to generate the embedding.

rlutsyshyn commented 4 years ago

@blue-fish I tried to do what you said but I still get the same results... bla bla bla. I checked the datasets/SV2TTS/synthesizer/train.txt file and everything looks good there, e.g.:

audio-Track 1 - 218.npy|mel-Track 1 - 218.npy|embed-Track 1 - 218.npy|113367|567|Track 1 - 218|You humans who listened to the low notes from the tuba rated it as bittersweet.|You humans who listened to the low notes from the tuba rated it as bittersweet

I used the first embedding for fine-tuning the model, and the same embedding for inference in the toolbox or demo_cli.py. While I fine-tuned the model, the loss was around 0.5 and wouldn't fall further.

ghost commented 4 years ago

Your train.txt is improperly formatted. Here is an example line:

audio-p240_001.npy|mel-p240_001.npy|embed-p240_001.npy|38921|195|Please call Stella.
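A small validator can catch this kind of formatting problem before training. It is a sketch based only on the example line above, which has six pipe-separated fields. The field names used here (in particular reading the two numbers as sample and frame counts) are assumptions, and `parse_train_line` is a hypothetical helper.

```python
def parse_train_line(line):
    """Split one train.txt line into its six fields, or raise ValueError."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    audio, mel, embed, num_samples, num_frames, text = fields
    return {
        "audio": audio,
        "mel": mel,
        "embed": embed,
        "num_samples": int(num_samples),  # assumed meaning of the 4th field
        "num_frames": int(num_frames),    # assumed meaning of the 5th field
        "text": text,
    }


good_line = "audio-p240_001.npy|mel-p240_001.npy|embed-p240_001.npy|38921|195|Please call Stella."
print(parse_train_line(good_line)["text"])

# A line with extra fields, like the one posted earlier, is rejected.
bad_line = good_line + "|Please call Stella"
try:
    parse_train_line(bad_line)
except ValueError as error:
    print("rejected:", error)
```

Running every line of train.txt through a check like this would have flagged the eight-field lines shown earlier in the thread.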

rlutsyshyn commented 4 years ago

@blue-fish Thanks, now it works well. But how can I improve the quality of the output?

ghost commented 4 years ago

@rlutsyshyn That's something that I continue to work on now. I am experimenting with different synthesizer models and settings, but I still have not surpassed the pretrained models from Corentin.

rlutsyshyn commented 4 years ago

@blue-fish Can the vocoder fine tuning improve output audio quality?

ghost commented 4 years ago

@rlutsyshyn Yes, though you'll want to make sure you are satisfied with the synthesizer before moving on to vocoder training.

rlutsyshyn commented 4 years ago

@blue-fish Yes, I think I'm satisfied with the quality of the synthesizer model. But when I try to fine-tune the vocoder (on 16 kHz data), the output is just noise...

ghost commented 4 years ago

Closing this issue due to inactivity. @rlutsyshyn I think you know as much about this repo as I do now. My recommendation is to avoid finetuning the vocoder, since it will not improve the quality that much. If you need a better vocoder, train it from scratch.

Adnan3234 commented 1 year ago

How many voice samples of a particular voice are required to train the model?