Closed rlutsyshyn closed 4 years ago
Update synthesizer/utils/symbols.py to contain all valid characters in your text transcripts (the characters you want to train on). This is an example for Swedish: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/3eb96df1c6b4b3e46c28c6e75e699bffc6dd43be
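One way to find out which characters are missing is to scan the transcripts before editing the file. This is only a sketch: collect_symbols and the base string are hypothetical names, and it assumes the character inventory can be treated as a plain string, which may not match how symbols.py is actually laid out.

```python
def collect_symbols(transcripts, base_symbols):
    """Return base_symbols plus every transcript character it does not cover.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    seen = set(base_symbols)
    extra = []
    for text in transcripts:
        for ch in text:
            if ch not in seen:
                seen.add(ch)
                extra.append(ch)
    # New characters are appended in sorted order so the result is reproducible.
    return base_symbols + "".join(sorted(extra))


# Assumed starting inventory (lowercase Latin letters, space, basic punctuation).
base = "abcdefghijklmnopqrstuvwxyz !'(),-.:;?"
updated = collect_symbols(["hej då", "smörgås"], base)
```

Anyone who later runs the trained model needs the same resulting inventory, so it helps to keep the final string under version control alongside the model.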
However, be careful: in order for someone to run the model you've created they will also need to make the same changes to the file. I spent hours learning this the hard way trying to use the model in #257 because the creator was unavailable to help.
Thank you very much! Will try :)
Can you also tell me: can I somehow fine-tune the pretrained model on some new voice samples without full retraining?
Yes, you can resume training on a pretrained model using a different dataset. The main use for this is single-speaker finetuning (process and examples in #437) but you could also finetune multi-speaker using the same process.
One more thing to add: the speaker encoder is trained on English and may not work well for other languages. If you have a large number of voice samples in your target language, you may wish to train a new encoder or at least finetune an existing one. (Data preprocessing for the encoder is not a smooth process, so set your expectations accordingly.)
There are some very good speaker encoders shared in #126 but the model size of 768 is too big to be practical for cloning. You can use this process to import the relevant weights from the model and finetune to a more useful dimension: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-673341585
Hello, I will also try to train a voice cloning model in another language (French for me), and I have some tricks for you if they can help:
Good luck with training!
The encoder is trained on English, so I don't know if it is portable to other languages.
The English encoder works all right for Swedish. There's info on setting it up and samples in #257. Since encoder training is very intensive, you should just try it (either jump straight to synth preprocess and training, or do some speaker verification with Ukrainian utterances to see how well it performs).
Thanks guys! Will try :)
@rlutsyshyn How is progress on your synthesizer model?
@blue-fish Just collect a lot of data :)
Hello, I speak Spanish. Is there a tutorial for training it on my language? Sorry, I am a complete beginner with this, but it is a very fun project.
@afantasialiberal Please see https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/431#issuecomment-673555684 for a general outline of the process. There is no tutorial available at this time.
Hey, I have a new issue when I try to run vocoder_preprocess. Preprocessing "starts" but does 0 iterations (without any error). I have datasets/SV2TTS/vocoder/mels_gta but it is empty, and datasets/SV2TTS/vocoder/synthesized.txt is also empty... Maybe I missed something? I just fine-tuned the pretrained model on my own data (there were no problems with the synthesizer).
@rlutsyshyn Do you still have that issue with vocoder preprocess?
@blue-fish I have an issue with the synthesizer now :) When I use 48 kHz audio and calculate the parameters in synthesizer/hparams.py, after fine-tuning my voice sounds like Alvin and the Chipmunks (very, very fast)... Do you have any advice for this case? What are the main parameters to configure to get a normal voice in the output?
Just to be sure: if you train the synthesizer to create 48 kHz mel spectrograms, you should also train the vocoder to generate 48 kHz audio (because the pretrained one is trained on 16 kHz audio). You should also check that the parameters for the audio player etc. are properly updated for your 48 kHz rate.
Good luck!
@Ananas120 For the synthesizer I can modify win_size, hop, etc. in hparams.py, but in vocoder/hparams.py I don't see anything like that, so what should I modify to fine-tune my vocoder for 48 kHz data? Thanks :)
Honestly, I don't know; I think blue-fish can help you better with this. If the audio only seems to go too fast but otherwise sounds good, it may just be a problem with the audio player's sample rate, regardless of the spectrogram's rate (I don't know whether it changes anything for the vocoder if the spectrogram is 16 kHz or 48 kHz). So you could search for where the toolbox calls something like sounddevice.play (sd.play). You could also take the audio the vocoder generates and play it yourself with a 48 kHz rate (with IPython.display.Audio, for example, if you use a Jupyter notebook).
@rlutsyshyn You need to train a vocoder from scratch, the good news is that it trains relatively fast and you should only need to do it once. Most people choose sampling rates of 22.05 or 24 kHz for faster inference but that's your call.
In synthesizer hparams, you should modify hop_length to be 0.0125 * sample_rate, and win_length and n_fft to be 4 times that number. The vocoder automatically picks up those hparams from the synthesizer. You'll also need to edit the upsampling factors in this line of code, to match your new hop length. For example, 5*5*8 = 200 (the default hop length for 16 kHz).
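The arithmetic above can be sketched as follows. The function names are made up for illustration and are not part of the repo; the rounding step is an assumption for sample rates where 0.0125 * sample_rate is not a whole number.

```python
from math import prod


def derive_audio_hparams(sample_rate, frame_shift_s=0.0125):
    """Derive hop_length, win_length and n_fft from the sample rate.

    Follows the rule of thumb in this thread: hop = 0.0125 * sample_rate,
    and win_length = n_fft = 4 * hop. Rounding is an assumption.
    """
    hop_length = round(sample_rate * frame_shift_s)
    return {
        "sample_rate": sample_rate,
        "hop_length": hop_length,
        "win_length": 4 * hop_length,
        "n_fft": 4 * hop_length,
    }


def upsample_factors_match(factors, hop_length):
    """The vocoder's upsampling factors must multiply out to the hop length."""
    return prod(factors) == hop_length


default_16k = derive_audio_hparams(16000)  # hop 200, win 800, n_fft 800
custom_48k = derive_audio_hparams(48000)   # hop 600, win 2400, n_fft 2400
```

Running upsample_factors_match((5, 5, 8), default_16k["hop_length"]) confirms the default setup; for 48 kHz, any tuple whose product is 600 passes the same check.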
When preprocessing data, the fmax can be adjusted. You can go as high as 0.5 * sample_rate (the Nyquist frequency). Higher is not necessarily better, because we only have 80 mel channels and each channel needs to represent a wider range of frequencies. If you don't want to experiment, it is safe to leave fmax untouched at 7600 Hz.
Thanks for your response @blue-fish, but to train the vocoder I need a lot of 48 kHz data, which could be a problem. By the way:
Hi @rlutsyshyn, you don't need to use the same datasets for synth and vocoder. You can preprocess a different 48 kHz dataset (even English) and it should generalize to Ukrainian if it has enough voices (several hundred or more). Use synthesizer_preprocess_audio.py, then copy SV2TTS/synthesizer/mels to SV2TTS/vocoder/mels_gta and SV2TTS/synthesizer/train.txt to SV2TTS/vocoder/synthesized.txt.
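The copy step could be scripted roughly like this. It is a sketch, not a repo script: reuse_synthesizer_mels is a hypothetical name, and it assumes the SV2TTS folder layout named in the comment above.

```python
import shutil
from pathlib import Path


def reuse_synthesizer_mels(sv2tts_root):
    """Copy synthesizer mels and train.txt into the vocoder's input layout.

    Hypothetical helper; the directory names come from this thread.
    """
    root = Path(sv2tts_root)
    synth_dir = root / "synthesizer"
    vocoder_dir = root / "vocoder"
    vocoder_dir.mkdir(parents=True, exist_ok=True)

    # mels -> mels_gta (dirs_exist_ok needs Python 3.8 or newer)
    shutil.copytree(synth_dir / "mels", vocoder_dir / "mels_gta", dirs_exist_ok=True)
    # train.txt -> synthesized.txt
    shutil.copyfile(synth_dir / "train.txt", vocoder_dir / "synthesized.txt")
    return vocoder_dir
```

A call such as reuse_synthesizer_mels("datasets/SV2TTS") would then leave the vocoder folder ready for training. The path is an example only.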
The downside to this approach is your trained vocoder will not compensate for any deficiencies of your synthesizer model. It is a missed opportunity to make the final output better.
For proper vocoder inference, you need to edit either synthesizer/hparams.py or vocoder/hparams.py to set hop_size, win_size, and sample_rate to the old values (200, 800, 16 kHz). I don't know if it matters, but you may also want to set n_fft=800. The toolbox uses the synthesizer's sampling rate, so it is easier to edit that hparams file (otherwise you need to resample the wav after getting it back from the vocoder).
The reason this works is because the vocoder just sees a 2d array of shape (num_mels, frames) as input. There is no sample rate information contained in the mel spectrogram. You can even go the other direction, and take a synthesizer trained at 16khz and use the mels on a vocoder trained at 24 khz :)
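A small arithmetic sketch of why the mels are interchangeable: if hop_length is always 0.0125 * sample_rate, every mel frame covers the same 12.5 ms at any sample rate. The function names are illustrative only, and the 48 kHz and 24 kHz hop lengths are derived from that rule, not taken from a trained model.

```python
def frame_duration_s(hop_length, sample_rate):
    """Seconds of audio represented by one mel frame."""
    return hop_length / sample_rate


def audio_duration_s(num_frames, hop_length, sample_rate):
    """Length of the waveform a vocoder produces from num_frames mel frames."""
    return num_frames * hop_length / sample_rate


# Same time per frame at 16, 24 and 48 kHz when hop = 0.0125 * sample_rate.
per_frame = [
    frame_duration_s(200, 16000),
    frame_duration_s(300, 24000),
    frame_duration_s(600, 48000),
]

# 400 frames from a 48 kHz synthesizer, vocoded at 16 kHz with hop 200.
seconds = audio_duration_s(400, 200, 16000)
```

Because the duration depends only on the vocoder's own hop length and sample rate, the speech comes out at normal speed as long as the wav is also played back at the vocoder's sample rate.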
(4, 4, 4, 4) for 256, and (5, 6, 10) for 300, and the results were good. I have not read the WaveRNN paper, so I don't know how to select the upsampling factors. Maybe try (4, 5, 5, 6) for 600? An extra element does not add that many trainable parameters or affect inference speed significantly.
@blue-fish Thanks for your fast response, will try this :)
@blue-fish Hey! Can you give me some advice? When I used 16 kHz English speaker data to fine-tune only the synthesizer, in testing I got a similar voice but the words sound like "bla bla bla ... bla bla bla". Is that a problem with the synthesizer, or do I have to train (fine-tune) the vocoder for that voice? Thanks
@rlutsyshyn Are you taking the pretrained synthesizer (English) and finetuning on your Ukrainian data? That's not going to work because the mapping of letters to sounds will not match. You need to start the synthesizer training from scratch when working with a new language.
For the classic Tacotron-2 model, finetuning from English to another language works (French for me), but English and French sounds are not that far apart, so I suppose the mapping differs only slightly. For this model it doesn't work, but I think the fault lies not with the pretrained weights but with my encoder, my dataset, or my preprocessing.
@blue-fish @Ananas120 I used the English synthesizer and tried to fine-tune it on English data that I recorded myself. I collected 400 utterance samples and tried to fine-tune the synthesizer on them, but got "bla bla bla".
When finetuning, use the same embedding for all of your samples for faster convergence. I take the embedding of the first audio file and use it to overwrite all the others. For inference, make sure you load the same audio file used to generate your embeds for finetuning.
If it still doesn't work, check your preprocessing and also make sure the transcripts in train.txt match what is spoken in the audio files.
@blue-fish Can you explain this same-embedding approach in more detail, please?
At the moment I use a "speaker embedding" (the mean of all utterance embeddings). Is that preferable, or is it better to use one single "real" utterance embedding for all samples?
@rlutsyshyn You have 400 wav files in your training set for finetuning. When you run synthesizer_preprocess_embeds.py it will make embed-file1.npy, ... , embed-file400.npy in SV2TTS/synthesizer/embeds. Copy the contents of file 1 to files 2-400, so that they are all the same.
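The copy step can be automated along these lines. This is a sketch: share_one_embedding is a hypothetical helper, and the embed-*.npy glob pattern is assumed from the file names mentioned in this thread.

```python
import shutil
from pathlib import Path


def share_one_embedding(embeds_dir, reference_name):
    """Overwrite every embed-*.npy in embeds_dir with the reference file.

    Returns the number of files overwritten. Hypothetical helper, not part
    of the repository. Back up the folder first if you may need the
    original per-utterance embeddings again.
    """
    embeds_dir = Path(embeds_dir)
    reference = embeds_dir / reference_name
    if not reference.is_file():
        raise FileNotFoundError(reference)

    overwritten = 0
    for path in sorted(embeds_dir.glob("embed-*.npy")):
        if path == reference:
            continue  # leave the reference embedding untouched
        shutil.copyfile(reference, path)
        overwritten += 1
    return overwritten
```

For the 400-file example, calling share_one_embedding on SV2TTS/synthesizer/embeds with "embed-file1.npy" would overwrite the other 399 files.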
@Ananas120 I use the embedding of a real utterance so I can load the audio file in the toolbox to get the desired embedding. The mean or L2-norm is technically better but with a good encoder model it shouldn't make much of a difference.
@blue-fish "For inference, make sure you load the same audio file used to generate your embeds for finetuning." What did you mean by this?
@rlutsyshyn After https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/492#issuecomment-695072250 , your entire dataset is using the embedding from file 1. The embedding corresponds to a specific audio file, let's call it file1.wav. When you test your new synthesizer in the toolbox (or demo_cli.py), you must remember to load file1.wav to generate the embedding.
@blue-fish I tried to do what you said but I still get the same results... "bla bla bla". I checked the datasets/SV2TTS/synthesizer/train.txt file and everything looks good there, e.g.:
audio-Track 1 - 218.npy|mel-Track 1 - 218.npy|embed-Track 1 - 218.npy|113367|567|Track 1 - 218|You humans who listened to the low notes from the tuba rated it as bittersweet.|You humans who listened to the low notes from the tuba rated it as bittersweet
I used the first embedding for fine-tuning the model, and the same embedding for inference in the toolbox or demo_cli.py. While I fine-tuned the model, the loss was around 0.5 and would not fall any further.
Your train.txt is improperly formatted. Here is an example line:
audio-p240_001.npy|mel-p240_001.npy|embed-p240_001.npy|38921|195|Please call Stella.
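A quick check for this kind of formatting problem might look like the sketch below. The expected count of six pipe-separated fields is inferred from the example line above and is not a documented specification; find_malformed_lines is a hypothetical name.

```python
def find_malformed_lines(lines, expected_fields=6):
    """Return 1-based numbers of lines whose '|' field count is unexpected.

    expected_fields=6 is an assumption taken from the example line:
    audio file | mel file | embed file | samples | frames | text.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if len(line.split("|")) != expected_fields:
            bad.append(number)
    return bad


good = "audio-p240_001.npy|mel-p240_001.npy|embed-p240_001.npy|38921|195|Please call Stella."
bad = ("audio-Track 1 - 218.npy|mel-Track 1 - 218.npy|embed-Track 1 - 218.npy|113367|567|"
       "Track 1 - 218|You humans who listened to the low notes from the tuba rated it as bittersweet.|"
       "You humans who listened to the low notes from the tuba rated it as bittersweet")
problems = find_malformed_lines([good, bad])
```

Here the second line is flagged because it carries eight fields. A transcript that itself contains a "|" character would also be flagged, so treat a hit as a prompt to inspect the line by hand.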
@blue-fish Thanks, now it works well. But how can I improve the quality of the output?
@rlutsyshyn That's something that I continue to work on now. I am experimenting with different synthesizer models and settings, but I still have not surpassed the pretrained models from Corentin.
@blue-fish Can the vocoder fine tuning improve output audio quality?
@rlutsyshyn Yes, though you'll want to make sure you are satisfied with the synthesizer before moving on to vocoder training.
@blue-fish Yes, I think I'm satisfied with the synthesizer model quality. But when I try to fine-tune the vocoder (on 16 kHz data), all I hear in the output is plain noise...
Closing this issue due to inactivity. @rlutsyshyn I think you know as much about this repo as I do now. My recommendation is to avoid finetuning the vocoder, since it will not improve the quality that much. If you need a better vocoder, train it from scratch.
How many voice samples of a particular voice are required to train the model?
Hi! I already know how to train the synthesizer and vocoder, and I also know how to create a relevant dataset. But if I want to train a voice cloning model for another language, e.g. Ukrainian, what else should I do?