CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

TTS outputting different words than the ones typed in #883

Closed Ca-ressemble-a-du-fake closed 2 years ago

Ca-ressemble-a-du-fake commented 2 years ago

Hi,

I am getting my hands dirty with your fun project! I am trying to clone a voice in French. I edited a short recording and made 16 extracts out of it (22 kHz mono 32-bit PCM Microsoft WAV, ranging from 1 to 5 seconds), which I manually transcribed following the file hierarchy @blue-fish shows.

I also added some characters to utils/symbols. Then I launched the training with the command you gave: `python3 synthesizer_preprocess_audio.py datasets_root --datasets_name LibriTTS --subfolders train-clean-100 --no_alignments`
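For reference, my change is along these lines (a hedged sketch: the exact variable names in synthesizer/utils/symbols.py may differ from what I show here):

```python
# Sketch of extending synthesizer/utils/symbols.py with French characters.
# _pad, _eos and _characters follow the keithito-style symbols module this
# repo uses; the extra accented set is what I added for French.
_pad        = "_"
_eos        = "~"
_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),-.:;? "
_french     = "àâäçéèêëîïôöùûüÿœæÀÂÄÇÉÈÊËÎÏÔÖÙÛÜŸŒÆ"

symbols = [_pad, _eos] + list(_characters + _french)
```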

I let it go up to 22k steps because I did not know when to stop it (this guy suggests stopping when the loss drops below 0.15) and because the SV2TTS folder in my datasets_root was not growing. Indeed, it seems the training writes the model to synthesizer/saved_models/my_new_run.

I looked at the generated mel-spectrograms (see the attached step-20000-mel-spectrogram_sample_1) and they looked promising, since ground truth and predicted looked quite alike (at least to my untrained eyes).

But when I tried the model out in the toolbox I got pitiful results. I input "Bonjour le monde" and it output a completely different phrase (e.g. "on va en profiter") taken directly from the extracts.

I know you advised a 12-minute recording for single-speaker training, but the other guy I mentioned earlier had good results with 20 or 30 extracts (in English), so I took my chances with even fewer extracts just to have a quick starting point to compare against later.

Still, I am disappointed because the results were not what I expected. I would have expected bad audio quality, but not completely different words! Unless the phrase typed into the TTS field and the one it outputs as a wav happen to have the closest embeddings?

It also looks like extracts shorter than about 1.53 s (as reported by Audacity) are discarded. Is that expected, and is it linked to the 1.6 s utterance duration mentioned on page 16 of Corentin's thesis?
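For context, here is the kind of duration filter I assume the preprocessing applies (a sketch only: the exact hparam name and the sample rate should be checked in synthesizer/hparams.py; the 1.6 s value is the one from the thesis):

```python
import librosa

UTTERANCE_MIN_DURATION = 1.6  # seconds; assuming this matches the thesis value
SAMPLE_RATE = 16000           # the synthesizer's sample rate in this repo, if I'm not mistaken

def keep_utterance(wav_path: str) -> bool:
    """Mimic of the preprocessing check: drop clips shorter than the minimum duration."""
    wav, _ = librosa.load(wav_path, sr=SAMPLE_RATE)
    return len(wav) / SAMPLE_RATE >= UTTERANCE_MIN_DURATION
```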

Finally, I could not train the vocoder because of a missing mels_gta directory. I know you wrote that training the vocoder is not necessary, but if the mels_gta directory is missing, maybe something went wrong during training. Or is everything OK?

Is it worth continuing and editing 12 minutes or more of this voice, or did I do something wrong in the process?

Can you help me out ?

ghost commented 2 years ago

Pretrained models only support English. You cannot finetune them to a different language.

If you want a French model, it needs to be trained from scratch with an appropriate dataset.

Ca-ressemble-a-du-fake commented 2 years ago

Thanks for your answer. Is it because French is too far from English that comment #492 does not apply, or is it because you've noticed that using the English encoder for other languages eventually gives subpar results?

ghost commented 2 years ago

What I meant is, train the synthesizer from scratch on a French dataset. The English encoder can be reused.

The English synthesizer was trained on 26,000 minutes of English data. You continued the synthesizer training on a custom dataset with only 1 minute of French data. That doesn't work because the French dataset is insufficient by a few orders of magnitude. Much more data is needed to learn French pronunciation. And if you have that much data, it is enough to train from scratch.
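To make the encoder reuse concrete, here is a minimal sketch of computing a speaker embedding for a French utterance with the pretrained English encoder, using this repo's encoder.inference module (the file paths below are placeholders to adjust to your setup):

```python
from pathlib import Path

from encoder import inference as encoder  # this repo's speaker encoder

# Placeholder paths; point them at your checkout and your dataset.
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

wav = encoder.preprocess_wav(Path("datasets_root/my_french_speaker/extrait_01.wav"))
embed = encoder.embed_utterance(wav)

print(embed.shape)  # a 256-dimensional speaker embedding
```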

Ca-ressemble-a-du-fake commented 2 years ago

Thanks for this piece of information. So in order to clone a single voice in French, am I on the right track with the following procedure:

By the way, during synthesizer training, are there parameters to monitor that indicate the model is improving in quality? When can I stop the training process?

ghost commented 2 years ago

> Now the run_ID parameter for synthesizer_train.py has to be pretrained, right?

You can use anything for run_ID. You will need to delete the English pretrained model if you choose pretrained (because the model will not be compatible with your new symbols list).
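As a toy illustration of that incompatibility (this is not the repo's actual Tacotron code): the text embedding table is sized by the number of symbols, so a checkpoint saved with the English symbol list fails to load once the list grows:

```python
import torch.nn as nn

# Illustrative sizes only: 66 symbols for the English list, 80 after adding accents.
old_embedding = nn.Embedding(num_embeddings=66, embedding_dim=512)
new_embedding = nn.Embedding(num_embeddings=80, embedding_dim=512)

try:
    # Loading the old weights into the resized layer fails with a size mismatch.
    new_embedding.load_state_dict(old_embedding.state_dict())
except RuntimeError as e:
    print("size mismatch:", e)
```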

> Use 0.2 hours of extracts from the target voice to fine-tune the model for that voice.

When training the initial model, you can try including your target voice with the rest of the training data. Then it may not be necessary to finetune.

> By the way, during synthesizer training, are there parameters to monitor that indicate the model is improving in quality? When can I stop the training process?

For your first model from scratch, I suggest training an English synthesizer using the instructions on the wiki page. The model backs up every 25k steps; try each of those checkpoints to get an idea of how many training steps are necessary. The experience will help you judge whether your French model is training well.

Ca-ressemble-a-du-fake commented 2 years ago

Which directory structure should I adopt for my dataset? The one you showed in #437 or the one used in logs-singlespeaker?

ghost commented 2 years ago

https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-666099538

Ca-ressemble-a-du-fake commented 2 years ago

Thanks again. I started training the synthesizer with @Ananas120's dataset (see the #492 comment) since they reported good results. I got the following warning: `Real-Time-Voice-Cloning-master/synthesizer/synthesizer_dataset.py:84: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:201.) embeds = torch.tensor(embeds)`

So I followed that advice and replaced the flagged line with `embeds = torch.tensor(np.array(embeds))`. However, training still runs at about 0.52 step / s (the GPU is an RTX 3070), so 25k steps will take roughly 13 hours. Was changing the code as the warning suggests the right decision? Why wasn't it changed in this repo?
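For reference, here is a standalone reproduction of the warning and of the fix I applied (a minimal sketch: the batch size and embedding dimension below are made up, only the construction pattern matters):

```python
import numpy as np
import torch

# A batch of speaker embeddings as the dataset collate step would see it;
# the count (12) and dimension (256) are only illustrative.
embeds = [np.random.rand(256).astype(np.float32) for _ in range(12)]

slow = torch.tensor(embeds)            # list of ndarrays -> triggers the UserWarning
fast = torch.tensor(np.array(embeds))  # stack into one ndarray first -> no warning

print(slow.shape, fast.shape)                       # both torch.Size([12, 256])
print(torch.allclose(slow.float(), fast.float()))   # True: same values either way
```

Either way the values are identical, so this change should only speed up how each batch is built; it does not seem related to the overall 0.52 step / s.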

Will the results come faster using a Google Colab notebook (I've just discovered this technology while browsing the topic, sorry if the question is silly)?

Ananas120 commented 2 years ago

> Thanks again. I started training the synthesizer with @Ananas120's dataset (see the #492 comment) since they reported good results. I got the following warning: `Real-Time-Voice-Cloning-master/synthesizer/synthesizer_dataset.py:84: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:201.) embeds = torch.tensor(embeds)`
>
> So I followed that advice and replaced the flagged line with `embeds = torch.tensor(np.array(embeds))`. However, training still runs at about 0.52 step / s (the GPU is an RTX 3070), so 25k steps will take roughly 13 hours. Was changing the code as the warning suggests the right decision? Why wasn't it changed in this repo?
>
> Will the results come faster using a Google Colab notebook (I've just discovered this technology while browsing the topic, sorry if the question is silly)?

It depends on multiple parameters, such as the batch_size and whether you take the preprocessing into account. For me, performance was around 10 s / batch (batch_size 32) including the preprocessing, so yes, 2 s / epoch seems to be really good performance (for Tacotron2 training).

ghost commented 2 years ago

@Ca-ressemble-a-du-fake Slow training is a known bug that we don't have a solution for at this time. Please contribute ideas and observations to #700

Ca-ressemble-a-du-fake commented 2 years ago

@Ananas120 what hardware did you use? I will try increasing the batch size from the default 12 to 32 and the r parameter to 4, as reported by @MGSousa in #700 (comment).

@blue-fish: did you also encounter this slow training? What should the rate be?

Ca-ressemble-a-du-fake commented 2 years ago

I was not sure how to change the batch size and r parameter, so I edited the hparams.py file and changed the second element of the array. Increasing r from 2 to 8 boosted the rate to 1.7 step / s, but after a while the program crashed. Now r is set to 4 and the batch size to 16, and the rate has settled at around 1 step / s. Yet I don't know what consequences these settings have on the generated model.
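For reference, the array I edited in hparams.py looks something like this (a hedged sketch: if I read it correctly each tuple is (r, learning rate, step threshold, batch size), but double-check the comment in your own copy of synthesizer/hparams.py, and the numbers below are only examples):

```python
# Sketch of the Tacotron training schedule in synthesizer/hparams.py.
# Each tuple appears to be (reduction factor r, learning rate, train up to
# this step, batch size) -- verify against the comment in your copy before editing.
tts_schedule = [
    (4, 1e-3,  50_000, 16),   # e.g. r=4, batch size 16 until 50k steps
    (2, 5e-4, 100_000, 16),   # then drop r to 2 and lower the learning rate
    (2, 1e-4, 200_000, 16),
]
```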

This section of the NVIDIA Tacotron 2 repo indicates a batch size of 48 for Ampere GPUs with FP32 (enabled by default), but I cannot reach this value (they use a 40 GB GPU, mine has only 8 GB).

ghost commented 2 years ago

The reduction factor r is explained in the Tacotron 1 paper (arXiv:1703.10135). You'll need to do your own experimentation to identify the best value for this parameter.

> An important trick we discovered was predicting multiple, non-overlapping output frames at each decoder step. Predicting r frames at once divides the total number of decoder steps by r, which reduces model size, training time, and inference time.
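As a back-of-the-envelope illustration of what r changes (assuming roughly 400 mel frames for a 5-second utterance at a 12.5 ms hop):

```python
import math

n_frames = 400  # ~5 s of audio at a 12.5 ms mel hop (assumption for illustration)

for r in (1, 2, 4, 8, 16):
    decoder_steps = math.ceil(n_frames / r)
    print(f"r={r:>2}: {decoder_steps} decoder steps per utterance")
```

Higher r means fewer decoder steps per utterance, hence faster training steps, but each step has to predict more frames at once, which can hurt quality if pushed too far.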

Ca-ressemble-a-du-fake commented 2 years ago

That's interesting, thanks! I found that 20 was the maximum for r on my machine. After reading the paper I confidently set it to 16 with a batch size of 1 and resumed training with the new parameters (I just stopped it and relaunched from the last checkpoint). It ran overnight, but when I listened to the output I found the 25k model much better (more intelligible) than the 150k model. Does that demonstrate that my r setting is too high? Do you recommend "gradual training"?

ghost commented 2 years ago

I use the gradual training technique and recommend it. It was a feature of the original Tacotron1 implementation that we selected for this repo (see here), but it was removed during development.

For the final merge of #472, I switched to constant r=2 and batch=12 to mimic the training conditions of my Tacotron2 model in #538. I did this to understand how Tacotron1 performed compared to the Tacotron2 model it replaced. I would have liked to change the default hparams back to gradual training, but did not have the time to train and benchmark a new pretrained model.

Ca-ressemble-a-du-fake commented 2 years ago

I changed tts_cleaner_names to ["transliteration_cleaners"] in the hparams.py file, but accented letters keep being replaced by their unaccented counterparts during training, even though the train.txt file contains all the accented letters. Is this just a minor display issue, or is this filtered text actually what feeds the model?

By the way, is it OK to keep posting questions related to my experiment in this thread, or should a new issue be opened each time the topic differs?

Ananas120 commented 2 years ago

@Ca-ressemble-a-du-fake I used a GTX 1070 with 6 GB of VRAM and a batch size of 16 (no reduction factor), if I remember correctly, and also tried an RTX 3080 with a batch size of 32; both gave similar results I think (not rigorously tested).

You can see the parameters I used, as well as pretrained models for French and some voice-cloning experiments in French, in yui's GitHub repo here, to which I contributed my French pretrained model (trained on SIWIS) and an online demo on Google Colab.

Note that the model used in that repo is also Tacotron2, but implemented in TensorFlow 2.x with a somewhat different architecture (a different number of layers), based on NVIDIA's open-source PyTorch implementation.

ghost commented 2 years ago

> I changed tts_cleaner_names to ["transliteration_cleaners"] in the hparams.py file, but accented letters keep being replaced by their unaccented counterparts during training, even though the train.txt file contains all the accented letters. Is this just a minor display issue, or is this filtered text actually what feeds the model?

transliteration_cleaners converts a text string to ASCII, which removes accents from characters. If you want to keep the accents, use basic_cleaners and add cleaning features as needed.
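If basic_cleaners turns out to be insufficient, a minimal custom cleaner in the style of synthesizer/utils/cleaners.py could look like this (a sketch only: the helper names follow the keithito-style cleaners module this repo uses, so check your copy, and register the new name in tts_cleaner_names):

```python
import re

# A cleaner that lowercases and collapses whitespace but does NOT
# transliterate to ASCII, so accented French characters survive.
_whitespace_re = re.compile(r"\s+")

def lowercase(text):
    return text.lower()

def collapse_whitespace(text):
    return re.sub(_whitespace_re, " ", text)

def french_cleaners(text):
    """Pipeline for French text: keep accents, normalize case and spacing."""
    text = lowercase(text)
    text = collapse_whitespace(text)
    return text
```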

> By the way, is it OK to keep posting questions related to my experiment in this thread, or should a new issue be opened each time the topic differs?

Open a new issue if you have a question or topic that will be useful to many users. Otherwise keep posting in this thread and we'll leave it open so long as it's active.

I generally only answer questions that I find interesting and/or benefit the whole community. Priority is given to individuals contributing to the repo's codebase or training a model that will be shared publicly.

Ca-ressemble-a-du-fake commented 2 years ago

@Ananas120 I currently get 2 s / step on an RTX 3070, and you had 2 s / epoch on an older GPU. That's a huge gap, or did you mean 2 s / step?

Will the model be better with a higher batch_size, or does it not matter? With an 8 GB GPU I cannot reach a batch_size of 32, so maybe it is worth using Google Colab since they offer GPUs with more VRAM, isn't it?

I tried your online demo but got stuck in cell 4 with ModuleNotFoundError: No module named 'models'. I could download the model by replacing gdrive_sh with gdown --id XYZ -O and the same path you used. I am not sure how to import this models module.

Ananas120 commented 2 years ago

> @Ananas120 I currently get 2 s / step on an RTX 3070, and you had 2 s / epoch on an older GPU. That's a huge gap, or did you mean 2 s / step?
>
> Will the model be better with a higher batch_size, or does it not matter? With an 8 GB GPU I cannot reach a batch_size of 32, so maybe it is worth using Google Colab since they offer GPUs with more VRAM, isn't it?
>
> I tried your online demo but got stuck in cell 4 with ModuleNotFoundError: No module named 'models'. I could download the model by replacing gdrive_sh with gdown --id XYZ -O and the same path you used. I am not sure how to import this models module.

You can check yui's repo to see the training metrics (it's 7 s / step without reduction factor on the RTX 3080 with batch_size 32, and 10 s / step on my GTX with a batch_size of 16).

Did you clone the GitHub repo properly (1st cell with git clone)? The models module is actually in the repo.

I left the generated audio files there so you can listen to them without executing the code ;)

Ca-ressemble-a-du-fake commented 2 years ago

Oops, I missed all that! Thank you, I'll check again!

Ca-ressemble-a-du-fake commented 2 years ago

I did exactly what I did earlier and now it is working! The quality from the online demo is pretty good.

Ananas120 commented 2 years ago

Yeah, indeed, sometimes the cloning fails on Colab, which is quite strange, but anyway, thank you :D

Ca-ressemble-a-du-fake commented 2 years ago

@Ananas120 you used Mozilla Common Voice, among others, to train the synthesizer. I don't know about all the corpora, but at least in Corpus 1, which I downloaded, I noticed that samples often contain silence at the beginning and at the end, and I read somewhere (I can't find it now) that there must not be silence in those locations. Yet you don't seem to trim those silent parts when you preprocess this dataset. So should one actually bother removing silences from CV?

Ananas120 commented 2 years ago

You can find all the preprocessing not in the dataset loading but in the model itself, here in get_mel_input, which calls load_mel from the utils.audio.audio_io file.
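If you do want to strip the leading and trailing silence from Common Voice clips before preprocessing, a minimal sketch with librosa would be something like this (the top_db threshold and file names are assumptions to tune per dataset):

```python
import librosa
import soundfile as sf

def trim_edges(in_path: str, out_path: str, top_db: int = 30):
    """Trim leading/trailing silence from a clip; top_db is a threshold to tune."""
    wav, sr = librosa.load(in_path, sr=None)           # keep the original sample rate
    trimmed, _ = librosa.effects.trim(wav, top_db=top_db)
    sf.write(out_path, trimmed, sr)

# Example with hypothetical file names:
# trim_edges("common_voice_fr_123.wav", "trimmed/common_voice_fr_123.wav")
```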

ghost commented 2 years ago

Closing inactive issue. Feel free to continue the discussion.

bryant0918 commented 2 years ago

@Ca-ressemble-a-du-fake Would you mind sharing your trained French synthesizer?

Ca-ressemble-a-du-fake commented 2 years ago

@bryant0918 I have moved to CoquiTTS and haven't kept anything I made with this repository since it was discontinued late last year.

Ananas120 commented 2 years ago

You can also check this repo, where I have shared my French models (single- and multi-speaker).

Good luck !