CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training a new model based on LibriTTS #449

Closed ghost closed 4 years ago

ghost commented 4 years ago

@blue-fish, would it be useful if I were to offer a GPU (2080 Ti) to contribute to training a new model based on LibriTTS? I have yet to train any models and would gladly exchange GPU time for an opportunity to learn. I wonder how long it would take on a single 2080 Ti.

Originally posted by @mbdash in https://github.com/CorentinJ/Real-Time-Voice-Cloning/pull/441#issuecomment-663076421

ghost commented 4 years ago

@mbdash I just noticed this. This would be a really nice contribution if you are up for it!

On the pretrained models page it says the synthesizer was trained in a week on 4 GPUs (1080ti). If you are not willing to tie up your GPU for a full month, it will still be helpful if you can get to a partially-trained model that has intelligible speech so others can continue training and finetuning.

Training instructions for synthesizer

  1. Pull the latest copy of the repo to get LibriTTS support in #441.
  2. Download LibriTTS "train-clean-100" and "train-clean-360" from here: https://openslr.org/60/
    • While it is downloading, enable tensorflow GPU support if not already done
  3. Make a datasets folder; it can be on an external drive if you don't have enough local storage (this will consume 150-200 GB)
  4. Extract LibriTTS downloads to this path: datasets/LibriTTS
  5. Generate mel spectrograms for training: python synthesizer_preprocess_audio.py path/to/datasets_folder --no_alignments --datasets_name LibriTTS
  6. Generate embeddings for training: python synthesizer_preprocess_embeds.py path/to/datasets_folder/SV2TTS/synthesizer
  7. Start training from scratch: python synthesizer_train.py new_model_name path/to/datasets_folder/SV2TTS/synthesizer
    • You will start seeing wavs when it reaches each checkpoint interval (default: 2,000 steps)

You can quit and resume training at any time, though you will lose all progress since the last checkpoint. It will be interesting to see how well it does with default hparams.

ghost commented 4 years ago

From what I understand, LibriTTS offers several advantages over LibriSpeech:

  1. The transcripts contain punctuation so the model will respond to it instead of ignoring it as it does currently.
  2. Audio has been split into smaller segments, making alignments unnecessary.
  3. Higher sampling rate of 24 kHz instead of 16 kHz.

We should consider updating the hparams so we can ultimately generate 24 kHz audio from this dataset.

@CorentinJ also suggests reducing the max allowable utterance duration (these hparams are used in synthesizer/preprocess.py): https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/054f16ecc186d8d4fa280a890a67418e6b9667a8/synthesizer/hparams.py#L95-L103

I don't have any solutions for the other suggestions mentioned (switching attention paradigm, removing speakers with bad prosody): https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-653996443

mbdash commented 4 years ago

Ok, I will sync LibriTTS overnight, try to set this up over the weekend and get the GPU working on it.

mbdash commented 4 years ago

Update 2020-07-25 22h20 EST:

Step 5 (generate mel spectrograms for training) is currently at 25%; all CPUs available to the VM are at full load.

For posterity: the step 5 command originally had a typo, missing the "s" in the flag --datasets_name. The correct command is: python synthesizer_preprocess_audio.py ~/rtvc_LibriTTS/datasets --no_alignments --datasets_name LibriTTS

ghost commented 4 years ago

Thanks for the update and correction.

Let's run training with the default hparams. We're already switching from LibriSpeech to LibriTTS and it's best to only change one parameter at a time.

mbdash commented 4 years ago

Hi, I have an error because synthesizer_preprocess_embeds.py wants a pretrained model.

I fail to understand why we need to provide a pretrained model when trying to train from scratch, but I will point it at the latest pretrained model until told otherwise.

(rtvc_py373) username@vm:~/github/Real-Time-Voice-Cloning$ python synthesizer_preprocess_embeds.py /mnt/nfs/a_share/rtvc_LibriTTS/datasets/SV2TTS/synthesizer/
Arguments:
    synthesizer_root:      /mnt/nfs/a_share/rtvc_LibriTTS/datasets/SV2TTS/synthesizer
    encoder_model_fpath:   encoder/saved_models/pretrained.pt
    n_processes:           4

Embedding:   0%|                                                                                                                                                  | 0/111521 [00:02<?, ?utterances/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/username/github/Real-Time-Voice-Cloning/synthesizer/preprocess.py", line 228, in embed_utterance
    encoder.load_model(encoder_model_fpath)
  File "/home/username/github/Real-Time-Voice-Cloning/encoder/inference.py", line 33, in load_model
    checkpoint = torch.load(weights_fpath, _device)
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/site-packages/torch/serialization.py", line 384, in load
    f = f.open('rb')
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/pathlib.py", line 1186, in open
    opener=self._opener)
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/pathlib.py", line 1039, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "synthesizer_preprocess_embeds.py", line 25, in <module>
    create_embeddings(**vars(args))
  File "/home/username/github/Real-Time-Voice-Cloning/synthesizer/preprocess.py", line 254, in create_embeddings
    list(tqdm(job, "Embedding", len(fpaths), unit="utterances"))
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/site-packages/tqdm/std.py", line 1130, in __iter__
    for obj in iterable:
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'

ghost commented 4 years ago

@mbdash Look at the middle part of the image here and hopefully it will make more sense why the pretrained encoder model is needed to generate embeddings for synthesizer training: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/30#issuecomment-508381648 Please speak up if it still doesn't make sense.

Think of the synthesizer as a black box with 2 inputs: an embedding, and text to synthesize. Different speakers sound different even when speaking the same text. The synthesizer uses the embedding to impart that voice information in the mel spectrogram that it produces as output. The synthesizer gets the embedding from the encoder, which in turn can be thought of as a black box that turns a speaker's wav data into an embedding.

So you need to run the encoder model to get the embedding, and you get the error message because it can't find the model.
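To make this concrete, here is a rough sketch of that pipeline, loosely based on how demo_cli.py wires the models together (paths are placeholders):

from pathlib import Path
from encoder import inference as encoder
from synthesizer.inference import Synthesizer

# Load the pretrained encoder and a synthesizer checkpoint (placeholder paths).
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/logs-new_model_name/taco_pretrained"))

wav = encoder.preprocess_wav(Path("reference_speaker.wav"))   # load + normalize a reference clip
embed = encoder.embed_utterance(wav)                          # encoder: wav -> speaker embedding
specs = synthesizer.synthesize_spectrograms(["Text to synthesize."], [embed])
# A vocoder (or Griffin-Lim) then inverts specs[0] into a waveform.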

mbdash commented 4 years ago

Ok, great, if you tell me it is as designed I will continue. It is currently at 50% on the Embedding step.

I opened the image but I need slightly more coffee to really look at it ;-)

thx for the quick response.

mbdash commented 4 years ago

OK, I started synthesizer_train.py and it is at step 250 now (2020-07-26 12h24 EST).


ghost commented 4 years ago

Wow that is fast. At that rate it will take just over 4 days to reach the 278k steps in the current model. And it will train even faster as the model gets better. Please share some griffin-lim wavs when they become intelligible.

mbdash commented 4 years ago

step 2850 @ 13H15 EST so approx 2500 steps in ~1h

ghost commented 4 years ago

Generated 64 train batches of size 36 in 21.814 sec

This seems to be a bottleneck; is the data on an external drive? I'm averaging about 14 sec for batch generation on a slow CPU, but the data lives on an SSD.

mbdash commented 4 years ago

Latest @ 14h25: [training progress screenshot]

My setup is not optimal. The data currently resides on the HDD side of my array; I just added a new SSD but it is not being used atm. When I stop the training, I will move the data to a share living on the SSD or even a passthrough NVMe.

ghost commented 4 years ago

If that's a typical batch generation time now, 2.3 sec for 64 batches is just 0.036 sec per step or 1 hour over 100,000 steps. Not worth it to transfer the data over to the SSD in my opinion.

mbdash commented 4 years ago

step 10k reached @ 15h30 so we can estimate ~10k steps / 3h

Where are the wavs you want me to share? When I try to ls datasets/SV2TTS/synthesizer/audio, my terminal hangs.


ghost commented 4 years ago

Where are the wavs you want me to share?

Check out the training logs area: synthesizer/saved_models/logs-new_model_name/wavs

The files in the plots folder are also interesting and show how well the new synthesizer model is working.

mbdash commented 4 years ago

rtvc_libritts_s_mdl @ 10k steps

Cheers! rtvc_libritts_s_mdl_10k.zip

ghost commented 4 years ago

Overall, the synthesizer training seems to be progressing nicely! I'll be interested to see as many plots and wavs as you care to share, but otherwise it's a lot of waiting now.

It would be nice if you could share in-progress checkpoints, say starting at 100k steps and every 50k steps after that. Or generate some samples using the toolbox. I've never trained from the start and it would be interesting to see the progression.

mbdash commented 4 years ago

rtvc_libritts_s_mdl @ 20k steps in ~6h

rtvc_libritts_s_mdl_20k.zip

ghost commented 4 years ago

I used the original pretrained models (hereafter, LibriSpeech_278k) to synthesize the same utterance as the 20k example, also inverting it with Griffin-Lim. The clarity is about the same but there is less harshness with LibriSpeech_278k (not sure what the correct technical term for that is).

"When he spoke of the execution he wanted to pass over the horrible details, but Natasha insisted that he should not omit anything."

You can definitely hear more of a pause after "details" in the 20k wav so the new model is learning how to deal with punctuation!

4592_22178_000024_000001.zip

mbdash commented 4 years ago

rtvc_libritts_s_mdl @ 74k steps in ~21h

rtvc_libritts_s_mdl_74k.zip

ghost commented 4 years ago

@mbdash From that batch I find the 50k sample remarkable. Your LibriTTS-based model is much closer to the ground truth, capturing the effect of the 3 commas and question mark on prosody.

For this one clip I say your model performs better than LibriSpeech_278k but it will be interesting to see how well the model generalizes to new voices (embeddings) unseen during training.

As they sat thus something brushed against peter as light as a kiss, and stayed there, as if saying timidly, "Can I be of any use?"

step-50000_comparison.zip

mbdash commented 4 years ago

Yes, I keep listening to them, paying attention to the details, and I can clearly hear the TTS using the punctuation.

ghost commented 4 years ago

How long does it take to run each step now? Clearly it is progressing faster than the 1.3-1.4 sec/step shown in the screenshot from yesterday.

mbdash commented 4 years ago

I don't think the numbers are very accurate.

I try counting Mississippis, but the steps print way faster than that, and sometimes in quick bursts.

ghost commented 4 years ago

It is a moving average of the last 100 steps:

https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/054f16ecc186d8d4fa280a890a67418e6b9667a8/synthesizer/train.py#L165

https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/054f16ecc186d8d4fa280a890a67418e6b9667a8/synthesizer/train.py#L207-L216
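For intuition, here is a simplified sketch of that kind of windowed average (not the repo's actual code, which is in the links above):

from collections import deque
import time

step_times = deque(maxlen=100)   # durations of the most recent 100 steps
t_prev = time.time()

def log_step():
    # Call once per training step to print the moving-average step time.
    global t_prev
    now = time.time()
    step_times.append(now - t_prev)
    t_prev = now
    avg = sum(step_times) / len(step_times)
    print("{:.2f} sec/step (averaged over last {} steps)".format(avg, len(step_times)))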

mbdash commented 4 years ago

102k steps reached in approx ~30h, I think.

rtvc_libritts_s_mdl_102k.zip

ghost commented 4 years ago

Can you make a backup of the 100k model checkpoint (or one that is in this range)? Just in case we want to come back to it later.

Is the average loss still coming down? Perhaps it converges much faster with LibriTTS. When I did the single-speaker finetuning on LibriSpeech p211 the synthesizer loss started at 0.70, and you are already in the 0.60-0.65 range.

mbdash commented 4 years ago

Which files do you want me to back up so I don't mess this up? I don't want to lose any of that work (117k steps now). I'll zip it and share.

ghost commented 4 years ago

The files in synthesizer/saved_models/logs-new_model_name/taco_pretrained

What we need is:

tacotron_model.ckpt-######.data-00000-of-00001
tacotron_model.ckpt-######.meta
tacotron_model.ckpt-######.index
checkpoint

Every time it reaches a new checkpoint interval it overwrites the oldest checkpoint. It's good to keep a few intermediate checkpoints in case something gets messed up along the way.
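For example, a backup can be as simple as copying those files from the taco_pretrained folder to somewhere safe (step number and destination are placeholders):

cp tacotron_model.ckpt-100000.* checkpoint /path/to/backup/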

ghost commented 4 years ago

For the next synth model, I will update the code to include a few user-defined custom embedding parameters that are concatenated with the speaker embedding. These would all default to zero, but could be used to represent things like language or accent to facilitate finetuning and perhaps speed up training if the classification is known.

Currently, we cannot finetune an accent on the models in a way that generalizes to new speakers for voice cloning (see https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-664704917). My hypothesis is that the accent is attributed to the speaker embedding (of the dataset used for finetuning), so it never generalizes. This would give us a tool to help get around that limitation.

Edit: This is essentially implementing Global Style Tokens: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/230 . Will use Mozilla's repo as a guide to follow.
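To illustrate the concatenation idea, a hedged sketch (sizes and names are assumptions, not the repo's actual API):

import torch

# Append a few user-defined conditioning values (all zero by default) to the
# 256-dim speaker embedding before it is passed to the synthesizer.
speaker_embed = torch.zeros(256)    # embedding produced by the encoder
custom_params = torch.zeros(4)      # e.g. language / accent flags, default 0
conditioned = torch.cat([speaker_embed, custom_params], dim=0)   # shape: (260,)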

mbdash commented 4 years ago

The synthesizer training reached 200k steps; I stopped it to give the server a break.

I am still downloading the datasets for the encoder; I will get started on it tomorrow.

ghost commented 4 years ago

@mbdash Are you able to share the 200k checkpoint files or vocoded samples at the very least? I'd like to see how well the 200k model performs!

mbdash commented 4 years ago

Synth trained on LibriTTS for 200k steps with the old/original encoder.

https://drive.google.com/drive/folders/1ah6QNyB8jIcFuKusPOVdx0pPIZxeZeul?usp=sharing

Let me know if the link works and if any files are missing.

ghost commented 4 years ago

Thanks @mbdash ! I got it to work but needed to put it in a folder structure like this:

logs-LibriTTS_200k
    * taco_pretrained
        * checkpoint
        * tacotron_model.ckpt-200000.data-00000-of-00001
        * tacotron_model.ckpt-200000.index
        * tacotron_model.ckpt-200000.meta

The checkpoint is not included but it is easy enough to make it. It is a text file with a single line:

model_checkpoint_path: "tacotron_model.ckpt-200000"
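For example, from the taco_pretrained folder (adjust the step number to match your checkpoint):

echo 'model_checkpoint_path: "tacotron_model.ckpt-200000"' > checkpoint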

So far I am finding cloned voices sound nearly identical to Corentin's LibriSpeech_278k model, with better performance for very short text inputs (1-5 words). It is still liable to have gaps, but they are not multiple seconds like we have with LibriSpeech_278k. The synthesizer can fail spectacularly, but this is a rare exception and not the norm. Some punctuation has an effect (periods and commas), but I don't notice anything with question marks. I think question marks would be better handled using a global style token like we are discussing in #230.

Overall an improvement over the existing model, though a slight one. This is all we could expect.

mbdash commented 4 years ago

Great to hear. I am still downloading the VoxCeleb files. Once done, I will train the encoder and we can try training the synth from scratch again.

ghost commented 4 years ago

If anyone else is silently following along I would appreciate any comments on the LibriTTS_200k model (https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/449#issuecomment-665645153) so we can use that feedback to make the next one better.

shoegazerstella commented 4 years ago

Hi @mbdash, and thank you for sharing this new synth model! I have tried it and the voice seems identical to the one generated by the old synth model. To me, the old one was also fairly similar to the input voice. However, I find the new one to be noisier than the old one. Do you think you could achieve better performance by also training the encoder? Is the noise due to some imperfection in the embedding computation phase?

ghost commented 4 years ago

@shoegazerstella Thank you for reporting the issue, would you please share some audio samples with us that demonstrate what you are talking about?

Just to speculate, the audio preprocessing could be adding noise or other artifacts to the sound files; it is worth doing a before-and-after comparison. LibriTTS is 24 kHz instead of the 16 kHz of LibriSpeech (used to train the original models), and since that is not an integer ratio, our training data has to be interpolated as it is resampled. The librosa resampling process can be found in librosa/core/audio.py (the actual resampling is done by scipy or resampy).

However I think that is unlikely. Could also be due to fewer training steps (200k vs 278k). Also LibriSpeech utterances are longer on average than LibriTTS so for a given number of steps I would expect a more refined model from LibriSpeech.
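For reference, a minimal sketch of the 24 kHz -> 16 kHz resampling mentioned above (librosa API; the file path is a placeholder):

import librosa

# 24,000 Hz -> 16,000 Hz is not an integer ratio, so the samples must be interpolated.
wav, sr = librosa.load("libritts_clip.wav", sr=None)             # load at the native 24 kHz
wav_16k = librosa.resample(wav, orig_sr=sr, target_sr=16000)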

From https://arxiv.org/pdf/1904.02882v1.pdf

[Figure 1 from the LibriTTS paper]

shoegazerstella commented 4 years ago

Hi @blue-fish, sure, I can share some examples here:

Thank you for the explanation of the preprocessing steps! I have one question: was the model trained from scratch on LibriTTS, or did you start from a pretrained model trained on LibriSpeech_278k? Do you think that approach could make sense for improving its performance?

ghost commented 4 years ago

Thanks for sharing the samples @shoegazerstella ! The increased noise on LibriTTS_200k is quite obvious. In addition to more training I think it could also benefit from a new vocoder.

LibriTTS_200k is trained from scratch. We have several problems with LibriSpeech_278k, the most annoying of which is the long gaps that appear in the middle of spectrograms (#53). The training from scratch is part of an effort to fix these issues: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-653996443 .

I think the next step is to lower max_mel_frames and find some way to clean up LibriTTS (probably calculate the ratio of wav to transcript lengths and removing outliers).
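A rough sketch of that kind of filter, assuming the SV2TTS train.txt metadata layout (audio | mel | embed | timesteps | mel frames | text); the filenames and thresholds are guesses:

from pathlib import Path
import numpy as np

# Flag utterances whose audio-length-to-transcript-length ratio is far from the corpus median.
entries, ratios = [], []
with Path("train.txt").open("r") as f:
    for line in f:
        fields = line.split("|")
        ratio = int(fields[3]) / max(len(fields[5].strip()), 1)   # timesteps per character
        entries.append((line, ratio))
        ratios.append(ratio)

median = np.median(ratios)
kept = [line for line, r in entries if 0.5 * median <= r <= 2.0 * median]
with Path("train_filtered.txt").open("w") as f:
    f.writelines(kept)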

Switching to a pytorch-based synthesizer in #447 may also help since the Rayhane-mamah tacotron that we currently use has some known bugs that would go away by switching to fatchord's implementation in WaveRNN.

ghost commented 4 years ago

Would anyone else like to contribute a GPU to help develop a better synthesizer model? Reply here and get started by preprocessing LibriTTS: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/449#issuecomment-663785345

shoegazerstella commented 4 years ago

Hi @blue-fish, we should be able to contribute to retraining the model too. The most we can use is V100 GPUs. I'll run some trials and see how many we can provide. How long do you think it would take, if not to fully complete training, then at least to reach something you can continue and finish afterwards? I am now downloading LibriTTS and will proceed with its preprocessing following the steps you suggested in the comment above. I will let you know before starting training so we can discuss whether any hparams need to be changed.

ghost commented 4 years ago

@shoegazerstella Thank you so much! I expect it will take 4-7 days to get a pretrained model for each config. Maybe half that if we're just testing hparams and not training to perfection. As a reference point, @mbdash trained LibriTTS_200k in just over 2 days on a 2080 Ti. Please download the torch-based synthesizer from #472. This will be our new code base, which will eventually support global style tokens (#230).

Since putting out the request for help, I discovered that we will need a new vocoder so we should take this opportunity to increase the sample rate to 22,050 or 24,000 Hz. This will require preprocessing to be restarted, but we will get better audio quality in the end.

Do you need me to push the updated hparams to my fork, or do you prefer to figure it out yourself? Note the preprocessing scripts in #472 still reference the old synth, so you will need to modify the old synth's hparams for preprocessing.

ghost commented 4 years ago

I notice that at a low number of steps (say 25k), inference is very sensitive to trailing punctuation. For example, "Hello world" (top plot) synthesizes with a lot of trailing emptiness, while "Hello world." (bottom plot) terminates cleanly. The LibriTTS_200k model from @mbdash shows that this can be overcome with additional training, but I do not like this behavior.

[Plots comparing "Hello world" (top) and "Hello world." (bottom)]

Now experimenting with stripping trailing punctuation, so that the end-of-sequence symbol "~" is used as the indication of when to stop, instead of the punctuation. If it works well I will add an hparam to ignore punctuation at the end of a text.

Also, now restricting the training set to 500 mel frames or less (default 900) to avoid long silences in the middle of utterances (https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-653996443). Here is the code snippet I use to post-process datasets_root/SV2TTS/synthesizer/train.txt to implement both of these changes:

from pathlib import Path
import string

# Metadata fields: audio filename | mel filename | embed filename | timesteps | mel frames | text
max_frames = 500

with Path("train.txt").open("r") as metadata_file:
    metadata = [line.split("|") for line in metadata_file]

with Path("train_edit.txt").open("w") as output_file:
    for fields in metadata:
        if int(fields[4]) > max_frames:
            continue
        text = fields[5].strip().strip(string.punctuation)  # first strip() removes the newline
        output_file.write("|".join(fields[:5] + [text]) + "\n")
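(Presumably you would then back up the original train.txt and replace it with train_edit.txt so that training picks up the filtered metadata.)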

shoegazerstella commented 4 years ago

Hi @blue-fish, so I cloned your fork.

Do you need me to push the updated hparams to my fork, or do you prefer to figure it out yourself? Note the preprocessing scripts in #472 still reference the old synth, so you will need to modify the old synth's hparams for preprocessing.

For preprocessing, I am modifying the hparams here; is that correct? I will change the sample rate to 22,050. Do I also need to change hop and win_length accordingly? How can I figure out what values to assign?

Thanks!

ghost commented 4 years ago

Hi @shoegazerstella !

Something else I discovered since then: I made a mistake in how I was passing the data to the vocoder. Once I fixed it, I found that the original vocoder (16,000 Hz) works quite well. Since I am already training a model at 16,000 Hz, why don't you use 22,050 Hz for better quality? We don't have a 22,050 Hz vocoder model, so it will be a nice contribution.

I have also had good results with changing max_mel_frames to 500. This has the following benefits:

  1. Shorter utterances are less likely to have long pauses
  2. Trains faster
  3. Requires less GPU memory, allowing larger batch sizes

What I am currently struggling with is punctuation. If my text has a comma, then my model introduces a 3-4 second pause. Additional training should fix it.

ghost commented 4 years ago

@shoegazerstella You might want to run synthesizer_train.py with -s 500 to save the model every 500 steps (that way you do not lose too much progress when stopping and restarting)
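For example (model name and path as in the earlier instructions):

python synthesizer_train.py new_model_name path/to/datasets_folder/SV2TTS/synthesizer -s 500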

shoegazerstella commented 4 years ago

Hi @blue-fish thanks a lot for your help! Training is now in progress, the configuration follows the parameters you suggested above.

I had another little issue similar to https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/439#issuecomment-673349904, so it seems it is processing only 24353 samples. Is that correct? Thanks!

Initialising Tacotron Model...

Trainable Parameters: 24.888M

Starting the training of Tacotron from scratch

Using inputs from:
        DATA/SV2TTS/synthesizer/train.txt
        DATA/SV2TTS/synthesizer/mels
        DATA/SV2TTS/synthesizer/embeds

Found 24353 samples
+----------------+------------+---------------+------------------+
| Steps with r=7 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   10k Steps    |     32     |     0.001     |        7         |
+----------------+------------+---------------+------------------+

/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py:211: RuntimeWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
  self.dropout, self.training, self.bidirectional, self.batch_first)
| Epoch: 1/14 (762/762) | Loss: 0.8026 | 1.0 steps/s | Step: 0k |
| Epoch: 2/14 (762/762) | Loss: 0.7637 | 1.0 steps/s | Step: 1k |
| Epoch: 3/14 (476/762) | Loss: 0.7511 | 1.0 steps/s | Step: 2k | Input at step 2000: my dear child, i said grandly, do you really suppose i am afraid of that poor wretch?~__________________________
| Epoch: 3/14 (762/762) | Loss: 0.7460 | 1.0 steps/s | Step: 2k |
| Epoch: 4/14 (361/762) | Loss: 0.7274 | 1.0 steps/s | Step: 2k | 

shoegazerstella commented 4 years ago

I restarted training from scratch with the correct number of samples; I am now at step 8k. I will share some spectrogram plots and wavs later.