CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training a new encoder model #458

Closed. ghost closed this issue 3 years ago.

ghost commented 4 years ago

In #126 it is mentioned that most of the ability to clone voices lies in the encoder. @mbdash is contributing a GPU to help train a better encoder model.

Instructions

  1. Download the LibriSpeech/train-other-500 and VoxCeleb 1/2 datasets. Extract them to your datasets_root folder as follows:
    • LibriSpeech: train-other-500 (extract as LibriSpeech/train-other-500)
    • VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv)
    • VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev)
  2. Change model_hidden_size to 768 in encoder/params_model.py (see the sketch after this list)
  3. python encoder_preprocess.py <datasets_root>
  4. Open a separate terminal and start visdom
  5. python encoder_train.py new_model_name <datasets_root>/SV2TTS/encoder
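
For step 2, encoder/params_model.py should end up looking roughly like this. This is a sketch: only model_hidden_size changes, and the remaining values are what I believe are the repo defaults, so double-check them against your checkout.

# encoder/params_model.py after the edit for step 2 (sketch)

## Model parameters
model_hidden_size = 768      # was 256
model_embedding_size = 256   # the final speaker embedding size is unchanged
model_num_layers = 3

## Training parameters (repo defaults, unchanged)
learning_rate_init = 1e-4
speakers_per_batch = 64
utterances_per_speaker = 10
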
ghost commented 4 years ago

I have a tutorial for you @mbdash (this was a good learning experience for me too).

  1. In encoder/params_model.py update model_hidden_size = 768 if you haven't already
  2. Initialize a new model with the correct dimensions, save it after 1 step then ctrl+c to stop training
    python encoder_train.py new_model datasets_root/SV2TTS/encoder/ -b 1
  3. Verify that it generated the file encoder/saved_models/new_model_backups/new_model_bak_000001.pt
  4. Download english_run.pt, the 768/768 English model trained to 2,143,500 steps from https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-671675613
  5. Save this gist as transfer_encoder_weights.py. PyTorch model checkpoints are simple dictionaries, so it is trivial to make edits.
  6. Put the files from steps 3-5 in the same location
  7. Run the script
    python transfer_encoder_weights.py
  8. It saves a file called modified_encoder.pt; move it to encoder/saved_models
  9. We only want to train the linear transformation at the end of the model that projects the final hidden layer (size 768) down to the desired embedding size (256), so we set requires_grad=False on the model elements that we don't want to update (see the sketch after this list)
  10. Now train the modified model that we created
    python encoder_train.py modified_encoder datasets_root/SV2TTS/encoder/
  11. Let it run until you can go 1,000 steps without the loss spiking above 0.1. At this point we will know that the nn.Linear elements are properly set for the encoder weights that we imported
  12. Revert the changes from step 9 to re-enable grad on all model elements
  13. Continue training using the command in step 10
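
For step 9, here is a minimal sketch of the freeze, assuming the SpeakerEncoder in encoder/model.py exposes the final projection as the attribute named linear (adjust the name if your checkout differs). It can go in encoder/train.py right after the model is created (the variable is called model there):

# Step 9 sketch: train only the final 768 -> 256 projection, freeze everything else.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("linear")

# Sanity check: only linear.weight and linear.bias should remain trainable.
print([name for name, p in model.named_parameters() if p.requires_grad])

For step 12, remove those lines again (or set requires_grad back to True on every parameter) before resuming training.
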
ghost commented 4 years ago

And @sberryman is right! It looks like the synth needs to be retrained. I finetuned the 768/256 encoder to 2,144,100 steps (added 600 steps) and got garbage out when I tried to synthesize text. This result makes sense in the context of how the encoder is optimized; for a given utterance, the loss function doesn't care about the specific values of the embedding as long as it is close to other embeds derived from the same speaker, and far from utterance embeds of other speakers.

ghost commented 4 years ago

It looks like the synth needs to be retrained.

To make the synth more portable we could run the speaker embedding through a linear projection before the concat with the encoder output. Then to make the synth compatible with a new encoder, we can use the same trick where we requires_grad=False on all model elements except the linear projection to train it. After the loss comes down we can re-enable grad to finetune the synth.
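
A minimal sketch of that idea (not the repo's actual synthesizer code; the tensor shapes are assumptions about where the concat happens):

import torch
import torch.nn as nn

class SpeakerEmbeddingAdapter(nn.Module):
    """Project the speaker embedding before it is concatenated with the text-encoder output."""
    def __init__(self, embed_size=256):
        super().__init__()
        self.proj = nn.Linear(embed_size, embed_size)

    def forward(self, text_enc_out, speaker_embed):
        # text_enc_out: (batch, time, enc_dim), speaker_embed: (batch, embed_size)
        e = self.proj(speaker_embed).unsqueeze(1).expand(-1, text_enc_out.size(1), -1)
        return torch.cat([text_enc_out, e], dim=-1)

To adapt the synth to a new speaker encoder, set requires_grad=False on everything except self.proj, train until the loss recovers, then unfreeze and finetune the whole synth.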

sberryman commented 4 years ago

Looks like I missed a lot, but you are on the right track. Anything you modify upstream requires all downstream modules to be retrained. I don't think you are going to get lucky trying to hack existing weights into the stream. It will be interesting to hear whether your linear layer idea works, though.

mbdash commented 4 years ago

I did re-preprocess the dataset, which had broken log files,

but I am currently still stuck here (error screenshot):

ghost commented 4 years ago

Exception: Can't create RandomCycler from an empty collection

That got me several times while testing out encoder training (to demonstrate https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-673341585 would work).

Basically the issue is that training is crashing on nearly empty folders (containing only _sources.txt), which can be dealt with easily in Linux:

(Updated command based on feedback below)

find datasets_root/SV2TTS/encoder -type f -name _sources.txt -empty -exec rm {} \;  #delete empty _sources.txt files
find datasets_root/SV2TTS/encoder -type d -empty -exec rmdir {} \;  #delete the containing folders
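
If you're not on Linux, here is a rough Python equivalent of that cleanup (a sketch: it removes any speaker folder whose _sources.txt is missing or empty, or that contains no preprocessed .npy utterances, which is the condition that trips RandomCycler):

# Rough Python equivalent of the two find commands above.
from pathlib import Path
import shutil

encoder_root = Path("datasets_root/SV2TTS/encoder")
for speaker_dir in (d for d in encoder_root.iterdir() if d.is_dir()):
    sources = speaker_dir / "_sources.txt"
    if (not sources.exists() or sources.stat().st_size == 0
            or not any(speaker_dir.glob("*.npy"))):
        shutil.rmtree(speaker_dir)
        print("Removed", speaker_dir)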
mbdash commented 4 years ago

The 1st command returned nothing...

However, looking at it, I found a lot of CommonVoice folders with only a _sources.txt file.

The command above should be run on the dataset prior to training, I guess.

Update: OK, I got lazy and basically used Bitvise SFTP to list the folders, sort them by size, and delete any folder <= 34 kB. And the training started.

So basically, it is running on LibriSpeech, CommonVoice and VCTK, and VoxCeleb 1&2 are ready in another folder.

mbdash commented 4 years ago

I have a tutorial for you @mbdash (this was a good learning experience for me too). [quoting the 13-step weight-transfer tutorial from above]
Mkay.... so I guess I need to look into this tomorrow; that is a lot of steps when tired.

mbdash commented 4 years ago

LibriSpeech + CommonVoice + VCTK

With none of the instructions above applied.

(training screenshot)

ghost commented 4 years ago

Nice!! It took a while to resolve all the dataset issues but the model is finally training 😄

Please share some visdom screenshots from time to time: https://user-images.githubusercontent.com/324437/65079232-877add80-d953-11e9-921e-abe695803f53.png

If you want to try the tutorial to transfer weights from the 768/768 encoder, it can be done without the GPU. Just add the following to the top of encoder_train.py before you run it, to use the CPU.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

We will learn more from training from scratch than trying to transfer encoder weights. The tutorial is there if you want to learn how to transfer weights from a pretrained model to another model. It should take about 30 minutes to complete (plus compute time for training).

mbdash commented 4 years ago

LibriSpeech, CommonVoice & VCTK only; model_hidden_size = 768; 72k steps

RTVC_encoder_mdl_ls_cv_vctk_vc12 (currently 72k steps for LibriSpeech, CommonVoice & VCTK only) https://drive.google.com/drive/folders/1hg65MdHOA_b20RzF5roA2pnoFDZy4oWQ?usp=sharing

(UMAP plot: encoder_mdl_ls_cv_vctk_vc12_umap_072000)

Edit: I used the flag --no_visdom since it was not working in the visdom view. Edit 2: I will bring it to 100k steps before adding VoxCeleb 1&2, unless you have a different opinion. I will try your weight-transfer tutorial later, once we have an encoder trained purely on the presets at the beginning of this thread.

ghost commented 4 years ago

i will bring it to 100k before adding VoxCeleb1&2, except if you have a different opinion.

It has figured out how to group utterances from the same speaker but not so much how to separate different speakers. I think you can let this go until 200-250k before adding VoxCeleb.

Also, there is no need to upload any .pt files at this time, just the training .png files. Though the final .pt checkpoint before adding VoxCeleb would be a helpful data point.

Edit: What is the current loss value?

mbdash commented 4 years ago

(screenshot of current loss)

ghost commented 4 years ago

Thanks @mbdash . Still think it would be beneficial to run it to 200-250k steps before adding in VoxCeleb to get the loss down. It would also help answer whether a good encoder can be obtained without VoxCeleb since that's a monster of a dataset.

So let it run for another 2-3 days and add in VoxCeleb at some convenient time during that interval?

mbdash commented 4 years ago

(UMAP plot: encoder_mdl_ls_cv_vctk_vc12_umap_093300)

mbdash commented 4 years ago

(training screenshot)

ghost commented 4 years ago

Thanks for the data point, I'll suggest adding in VoxCeleb any time after the loss is consistently less than 0.01. Please share a few UMAP plots tomorrow so we can see if there's improvement in cluster separation.

mbdash commented 4 years ago

(UMAP plots: encoder_mdl_ls_cv_vctk_vc12_umap_275700, encoder_mdl_ls_cv_vctk_vc12_umap_270000, encoder_mdl_ls_cv_vctk_vc12_umap_255000)

ghost commented 4 years ago

@mbdash Looking good! You can add the VoxCeleb sets whenever convenient.

mbdash commented 4 years ago

(UMAP plot: encoder_mdl_ls_cv_vctk_vc12_umap_315000)

(training screenshot)

Alright, we are way below the 0.01 loss target; I am going to add VoxCeleb.

mbdash commented 4 years ago

Here are the new numbers just after adding VoxCeleb:

(training screenshot)

ghost commented 4 years ago

Excellent! Could you share the 315k pre-VoxCeleb checkpoint? My hypothesis is that the (LibriSpeech+VCTK+CommonVoice) encoder should be good enough for utterances that are recorded under similar conditions. Adding VoxCeleb should make it perform better for voice recordings gathered in the wild.

While the speaker encoder should perform better for celebrities included in that dataset, unless the TTS is also trained on similar voices I don't think it will help voice cloning of celebrities.

mbdash commented 4 years ago

Encoder ls_cv_vctk_only, 315k steps, loss < 0.005

https://drive.google.com/drive/folders/1OkHpeV3i5fGzI6shhjY3nkpN9jXGk7Ak?usp=sharing

Here is the link for the progression of ls_cv_vctk_315k_plus_vc12

https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing

mbdash commented 4 years ago

For my personal project, if I wanted to clone my voice or a voice actor's, should I only re-train the synth, or both the encoder and synth (using only a dataset from a single voice)?

ghost commented 4 years ago

@mbdash For single-speaker finetuning, you should only retrain the synth. In #437, to make things converge faster we bypass the encoder and always feed the same speaker embedding input to the synth. This means there is no benefit to encoder finetuning on a single voice; in fact it would actually be harmful and increase the amount of synth training needed.

In https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-670248226 I made the observation that there are diminishing returns from improving the encoder when a single-speaker model is the goal.
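
For reference, the fixed-embedding trick from #437 amounts to something like the sketch below. The paths are placeholders, and the embeds folder layout is an assumption to check against your own synthesizer preprocessing output.

# Sketch: compute one speaker embedding from a few reference clips and reuse it
# for every utterance the synthesizer trains on (the #437 single-speaker trick).
import numpy as np
from pathlib import Path
from encoder import inference as encoder

encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

ref_wavs = [encoder.preprocess_wav(p) for p in Path("my_refs").glob("*.wav")]
fixed_embed = np.mean([encoder.embed_utterance(w) for w in ref_wavs], axis=0)
fixed_embed /= np.linalg.norm(fixed_embed)

# Overwrite the per-utterance embeddings produced by synthesizer preprocessing.
for embed_path in Path("datasets_root/SV2TTS/synthesizer/embeds").glob("*.npy"):
    np.save(embed_path, fixed_embed)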

mbdash commented 4 years ago

Thank you for your feedback.

Here is the update on the encoder training, with VoxCeleb added. The checkpoint has been uploaded to the Google Drive: https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing

(training screenshot)

(UMAP plot: encoder_mdl_ls_cv_vctk_vc12_umap_367500)

ustraymond commented 4 years ago

To get a better idea of the model's performance, should the EER be calculated on the VoxCeleb "test" set, or at least on the whole training set?

Any hints on how to modify the code (train.py?) to do so? Thanks!

ghost commented 4 years ago

@mbdash I am training VCTK with your 315k encoder. I deliberately avoided VoxCeleb since this model doesn't need to perform well on celebrities or speech data gathered in the wild. I'm trying to compare results with the SV2TTS authors, so I left out p240 and p260 from the training set. I spent an hour manually curating VCTK, throwing out about half of the speakers for various reasons (no UK or Irish accent, trying to help with #388; excessive unrelated sounds like fabric rustling or deep breaths before speaking each time). I'm also removing punctuation from the transcripts, as I don't have the compute power to train that aspect of it well.

For preprocessing and training I had to bring the batch size down to make it fit in my GPU's limited memory (4 GB), since the bigger encoder model is loaded in memory. There's an advantage to having a lightweight encoder for TTS.

ghost commented 4 years ago

@ustraymond This is one way of doing it:

  1. Make a folder datasets_root/SV2TTS/encoder_test/ and move some folders over from encoder to make a test set.
  2. Modify speaker_verification_dataset.py to take 2 paths, one for training and test. Modify the DataLoader in the same file to return a training batch and test batch.
  3. In encoder/train.py, you would run forward/backward pass as normal on the training batch, and then follow that up with a forward pass on the test set to get loss and eer for display purposes. I think the test part should be wrapped with with torch.no_grad() and you might need to set model.eval() too.

This would slow down training considerably if performed every step, so you might want to run this evaluation every 10 or 100 steps.
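
A rough sketch of what step 3 could look like inside the training loop of encoder/train.py, assuming you have built a second loader called test_loader over datasets_root/SV2TTS/encoder_test in the same way as the training loader:

# Periodic held-out evaluation (every 100 steps); `test_loader` is assumed to
# mirror the training SpeakerVerificationDataLoader but read from encoder_test.
if step % 100 == 0:
    model.eval()
    with torch.no_grad():
        test_batch = next(iter(test_loader))
        test_inputs = torch.from_numpy(test_batch.data).to(device)
        test_embeds = model(test_inputs).view(
            (speakers_per_batch, utterances_per_speaker, -1)).to(loss_device)
        test_loss, test_eer = model.loss(test_embeds)
    model.train()
    print("step %d   test loss: %.4f   test EER: %.4f" % (step, test_loss.item(), test_eer))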

ghost commented 4 years ago

@mueller91 wrote in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/126#issuecomment-604999848

Dear All, i've downloaded the models from @sberryman and adapted the hyper parameters accordingly. I created a few examples with them. I observe the following:

  1. the sound quality is pretty good (clearly understandable, no bleeps or blops etc.)
  2. the voice does not resemble the reference embedding. it's like a 'generic' voice.

I wonder why that is. Did anybody else experience this? Thanks!

Edit: For a while I thought I was also getting a "generic voice" using @mbdash 315k encoder trained on LibriSpeech, VCTK and CommonVoice. Please see https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-678747512 for the new results.

The parameters for this experiment are:

(If anyone's wondering why the original results were so bad, I accidentally used the hacked 768/256 encoder from https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-673341585 with a synth trained on embeds from mbdash 315k.)

mbdash commented 4 years ago

420k

https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing

(UMAP plot: encoder_mdl_ls_cv_vctk_vc12_umap_420000)

ghost commented 4 years ago

Fixed experiment, please see below for the results or https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-678664495 for how it was conducted.

Please disregard what I said here initially, those conclusions are incorrect.

ghost commented 4 years ago

The previous data is bad because I used the wrong encoder for testing with @mbdash 315k model. Here is a fair comparison (old = Corentin's encoder, new = mbdash 315k encoder): wav_comparison.zip

Now we have a different and unexpected issue where the synthesized voices are very similar regardless of the encoder used! They are indistinguishable from my point of view. (You will notice that the synth trained with the old encoder has an annoying sound artifact at the end. It does not appear when Griffin-Lim is used to invert the spectrogram, so it is an artifact of the vocoder.)

When the encoder is better trained on VoxCeleb I will repeat this experiment to see if the similarity of the cloned voice improves. The authors of SV2TTS obtained better results on this metric. Which parts of our model need to improve to match their quality?

mbdash commented 4 years ago

(UMAP plot: encoder_mdl_ls_cv_vctk_vc12_umap_500000)

ustraymond commented 4 years ago

@ustraymond This is one way of doing it: [quoting the test-set EER suggestion from above]

In the thesis: "In fact, we computed the test set EER to be 4.5%." and "We refer to GE2E and use 6 utterances for enrollment and compare those to 7 utterances."

It seems there is no code here to do such a calculation.

So I was trying to calculate the EER for the whole test set (40 speakers with 4000+ utterances); different speakers have different numbers of utterances.

Assume I calculate the embeddings of all utterances. What is the next step?

Repeat this step? https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder/model.py#L80

Assume each speaker has at least 13 utterances?

Get the centroid of each speaker based on 6 utterances (centroids_incl)?

For the other 7+ utterances, find their distance to centroids_incl, take the closest centroid and use that as the predicted label?

Then reuse these lines from model.py?

    # Snippet from https://yangcha.github.io/EER-ROC/ (needs sklearn.metrics.roc_curve,
    # scipy.optimize.brentq and scipy.interpolate.interp1d)
    fpr, tpr, thresholds = roc_curve(labels.flatten(), preds.flatten())
    eer = brentq(lambda x: 1. - x - interp1d(fpr, tpr)(x), 0., 1.)

I wonder how to calculate the 4.5% EER.

Thanks for any advice.
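
To make the question concrete, here is a sketch of the procedure I have in mind, following the 6-enrollment / 7-verification split from the thesis. It assumes embeds maps each speaker id to an (n, 256) array of L2-normalised utterance embeddings; how that dict gets built is up to you.

import numpy as np
from sklearn.metrics import roc_curve
from scipy.optimize import brentq
from scipy.interpolate import interp1d

def compute_eer(embeds, n_enroll=6, n_verify=7):
    speakers = sorted(embeds)
    # Enrollment: centroid of the first n_enroll utterances of each speaker.
    centroids = np.stack([embeds[s][:n_enroll].mean(axis=0) for s in speakers])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    # Verification: score every held-out utterance against every centroid.
    scores, labels = [], []
    for i, s in enumerate(speakers):
        for utt in embeds[s][n_enroll:n_enroll + n_verify]:
            sims = centroids @ utt  # cosine similarity to each centroid
            scores.extend(sims)
            labels.extend(int(j == i) for j in range(len(speakers)))

    fpr, tpr, _ = roc_curve(labels, scores)
    return brentq(lambda x: 1. - x - interp1d(fpr, tpr)(x), 0., 1.)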

ghost commented 4 years ago

@ustraymond Code for calculating EER is shared in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/61#issuecomment-514653922 and https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/126#issuecomment-530960239 . The code is identical but the context for discussion is slightly different.

mueller91 commented 4 years ago

Dear @mbdash, thank you for providing the GPU and publishing the models. One curious observation, though: I use your model to embed a batch of utterances and compute the inter- and intra-class cosine similarity (i.e. the cosine similarity for all pairs s_i, s_j where the speakers are different, or the same, respectively). I obtain a mean inter-similarity of around 0.45, and an intra-similarity of around 0.9.
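
For reference, a sketch of how those numbers can be computed, assuming embeds maps each speaker id to an (n, 256) array of L2-normalised utterance embeddings:

import itertools
import numpy as np

def similarity_stats(embeds):
    intra, inter = [], []
    for e in embeds.values():  # same-speaker pairs
        intra.extend((e @ e.T)[np.triu_indices(len(e), k=1)])
    for (_, e1), (_, e2) in itertools.combinations(embeds.items(), 2):
        inter.extend((e1 @ e2.T).ravel())  # cross-speaker pairs
    return float(np.mean(intra)), float(np.mean(inter))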

mbdash commented 4 years ago

@mueller91 I am not the guy you are looking for; the wise guy with the answers is @blue-fish.

Note that there are 2 encoder models. Be sure you are downloading the proper one.

The 1st one is "LibriSpeech + CommonVoice + VCTK only, until step 315k". Available here: https://drive.google.com/drive/folders/1OkHpeV3i5fGzI6shhjY3nkpN9jXGk7Ak?usp=sharing

The 2nd one is "LibriSpeech + CommonVoice + VCTK until step 315k + VoxCeleb 1&2": the encoder above with VoxCeleb 1 & 2 added to the dataset after step 315k (I stopped the training at 315k, added the VoxCeleb datasets, and resumed training from step 315k). I am constantly (daily) updating this new model. Available here: https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing

The latest uploaded is 525k steps. Currently I am locally at step 531k.

If I were you, I would wait until I reach 750k before doing anything with this encoder; I currently see the loss bouncing non-stop between 0.026 and 0.04.

(training screenshot)

ghost commented 4 years ago

@mueller91 I think your observations are explained by our continued use of ReLU (which is not a deliberate choice, we just used the repo code without modification). @sberryman removed ReLU in https://github.com/resemble-ai/Resemblyzer/issues/13 which causes inter-similarity to be centered around zero instead of 0.5 as in our case.

We will continue using ReLU as long as the encoder model needs to support Corentin's pretrained encoder which also uses it.
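
A quick numerical illustration of that ReLU effect (a toy example, not code from either repo): random embeddings that are non-negative, as they are after a final ReLU, have cosine similarities centred well above zero, while zero-mean embeddings centre around zero.

import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 10000, 256))
cos = lambda x, y: np.sum(x * y, axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
print("zero-mean :", cos(a, b).mean())                                # close to 0
print("post-ReLU :", cos(np.maximum(a, 0), np.maximum(b, 0)).mean())  # roughly 0.3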

sberryman commented 4 years ago

@blue-fish how does @mbdash's model work with the existing synth/vocoder? I would assume not very well, producing a generic voice?

If that is the case, @mbdash should stop training and start over with Tanh as the final activation. Then you'll use @mbdash's new model to train the synthesizer and vocoder from scratch (several weeks' worth of GPU time).

Replicating Corentin's work training from scratch would likely require well over 700 hours of training using two 1080 TI's. They are using much larger GPUs (40GB of memory I believe) at Resemble.ai to train more quickly. The 700 hours is a very rough estimate to illustrate the several weeks of training for each of the three models.

mueller91 commented 4 years ago

@sberryman Could you elaborate on why you chose Tanh as the final activation?

ghost commented 4 years ago

@sberryman When I train a VCTK-based synthesizer to 100k steps on my basic GPU, I get a very similar result for voice cloning regardless of whether I use Corentin's model or @mbdash 315k model (LS+VCTK+CV) as the speaker encoder for training and inference. See: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-678747512

That result was completely unexpected, and maybe I should delete my pycache just to be very sure that I performed the experiment properly. But because of the different hidden unit sizes, it is impossible to accidentally pair the wrong encoder with a synthesizer.

Also, the pretrained encoder bundled with Resemblyzer is identical to the one in this repo, which I recall took ~20 days to train on a single 1080 TI.

sberryman commented 4 years ago

@mueller91 I used tanh as the final activation to force the values between -1 and 1 as opposed to 0-1 with ReLU. I never tried training with no final activation so I'm not sure how that would turn out to be honest.

https://github.com/resemble-ai/Resemblyzer/issues/13#issuecomment-557269666

Edit: I didn't do a good job documenting and checking in code throughout all the experiments, so there is a chance the model I trained for 1M+ steps didn't have a final activation. @blue-fish probably knows that better than me at this point. There is a chance tanh was only used in an experiment where I was trying to build an encoder model based on raw waveform.

Edit 2: Corentin doesn't think it makes much of a difference though. https://github.com/resemble-ai/Resemblyzer/issues/15#issuecomment-555097662

Edit 3: This implementation doesn't use an activation function after the LSTM. https://github.com/HarryVolek/PyTorch_Speaker_Verification/blob/master/speech_embedder_net.py
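
For clarity, here is roughly how the three variants under discussion differ at the end of the encoder's forward pass. A sketch, not the exact code from any of these repos:

import torch

def project(hidden, linear, variant="relu"):
    """hidden: last LSTM hidden state, (batch, model_hidden_size); linear: the final nn.Linear to 256."""
    raw = linear(hidden)
    if variant == "relu":      # this repo / Corentin's pretrained encoder
        raw = torch.relu(raw)  # components are non-negative before normalisation
    elif variant == "tanh":    # the experiment described above
        raw = torch.tanh(raw)  # components lie in [-1, 1] before normalisation
    # variant == "none": HarryVolek-style, no activation after the linear layer
    return raw / torch.norm(raw, dim=1, keepdim=True)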

ghost commented 4 years ago

Edit: I didn't do a good job documenting and checking in code throughout all the experiments so there is a chance the model I trained for 1M+ steps didn't have a final activation. @blue-fish probably knows that better than me at this point.

There are no states associated with the activation, so it's not possible to tell just by looking at the checkpoint file. When I hacked the final linear layer in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-673341585 I noticed the loss came down very quickly using ReLU. The loss was already down to 0.01 within 10 steps of restarting training. So if I had to guess, that particular 768/768 English encoder was likely using ReLU.

@sberryman Thanks for digging up and sharing those additional links.

mbdash commented 4 years ago

At step 600k, the loss is still moving between 0.015 and 0.04, but it is more often hitting the lower values.

https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing

(training screenshot)

(UMAP plot: encoder_mdl_ls_cv_vctk_vc12_umap_600000)

sberryman commented 4 years ago

@mbdash training is looking good, still some overlap between clusters. Have you tried to plot cross-similarity matrices?

https://github.com/resemble-ai/Resemblyzer/issues/13#issuecomment-544716234

I also did a few plots for a much larger number of speakers here: https://github.com/resemble-ai/Resemblyzer/issues/13#issuecomment-544729472

"Default" is the model included in this repository and "768" is the model I trained. It looks like my EER was 0.00392 at 2.38M steps.

mbdash commented 4 years ago

Step 750k reached; we are still around 0.03 loss.

(training screenshot)

(UMAP plot: encoder_mdl_ls_cv_vctk_vc12_umap_757500)

mueller91 commented 4 years ago

Dear @mbdash, any updates? If you find the time to share the current model, it'd be much appreciated! :)

mbdash commented 4 years ago

(UMAP plot: encoder_mdl_ls_cv_vctk_vc12_umap_945000)

https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing

loss 0.015 to 0.027

cheers

mbdash commented 4 years ago

1,000,000 steps reached

loss 0.016 to 0.022

(training screenshot)

(UMAP plot: encoder_mdl_ls_cv_vctk_vc12_umap_1000100)