Closed ghost closed 3 years ago
And @sberryman is right! It looks like the synth needs to be retrained. I finetuned the 768/256 encoder to 2,144,100 steps (added 600 steps) and get garbage out when I try to synthesize text. This result makes sense in context of how the encoder is optimized; for a given utterance the loss function doesn't care about the specific values of the embedding as long as it is close to other embeds derived from the same speaker, and far from utterance embeds of other speakers.
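A toy illustration of the point above, assuming 2-D embeddings instead of the real 256-D ones and using a plain rotation as a stand-in for "finetuning": every pairwise cosine similarity is unchanged, so a similarity-based loss cannot tell the two encoders apart, yet the values the synth would receive are completely different.

```python
import math

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def rotate(v, angle):
    """Rotate a 2-D vector by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

# Toy embeddings (made up): two utterances from speaker A, one from speaker B.
embeds = {"A1": (1.0, 0.1), "A2": (0.9, 0.2), "B1": (0.1, 1.0)}

# Hypothetical "finetuned" encoder that rotates the embedding space by 90 degrees.
rotated = {name: rotate(v, math.pi / 2) for name, v in embeds.items()}

pairs = [("A1", "A2"), ("A1", "B1"), ("A2", "B1")]
sims_before = [cosine(embeds[a], embeds[b]) for a, b in pairs]
sims_after = [cosine(rotated[a], rotated[b]) for a, b in pairs]

for pair, before, after in zip(pairs, sims_before, sims_after):
    print(pair, round(before, 4), round(after, 4))
print("A1 before:", embeds["A1"], "A1 after:", rotated["A1"])
```

The similarities match pair by pair while the coordinates of `A1` do not, which is consistent with the synth producing garbage until it is retrained on the new embeddings.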
It looks like the synth needs to be retrained.
To make the synth more portable we could run the speaker embedding through a linear projection before the concat with the encoder output. Then, to make the synth compatible with a new encoder, we can use the same trick: set requires_grad=False on all model elements except the linear projection and train only that. After the loss comes down we can re-enable grad to finetune the whole synth.
Looks like I missed a lot but you are on the right track. Anything you modify upstream requires all downstream modules to be retrained. I don't think you are going to get lucky trying to hack existing weights into the stream. Will be interesting to hear if your linear layer idea worked though.
I did re-preprocess the dataset, which had broken log files, but I am currently still stuck here:
Exception: Can't create RandomCycler from an empty collection
That got me several times while testing out encoder training (to demonstrate https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-673341585 would work).
Basically the issue is that training is crashing on nearly empty folders (containing only _sources.txt), which can be dealt with easily in Linux:
(Updated command based on feedback below)
find datasets_root/SV2TTS/encoder -type f -name _sources.txt -empty -exec rm {} \; #delete empty _sources.txt files
find datasets_root/SV2TTS/encoder -type d -empty -exec rmdir {} \; #delete the containing folders
The 1st command returned nothing for me.
However, looking at it, I found a lot of CommonVoice folders with only a _sources.txt file.
The commands above should be run on the dataset prior to training, I guess.
Update: OK, I got lazy and basically used Bitvise SFTP to list the folders sorted by size and deleted any folder <= 34 KB. And the training started.
So basically, it is running on LibriSpeech, CommonVoice and VCTK, and VoxCeleb 1 & 2 are ready in another folder.
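For anyone who prefers not to sort folders by size, here is a minimal Python sketch of the same cleanup. It assumes each speaker is a direct subfolder of the encoder directory and that a usable speaker has at least one file besides `_sources.txt`; the function names are made up for this example. It defaults to a dry run so nothing is deleted by accident.

```python
import os
import shutil

def find_unusable_speaker_dirs(encoder_root):
    """Return speaker folders that hold no file other than _sources.txt."""
    unusable = []
    for entry in sorted(os.scandir(encoder_root), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        others = [f for f in os.listdir(entry.path) if f != "_sources.txt"]
        if not others:
            unusable.append(entry.path)
    return unusable

def remove_dirs(paths, dry_run=True):
    """Delete the given folders; with dry_run=True only print what would go."""
    for path in paths:
        if dry_run:
            print("would remove", path)
        else:
            shutil.rmtree(path)

# Example (path taken from the commands above):
# remove_dirs(find_unusable_speaker_dirs("datasets_root/SV2TTS/encoder"))
```

Unlike the `-empty` test in the first `find` command, this also catches folders whose `_sources.txt` is not empty but which contain no utterance files.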
I have a tutorial for you @mbdash (this was a good learning experience for me too).
1. In `encoder/params_model.py`, update `model_hidden_size = 768` if you haven't already.
2. Initialize a new model with the correct dimensions, save it after 1 step, then press ctrl+c to stop training: `python encoder_train.py new_model datasets_root/SV2TTS/encoder/ -b 1`
3. Verify that it generated the file `encoder/saved_models/new_model_backups/new_model_bak_000001.pt`
4. Download `english_run.pt`, the 768/768 English model trained to 2,143,500 steps, from https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-671675613
5. Save this gist to `transfer_encoder_weights.py`. PyTorch model checkpoints are simple dictionaries and it is trivial to make edits.
6. Put the files from steps 3-5 in the same location.
7. Run the script: `python transfer_encoder_weights.py`
8. It saves a file called `modified_encoder.pt`; move it to `encoder/saved_models`.
9. We only want to train the linear transformation at the end of the model that projects the final hidden layer (size 768) down to the desired embedding size (256), so we set `requires_grad=False` on the model elements that we don't want to update. See the modifications here: blue-fish@30a8c7b
10. Now train the modified model that we created: `python encoder_train.py modified_encoder datasets_root/SV2TTS/encoder/`
11. Let it run until you can go 1,000 steps without the loss spiking above 0.1. At this point we will know that the nn.Linear elements are properly set for the encoder weights that we imported.
12. Revert the changes from step 9 to re-enable grad on all model elements.
13. Continue training using the command in step 10.
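The gist and the commit are not reproduced in this thread, so the following is only a sketch of the two ideas in the tutorial, not the actual script. It is written against plain dictionaries and any objects with `.shape` / `.requires_grad` attributes so it runs without PyTorch; the function names, the `"model_state"` checkpoint key and the `"linear"` parameter prefix are assumptions that should be checked against the real checkpoint and model.

```python
def transfer_matching_weights(pretrained_state, fresh_state):
    """Copy entries from pretrained_state into a copy of fresh_state when the
    name exists in both and the shapes agree. Entries that do not match
    (e.g. a 768 -> 256 linear layer) keep their freshly initialized value."""
    merged = dict(fresh_state)
    kept_fresh = []
    for name, fresh_value in fresh_state.items():
        old_value = pretrained_state.get(name)
        if old_value is not None and tuple(old_value.shape) == tuple(fresh_value.shape):
            merged[name] = old_value
        else:
            kept_fresh.append(name)
    return merged, kept_fresh


def freeze_all_except(named_parameters, trainable_prefixes):
    """Set requires_grad=False on every parameter whose name does not start
    with one of trainable_prefixes. Returns the names left trainable."""
    trainable = []
    for name, param in named_parameters:
        param.requires_grad = name.startswith(tuple(trainable_prefixes))
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Hypothetical usage with PyTorch (key and prefix names are assumptions):
#   old = torch.load("english_run.pt", map_location="cpu")
#   new = torch.load("new_model_bak_000001.pt", map_location="cpu")
#   new["model_state"], kept = transfer_matching_weights(
#       old["model_state"], new["model_state"])
#   torch.save(new, "modified_encoder.pt")
#   ...
#   freeze_all_except(model.named_parameters(), ["linear"])
```

Passing an empty prefix list to `freeze_all_except` freezes everything, and calling it with `[""]` re-enables grad on all parameters, which corresponds to reverting step 9.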
Mkay... so I guess I need to look into this tomorrow; that is a lot of steps when tired.
LibriSpeech + CommonVoice + VCTK
With none of the instructions above.
Nice!! It took a while to resolve all the dataset issues but the model is finally training 😄
Please share some visdom screenshots from time to time: https://user-images.githubusercontent.com/324437/65079232-877add80-d953-11e9-921e-abe695803f53.png
If you want to try the tutorial to transfer weights from the 768/768 encoder, it can be done without the GPU. Just add the following to the top of encoder_train.py before you run it, to use the CPU.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
We will learn more from training from scratch than trying to transfer encoder weights. The tutorial is there if you want to learn how to transfer weights from a pretrained model to another model. It should take about 30 minutes to complete (plus compute time for training).
LibriSpeech, CommonVoice & VCTK only model_hidden_size = 768, 72k steps
RTVC_encoder_mdl_ls_cv_vctk_vc12 (currently 72k steps for LibriSpeech, CommonVoice & VCTK only) https://drive.google.com/drive/folders/1hg65MdHOA_b20RzF5roA2pnoFDZy4oWQ?usp=sharing
Edit: I used the flag --no_visdom since it was not working in the view.
Edit2: I will bring it to 100k before adding VoxCeleb1&2, unless you have a different opinion.
I will try your tutorial for weight transfer later, once we have a purely trained encoder based on the presets at the beginning of this thread.
i will bring it to 100k before adding VoxCeleb1&2, except if you have a different opinion.
It has figured out how to group utterances from the same speaker but not so much how to separate different speakers. I think you can let this go until 200-250k before adding VoxCeleb.
Also there is no need to upload any .pt files at this time, just the training .png files. Though the final .pt checkpoint before adding VoxCeleb would be a helpful data point.
Edit: What is the current loss value?
Thanks @mbdash . Still think it would be beneficial to run it to 200-250k steps before adding in VoxCeleb to get the loss down. It would also help answer whether a good encoder can be obtained without VoxCeleb since that's a monster of a dataset.
So let it run for another 2-3 days and add in VoxCeleb at some convenient time during that interval?
Thanks for the data point, I'll suggest adding in VoxCeleb any time after the loss is consistently less than 0.01. Please share a few UMAP plots tomorrow so we can see if there's improvement in cluster separation.
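"Consistently less than 0.01" is not defined precisely in this thread. One possible reading, sketched below, is that every loss in a sliding window of recent steps stays under the threshold. The class name and the default window of 1,000 steps are assumptions for illustration.

```python
from collections import deque

class LossGate:
    """Report when the last `window` recorded losses were all below `threshold`."""

    def __init__(self, threshold=0.01, window=1000):
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def update(self, loss):
        """Record one loss value; return True once the window is full and
        every value in it is below the threshold."""
        self.recent.append(loss)
        return (len(self.recent) == self.recent.maxlen
                and max(self.recent) < self.threshold)
```

Calling `gate.update(loss)` once per training step and checking the return value is enough; a single spike resets the condition until it leaves the window.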
@mbdash Looking good! You can add the VoxCeleb sets whenever convenient.
Alright, we are way below the 0.01 loss target; I am going to add VoxCeleb.
Here are the new numbers just after adding VoxCeleb.
Excellent! Could you share the 315k pre-VoxCeleb checkpoint? My hypothesis is that the (LibriSpeech+VCTK+CommonVoice) encoder should be good enough for utterances that are recorded under similar conditions. Adding VoxCeleb should make it perform better for voice recordings gathered in the wild.
While the speaker encoder should perform better for celebrities included in that dataset, unless the TTS is also trained on similar voices I don't think it will help voice cloning of celebrities.
encoder ls_cv_vctk_only 315k steps Loss < 0.005
https://drive.google.com/drive/folders/1OkHpeV3i5fGzI6shhjY3nkpN9jXGk7Ak?usp=sharing
Here is the link for the progression of ls_cv_vctk_315k_plus_vc12
https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing
For my personal project, if I wanted to clone my voice or a voice actor's, should I retrain only the synth, or both the encoder and synth (using only datasets from a single voice)?
@mbdash For single-speaker finetuning, you should only retrain the synth. In #437, to make things converge faster we bypass the encoder and always feed the same speaker embedding input to the synth. This means there is no benefit to encoder finetuning on a single voice; in fact it would actually be harmful and increase the amount of synth training needed.
In https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-670248226 I made the observation that there are diminishing returns from improving the encoder when a single-speaker model is the goal.
thank you for your feedback,
Here is the update on the encoder training, with VoxCeleb added. The checkpoint has been uploaded to the Google Drive. https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing
To get a better idea of the model performance, should the EER be calculated on the VoxCeleb "test" set, or at least on the whole training set?
Any hints on how to modify the code (train.py?) to do so? Thanks!
@mbdash I am training VCTK with your 315k encoder. Deliberately avoided VoxCeleb as it doesn't need to perform well on celebrities or speech data in the wild. Trying to compare results with the SV2TTS authors, so I left out p240 and p260 from the training set. I spent an hour manually curating VCTK, throwing out about half of the speakers for various reasons (no UK or Irish accent - trying to help with #388, excessive unrelated sounds like fabric rustling or deep breaths before speaking each time). I'm also removing punctuation from the transcripts as I don't have the compute power to train that aspect of it well.
For preprocessing and training I had to bring the batch size down to make it fit in my GPU's limited memory (4 GB), since the bigger encoder model is loaded in memory. There's an advantage to having a lightweight encoder for TTS.
@mueller91 wrote in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/126#issuecomment-604999848
Dear All, i've downloaded the models from @sberryman and adapted the hyper parameters accordingly. I created a few examples with them. I observe the following:
- the sound quality is pretty good (clearly understandable, no bleeps or blops etc.)
- the voice does not resemble the reference embedding. it's like a 'generic' voice.
I wonder why that is. Did anybody else experience this? Thanks!
Edit: For a while I thought I was also getting a "generic voice" using @mbdash 315k encoder trained on LibriSpeech, VCTK and CommonVoice. Please see https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-678747512 for the new results.
The parameters for this experiment are:
(If anyone's wondering why the original results were so bad, I accidentally used the hacked 768/256 encoder from https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-673341585 with a synth trained on embeds from mbdash 315k.)
Fixed experiment, please see below for the results or https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-678664495 for how it was conducted.
Please disregard what I said here initially, those conclusions are incorrect.
The previous data is bad because I used the wrong encoder for testing with @mbdash 315k model. Here is a fair comparison (old = Corentin's encoder, new = mbdash 315k encoder): wav_comparison.zip
Now we have a different and unexpected issue where the synthesized voices are very similar regardless of the encoder used! They are indistinguishable from my point of view. (You will notice that the synth trained with the old encoder has an annoying sound artifact at the end. It does not appear when Griffin-Lim is used to invert the spectrogram, so it is an artifact of the vocoder.)
When the encoder is better trained on VoxCeleb I will repeat this experiment to see if the similarity of the cloned voice improves. The authors of SV2TTS obtained better results on this metric. Which parts of our model need to improve to match their quality?
@ustraymond This is one way of doing it:
1. Make a folder `datasets_root/SV2TTS/encoder_test/` and move some folders over from encoder to make a test set.
2. Modify [speaker_verification_dataset.py](https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/a32962bb7b4827660646ac6dabf62309aea08a91/encoder/data_objects/speaker_verification_dataset.py#L11) to take 2 paths, one for training and one for test. Modify the DataLoader in the same file to return a training batch and a test batch.
3. In encoder/train.py, you would run the forward/backward pass as normal on the training batch, and then follow that up with a forward pass on the test set to get loss and EER for display purposes. I think the test part should be wrapped with `with torch.no_grad()` and you might need to set `model.eval()` too.
This would slow down training considerably if performed every step, so you might want to run this evaluation every 10 or 100 steps.
In the thesis: "In fact, we computed the test set EER to be 4.5%." and "We refer to GE2E and use 6 utterances for enrollment and compare those to 7 utterances."
It seems there is no code here to do such a calculation.
So I was trying to calculate the EER for the whole test set (40 speakers with 4000+ utterances), where different speakers have different numbers of utterances.
Assuming I calculate the embedding of every utterance, what is the next step?
Repeat the steps in https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder/model.py#L80 ?
Assume each speaker has at least 13 utterances?
Get the centroid of each speaker based on 6 utterances (centroids_incl)?
For the other 7+ utterances, find the distance to each centroid in centroids_incl, and use the closest centroid as the predicted label?
Then reuse these lines in model.py?
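from sklearn.metrics import roc_curve
from scipy.optimize import brentq
from scipy.interpolate import interp1d
# Snippet from https://yangcha.github.io/EER-ROC/
fpr, tpr, thresholds = roc_curve(labels.flatten(), preds.flatten())
eer = brentq(lambda x: 1. - x - interp1d(fpr, tpr)(x), 0., 1.)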
I wonder how the 4.5% EER was calculated.
Thanks for any advice.
@ustraymond Code for calculating EER is shared in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/61#issuecomment-514653922 and https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/126#issuecomment-530960239 . The code is identical but the context for discussion is slightly different.
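Independently of the linked code, the enrollment-style evaluation described in the question can be sketched in plain Python. This is one possible protocol under stated assumptions, not necessarily what the thesis did: the first `n_enroll` embeddings of each speaker form a centroid, every remaining embedding is scored against every centroid by cosine similarity, and the EER is read off a simple threshold sweep. All function names are made up for this example.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """L2-normalized mean of a list of embeddings."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return l2_normalize(mean)

def cosine(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def verification_trials(embeds_by_speaker, n_enroll=6):
    """Score each held-out utterance against each speaker centroid.
    Returns (scores, labels) where label 1 means same speaker."""
    centroids = {s: centroid(e[:n_enroll]) for s, e in embeds_by_speaker.items()}
    scores, labels = [], []
    for speaker, embeds in embeds_by_speaker.items():
        for test_embed in embeds[n_enroll:]:
            for other, c in centroids.items():
                scores.append(cosine(test_embed, c))
                labels.append(1 if other == speaker else 0)
    return scores, labels

def equal_error_rate(scores, labels):
    """Sweep thresholds over the observed scores and return the average of
    the false-accept and false-reject rates where the two are closest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_gap, eer = None, None
    for t in sorted(set(scores)):
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        far, frr = false_accepts / n_neg, false_rejects / n_pos
        if best_gap is None or abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

The threshold sweep is quadratic in the number of trials, which is fine for a sketch; for a full test set the `roc_curve` approach from model.py is the faster choice for the final step.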
Dear @mbdash, thank you for providing the GPU and publishing the models. One curious observation, though: I use your model to embed a batch of utterances, and compute the inter- and intra-class cosine similarity (i.e. the cosine similarity for all pairs s_i, s_j where the speakers are different, or the same, respectively). I obtain a mean inter-similarity of around 0.45, and an intra-similarity of around 0.9.
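For reference, the measurement described above can be written in a few lines of plain Python. This is a sketch with a made-up function name, assuming the embeddings are lists of floats and `speakers` holds one label per embedding.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def intra_inter_similarity(embeds, speakers):
    """Mean cosine similarity over same-speaker pairs (intra) and
    different-speaker pairs (inter)."""
    intra, inter = [], []
    for i, j in combinations(range(len(embeds)), 2):
        sim = cosine(embeds[i], embeds[j])
        (intra if speakers[i] == speakers[j] else inter).append(sim)
    return sum(intra) / len(intra), sum(inter) / len(inter)
```

The batch needs at least two utterances from one speaker and at least two speakers, otherwise one of the two lists is empty and the mean is undefined.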
@mueller91 I am not the guy you are looking for, the wise guy with the answers is @blue-fish.
Note that there are 2 encoder models. Be sure you are downloading the proper one.
"LibriSpeech + CommonVoice + VCTK only, until step 315k": the 1st one was trained only on LibriSpeech + CommonVoice + VCTK until step 315k. Available here: https://drive.google.com/drive/folders/1OkHpeV3i5fGzI6shhjY3nkpN9jXGk7Ak?usp=sharing
"LibriSpeech + CommonVoice + VCTK until step 315k + VoxCeleb1&2": the 2nd encoder model is the encoder above with VoxCeleb 1 & 2 added to the dataset after step 315k. (I stopped the training at 315k, added more datasets (VoxCeleb), and resumed training at step 315k.) I am constantly (daily) updating this model. Available here: https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing
The latest uploaded is 525k steps. Currently I am locally at step 531k.
If I were you, I would wait until I reach 750k before doing anything with this encoder; I currently see the loss bouncing between 0.026 and 0.04 nonstop.
@mueller91 I think your observations are explained by our continued use of ReLU (which is not a deliberate choice, we just used the repo code without modification). @sberryman removed ReLU in https://github.com/resemble-ai/Resemblyzer/issues/13 which causes inter-similarity to be centered around zero instead of 0.5 as in our case.
We will continue using ReLU as long as the encoder model needs to support Corentin's pretrained encoder which also uses it.
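A quick way to see the effect of a final ReLU on inter-similarity, using random vectors as a stand-in for embeddings of unrelated speakers (an assumption; real embeddings are not random, so only the direction of the effect carries over, not the exact values):

```python
import math
import random
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_cosine(vectors):
    sims = [cosine(a, b) for a, b in combinations(vectors, 2)]
    return sum(sims) / len(sims)

rng = random.Random(0)
# 20 random 256-D vectors: stand-ins for pre-activation outputs of unrelated speakers.
raw = [[rng.gauss(0.0, 1.0) for _ in range(256)] for _ in range(20)]
# The same vectors after ReLU: every component is now >= 0.
relu = [[max(0.0, x) for x in v] for v in raw]

mean_raw = mean_pairwise_cosine(raw)
mean_relu = mean_pairwise_cosine(relu)
print("no activation:", round(mean_raw, 3))
print("after ReLU:   ", round(mean_relu, 3))
```

Because ReLU outputs have no negative components, no pair can have a negative dot product, so the mean similarity between unrelated vectors is pushed well above zero. Without the activation (or with one that allows negative values) it stays near zero.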
@blue-fish how does @mbdash's model work with the existing synth/vocoder? I would assume not very well and producing a generic voice?
If that is the case, @mbdash should stop training and start over with Tanh as the final activation. Then you'll use @mbdash's new model to train the synthesizer and vocoder from scratch (several weeks worth of GPU time.)
Replicating Corentin's work training from scratch would likely require well over 700 hours of training using two 1080 TI's. They are using much larger GPUs (40GB of memory I believe) at Resemble.ai to train more quickly. The 700 hours is a very rough estimate to illustrate the several weeks of training for each of the three models.
@sberryman Could you elaborate on why you chose Tanh as the final activation?
@sberryman When I train a VCTK-based synthesizer to 100k steps on my basic GPU, I get a very similar result for voice cloning regardless of whether I use Corentin's model or @mbdash 315k model (LS+VCTK+CV) as the speaker encoder for training and inference. See: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-678747512
That result was completely unexpected and maybe I should delete my pycache just to be very sure that I performed the experiment properly. But because of the different hidden unit size it is impossible to use the wrong encoder with the wrong synthesizer.
Also, the pretrained encoder bundled with Resemblyzer is identical to the one in this repo, which I recall took ~20 days to train on a single 1080 TI.
@mueller91 I used tanh as the final activation to force the values between -1 and 1 as opposed to 0-1 with ReLU. I never tried training with no final activation so I'm not sure how that would turn out to be honest.
https://github.com/resemble-ai/Resemblyzer/issues/13#issuecomment-557269666
Edit: I didn't do a good job documenting and checking in code throughout all the experiments so there is a chance the model I trained for 1M+ steps didn't have a final activation. @blue-fish probably knows that better than me at this point. There is a chance tanh was only used on an experiment when I was trying to build an encoder model based on raw waveform
Edit 2: Corentin doesn't think it makes much of a difference though. https://github.com/resemble-ai/Resemblyzer/issues/15#issuecomment-555097662
Edit 3: This implementation doesn't use an activation function after the LSTM. https://github.com/HarryVolek/PyTorch_Speaker_Verification/blob/master/speech_embedder_net.py
Edit: I didn't do a good job documenting and checking in code throughout all the experiments so there is a chance the model I trained for 1M+ steps didn't have a final activation. @blue-fish probably knows that better than me at this point.
There are no states associated with the activation, so it's not possible to tell just by looking at the checkpoint file. When I hacked the final linear layer in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-673341585 I noticed the loss came down very quickly using ReLU. The loss was already down to 0.01 within 10 steps of restarting training. So if I had to guess, that particular 768/768 English encoder was likely using ReLU.
@sberryman Thanks for digging up and sharing those additional links.
At step 600k, the loss is still moving between 0.015 and 0.04, but it is hitting the lower values more often.
https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing
@mbdash training is looking good, still some overlap on clusters. Have you tried to plot cross-similarity matrices?
https://github.com/resemble-ai/Resemblyzer/issues/13#issuecomment-544716234
I also did a few plots for a much larger number of speakers here: https://github.com/resemble-ai/Resemblyzer/issues/13#issuecomment-544729472
Default is the model included in this repository and 768 was the model I trained. It looks like my EER was 0.00392 at 2.38M steps.
Step 750k reached; we are still around 0.03 loss.
Dear @mbdash , any updates? If you find the time to share the current model, it'd be much appreciated! :)
https://drive.google.com/drive/folders/1QBeC8_PKFn-ZpZzsWY3vIFZKd4dk3g9O?usp=sharing
loss 0.015 to 0.027
cheers
1,000,000 steps reached
loss 0.016 to 0.022
In #126 it is mentioned that most of the ability to clone voices lies in the encoder. @mbdash is contributing a GPU to help train a better encoder model.
Instructions
1. Datasets (paths under <datasets_root>): LibriSpeech (`LibriSpeech/train-other-500`), VoxCeleb1 (`VoxCeleb1/wav` and `VoxCeleb1/vox1_meta.csv`), VoxCeleb2 (`VoxCeleb2/dev`)
2. Change `model_hidden_size` to 768 in encoder/params_model.py
3. `python encoder_preprocess.py <datasets_root>`
4. Start `visdom`
5. `python encoder_train.py new_model_name <datasets_root>/SV2TTS/encoder`