@sberryman I'm training a synthesizer on LibriSpeech with the unmodified code, starting from scratch, and after 10k steps I got a loss of around 10. Yours seems to be around 0.8, is that right?
Hi @railsloes,
Based on the graphs above, you are correct that the loss was around 0.8 by 10k steps. I used LibriTTS, not the LibriSpeech dataset CorentinJ trained on. The main difference is that the audio sample rate in LibriTTS is 24 kHz (along with a few other differences).
I was NOT able to obtain alignment using the LibriTTS dataset while training the synthesizer.
@sberryman Thank you very much!! It was my fault. I was training a modified version with a bug in the gradients.
Hi guys, I'm trying to train the encoder on a Mandarin dataset and have run into a problem. Can you take a look at this? https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/192
@shawwn - Have you tried the models yet? I was just doing some testing and every voice I tried to clone sounded the same. Wondering if you experienced the same? (They all sounded robotic and female)
My assumption is the synthesizer and vocoder didn't train properly as I'm able to cluster voices using the encoder.
@sberryman Hi, did you solve the problem of your vocoder model making all the cloned voices sound female? I have run into the same problem.
@WenjianDing I have not attempted to train the vocoder and synthesizer again so I have not solved the problem with all the voices sounding the same/female. I'm more focused on the encoder for speaker diarization.
@shawwn I've uploaded the models to my dropbox. The vocoder is still training and will be for another 24-48 hours. Please share whatever you end up making with them!
Encoder
https://www.dropbox.com/s/xl2wr13nza10850/encoder.zip?dl=0
Synthesizer (Tacotron)
https://www.dropbox.com/s/t7qk0aecpps7842/tacotron.zip?dl=0
Vocoder
@sberryman thanks a lot for the models. Could you share the respective parameter settings as well? I mean the following three files: encoder\params_model.py, synthesizer\hparams.py, vocoder\hparams.py. You mentioned some of the parameters in the thread, but it's not clear which of them have to be applied when using these models.
@sberryman Can you share some images generated by the Resemblyzer tools, like the following? I downloaded the pretrained model offered by CorentinJ and fine-tuned it on a Chinese corpus (5,000 speakers) with lr=0.00001, but the embedding does not look very good even though the loss gets down to 0.005. Can you share your results for reference?
@Liujingxiu23 Apologies for the delayed response, I was out of town. The plots look okay to me. How many additional steps did you fine-tune? It looks like it could be trained longer. Also take a look at this issue, as I made a ton of comments and posted a lot of plots there: https://github.com/resemble-ai/Resemblyzer/issues/13
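For anyone who wants to reproduce that kind of projection plot, here is a minimal sketch using Resemblyzer plus umap-learn. The VoiceEncoder/preprocess_wav calls come from those packages; the utterance_paths_by_speaker mapping and file names are purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from umap import UMAP
from resemblyzer import VoiceEncoder, preprocess_wav

# Illustrative input: a mapping from speaker name to a few utterance wav paths.
utterance_paths_by_speaker = {
    "speaker_a": ["speaker_a_01.wav", "speaker_a_02.wav"],
    "speaker_b": ["speaker_b_01.wav", "speaker_b_02.wav"],
}

encoder = VoiceEncoder()
embeds, labels = [], []
for speaker, fpaths in utterance_paths_by_speaker.items():
    for fpath in fpaths:
        embeds.append(encoder.embed_utterance(preprocess_wav(fpath)))
        labels.append(speaker)

# Project the embeddings to 2-D and color by speaker; tight, well-separated
# clusters are the sign of a well-trained encoder.
projs = UMAP().fit_transform(np.array(embeds))
labels = np.array(labels)
for speaker in sorted(set(labels)):
    mask = labels == speaker
    plt.scatter(projs[mask, 0], projs[mask, 1], label=speaker, s=10)
plt.legend()
plt.show()
```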
@sberryman Thank you very much for your reply. The similarity image may be somewhat wrong. I fine-tuned the model again with a Chinese corpus (9,000 speakers) and tested it at different steps. Though the similarity values get better as the steps increase, progress is really slow.
Utterance
Same - Median: 0.757 (1.5675M), 0.768 (1.8075M), 0.712 (2.07M), 0.701 (2.3625M)
Different - Median: 0.916 (1.5675M), 0.910 (1.8075M), 0.921 (2.07M), 0.920 (2.3625M)
Speaker
Same - Median: 0.823 (1.5675M), 0.832 (1.8075M), 0.784 (2.07M), 0.770 (2.3625M)
Different - Median: 0.979 (1.5675M), 0.980 (1.8075M), 0.976 (2.07M), 0.976 (2.3625M)
I guess training from scratch may be a better choice.
You have had so much discussion at https://github.com/resemble-ai/Resemblyzer/issues/13. I am trying to train a new model like yours, and the loss and error decrease much faster. Thank you so much!
By the way, since the SV2TTS paper uses 18k speakers and you have more speakers than that, I guess you may get a good encoder? Have you gotten any good end-to-end results, i.e. synthesized wavs of unseen speakers?
@Liujingxiu23 So glad to hear that the Resemblyzer thread has helped you! @CorentinJ has been incredibly helpful answering my questions.
There have been quite a few people asking for a Chinese embedding, if you are able to post a link to your trained model I'm sure it would be helpful to quite a few people.
I have been experimenting with a completely new model for the embedding and am making a lot of progress. I've been using quite a few languages in the same model (including Chinese from http://openslr.org/82/, but that is only 1,000 speakers). Right now I'm training with 37,606 speakers, of which a little more than 50% are English. Is the 9,000-speaker Chinese dataset you are using available for download? I'm always trying to add more speakers from different languages.
I have been focused on the speaker embedding, not the full pipeline; I've only attempted to train the vocoder and synthesizer once and didn't have the best success.
@sberryman About 2,600 speakers can be downloaded from http://openslr.org/resources.php; you can use the keyword "Chinese" to find them, for example SLR38 and SLR68. I am sorry I cannot share the other datasets or the model. And I did not use SLR82; I can't remember the exact reason, maybe the wavs in that dataset have loud background music.
Training the encoder model is so time consuming that I couldn't wait and trained a synthesizer model when the encoder loss was about 0.01, but the result is not good. Maybe I should wait a few more days.
@Liujingxiu23 I've already included SLR68 in my training dataset, as well as SLR82. You are correct about SLR82: the audio has a lot of background noise (music, sound effects, other people talking, etc.). From what I remember, based on conversations with CorentinJ and the paper, background audio is not bad for encoder training. In fact it helps, as the encoder learns to focus on the spoken audio.
I completely agree; I've probably spent well over 90 days on various experiments related to encoder training. Right now I'm focusing on adding random noise to the speakers to make a more robust encoder. Training has been very tricky though.
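To make the idea concrete, waveform-level noise augmentation can be as simple as mixing scaled white noise into each utterance before computing its mel spectrogram. A minimal illustration (my own sketch, not the actual augmentation code used here):

```python
import numpy as np

def augment_with_noise(wav: np.ndarray, snr_db: float = 15.0) -> np.ndarray:
    """Mix white noise into a waveform at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=wav.shape)
    return wav + noise
```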
@sberryman Have you tried other features or other versions of SV?
For the features, the SV paper says they use "40-dimension log-mel-filterbank energies as the features for each frame". This may differ from the features we use now, and I cannot judge how much influence that difference has.
For the SV, I am trying to run https://github.com/Janghyun1230/Speaker_Verification/. I am trying different learning rates now. However, the learning is also very slow; it seems hopeless and not helpful to me at the moment.
@Liujingxiu23 I have not tried adjusting the mel spectrogram features. Personally, I have a feeling using features (spectrogram) as input to the model can be avoided... At least for my use case.
My motivations for the speaker encoder are not in line with replicating voices; I'm more interested in using the embedding for speaker diarization.
Right now I'm training a completely new model on English using 22,553 speakers.
Quick question. If I want this speech generation to work well on a specific person, do I need to train it on a bunch of their annotated speech? Or can I just use the pretrained weights with a bunch of their recordings? Or am I looking for a completely different network architecture?
@sberryman I'd be interested in hearing more samples.
In my experience, target=16000 overlap=800 produces high quality pop-free audio. I used it to make Dr. Kleiner sing: https://www.reddit.com/r/HalfLife/comments/d2rzf0/deepfaked_dr_kleiner_sings_i_am_the_very_model_of/
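For context, target and overlap are (if I recall the interface correctly) arguments to the batched WaveRNN generation in vocoder/inference.py: target is the number of samples generated per fold and overlap is the cross-faded region between folds. A rough usage sketch, with the signature recalled from memory and the paths/mel file as placeholders:

```python
import numpy as np
from vocoder import inference as vocoder

vocoder.load_model("vocoder/saved_models/pretrained/pretrained.pt")  # placeholder path

# A synthesizer-produced mel spectrogram, e.g. one previously saved to disk.
mel = np.load("example_mel.npy")  # placeholder file

# Larger target/overlap values reduce audible pops at fold boundaries,
# at the cost of memory and generation speed.
wav = vocoder.infer_waveform(mel, target=16000, overlap=800)
```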
How did you train this exactly? I am interested in doing something similar.
@sberryman I downloaded your models but I get this error when trying to load them. I tried using your fork but it still doesn't load. Any recommendations? Thanks
Default values:
model_hidden_size = 256
model_embedding_size = 256
Changed to:
model_hidden_size = 768
model_embedding_size = 768
@LordBaaa, did you adjust encoder/params_model.py? It is complaining about a mismatch between the dimensions the model expects and what is in the weights file.
Well, like I said, I changed model_hidden_size and model_embedding_size in encoder/params_model.py to 768, like you had mentioned in the past (I read through all the old comments). Should I change anything else? What am I missing?
@sberryman (I didn't put @ at the front last time, so I figured maybe you didn't see my response.) Also, I have been working on training a synth model. I notice the synth and vocoder are the least trained of the pretrained models. I have been gathering and preprocessing datasets, and I want to add in some other languages from Common Voice too. Sometimes I need to stop the preprocessing, but when I do it has to start from scratch next time. Is there a function I don't know about to pause/save progress, or do I need to look into adding one? Thanks
@LordBaaa I apologize, I did miss your last response. As far as testing the encoder, that should be all you would need to change. I've never used any of CorentinJ's GUI features so I'm not sure how my encoder model will work there.
I did train a synthesizer and vocoder model based on my encoder weights but it didn't work out very well. My focus has been on the encoder only, unfortunately I am not much help on the other two components.
I've also used the entire Common Voice dataset as part of training (every language available) which is a great resource. I think I've managed to compile various datasets to get my unique speaker count up to just over 39,000.
I also had a lot of conversations with CorentinJ over on his Resemblyzer repository for Resemble AI. https://github.com/resemble-ai/Resemblyzer/issues/13
There is a feature to "resume" the pre-processing step. If you look in the encoder_preprocess.py file you'll see a command line argument called --skip_existing which will skip over files that have already been processed.
https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder_preprocess.py#L37-L39
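For example (assuming the standard CLI, where the dataset root is the positional argument), resuming looks roughly like: python encoder_preprocess.py <datasets_root> --skip_existing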
@sberryman Thanks, yeah, --skip_existing is in the synthesizer preprocess as well. Functionally, though, I think it still does the work and then checks if the output exists. I was looking at how some of the code works (I didn't get super deep, so I may be wrong), but mainly I say this because it continues to use up CPU on, say, file 1/900 even though I know it already exists. Plus it would still be nice to be able to pause it, as opposed to having to wait or completely close the window.
@LordBaaa The skip-existing code works; I used it many, MANY times. Look at these two lines: they immediately precede the preprocess_wav function call which processes each source file.
https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder/preprocess.py#L93-L94
There is a ton of room for improving the experience, but this is research code. And quite frankly, it is one of the easiest-to-read and best-documented research projects I've come across.
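Paraphrasing the spirit of the check linked above (not the verbatim source), each utterance is skipped when its preprocessed output is already on disk:

```python
from pathlib import Path

def should_skip(out_fpath: Path, skip_existing: bool) -> bool:
    # If --skip_existing was passed and the preprocessed .npy for this
    # utterance already exists, don't recompute its mel spectrogram.
    return skip_existing and out_fpath.exists()
```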
This is an amazing effort by you guys. Thank you for all the assistance.
I am trying to get the latest models from sberryman working but get the following output:
Arguments:
    enc_model_fpath: encoder/saved_models/pretrained.pt
    syn_model_dir: synthesizer/saved_models/logs-pretrained
    voc_model_fpath: vocoder/saved_models/pretrained/pretrained.pt
    low_mem: False
    no_sound: False
Running a test of your configuration...
Found 1 GPUs available. Using GPU 0 (GeForce GTX 980M) of compute capability 5.2 with 4.2Gb total memory.
Preparing the encoder, the synthesizer and the vocoder...
Loaded encoder "pretrained.pt" trained to step 2152001
Found synthesizer "pretrained" trained to step 324000
Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at vocoder/saved_models/pretrained/pretrained.pt
Traceback (most recent call last):
File "demo_cli.py", line 63, in
Any thoughts on this?
@bmccallister Yeah, I don't know at this point. If you see my issue, it's very similar but with a different torch.Size mismatch. This is basically what I have been facing.
It appears to me that the size of the model used in the synthesizer does not match what is expected by the vocoder.
I had to make assumptions with the vocoder, as there was no checkpoint file in the tacotron pretrained zip file, so I copied one I had to get past that part of the script.
Have you or anyone else been able to build any other models and publish them anywhere?
I'll also say I searched every parameter I could for the value 25, to see if I could change it to 17, and nothing worked. This is why I think it must be an issue with the relationship to one of his other supplied pretrained models.
@bmccallister #257 I haven't tried them myself though.
I don’t think I ever released my synthesizer or vocoder models which were trained on my encoder. They were so poor that I trashed them. Maybe I did and forgot but I wouldn’t recommend using them if I did.
@LordBaaa, @bmccallister this is what I changed when trying all three trained models Shaun shared in the thread. Resulting quality was poor indeed. Please try to change models' settings as follows:
encoder/params_model.py
model_hidden_size = 256
model_embedding_size = 768
synthesizer/hparams.py
speaker_embedding_size=768
vocoder/hparams.py
Just add these lines at the end of the file
n_fft=2048
hop_size=300
win_size=1200
sample_rate=24000
speaker_embedding_size=768
voc_upsample_factors=(5, 5, 12)
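One sanity check worth noting on the settings above (my observation, not from the thread): WaveRNN upsamples each mel frame back into hop_size audio samples, so the product of the upsample factors has to equal the hop size.

```python
import math

hop_size = 300
voc_upsample_factors = (5, 5, 12)

# 5 * 5 * 12 = 300, matching hop_size; mismatched values would break the vocoder.
assert math.prod(voc_upsample_factors) == hop_size
```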
Thank you for this info! So you said that after these parameter changes the quality was bad? Worse than the .pt models provided by Corentin?
Did you have any success finding a model that worked better than Corentin's?
I’m sure we will figure this out!! :)
I don’t think I ever released my synthesizer or vocoder models which were trained on my encoder. They were so poor that I trashed them. Maybe I did and forgot but I wouldn’t recommend using them if I did.
Thank you for your response!
So should we not bother using the other samples you provided up thread? (It looks like you deleted the comment with the links, but natravedova quoted you, so your links can still be found up thread.)
Are the encoder, synth and vocoder not linked, and do the models not need to be trained sequentially?
May I ask, sberryman, what your highest level of success has been, and whether you have tips for repeating it? Or perhaps we could start a repo to host the pretrained models we have all worked on?
@bmccallister I was extremely happy with the encoder model I trained. Although if I were to retrain a new model from scratch, I would use 256 as the embedding dimension and leave 768 hidden units. I would have also replaced the ReLU activation with Tanh, as Corentin mentioned in this thread or the one on Resemblyzer.
They are linked. If you make any changes to the encoder you need to re-train everything downstream.
Since my focus was never to recreate a voice I never spent much time on the synthesizer or vocoder. If I were to attempt multispeaker synthesis right now, I would be using mellotron from nvidia as my base. https://github.com/NVIDIA/mellotron
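For concreteness, the activation swap mentioned above amounts to a one-line change in the encoder's final projection. A sketch of that head with the suggested dimensions (the exact attribute names in encoder/model.py may differ):

```python
import torch
from torch import nn

class EmbeddingHead(nn.Module):
    """Final projection of the speaker encoder, with Tanh instead of ReLU."""

    def __init__(self, hidden_size: int = 768, embedding_size: int = 256):
        super().__init__()
        self.linear = nn.Linear(hidden_size, embedding_size)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # Tanh bounds the raw embedding to [-1, 1] before L2 normalization,
        # replacing the ReLU used in the original model.
        embeds_raw = torch.tanh(self.linear(last_hidden))
        return embeds_raw / torch.norm(embeds_raw, dim=1, keepdim=True)
```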
Thank you for this info! So you said that after these parameter changes the quality was bad? Worse than the .pt models provided by Corentin?
It was worse than the default .pt models. All voices sounded very similar; there was no difference between male and female voices. Though there is a chance that I did something wrong.
Did you have any success finding a model that worked better than Corentin's?
Unfortunately not.
@sberryman Hello, my name is Dinesh. I plan to generate English audio but with an Indian accent, so I started training the model from scratch, beginning with the encoder. The encoder is doing well, but I'm stuck on the synthesizer as I don't have time-aligned transcripts of the audio files. So I thought I could download the pretrained synthesizer and pretrained vocoder and generate audio. It did generate audio from a sample voice, but it still has an American accent. On reading CorentinJ's thesis more carefully, I learned that WaveNet is responsible for the naturalness of the generated voice. So now I'm planning to train only the vocoder, on mel spectrograms generated from the downloaded pretrained synthesizer. Do you think this will work? And if it does, how should I proceed? I would really appreciate any insight on how to tackle this problem.
@gdineshk6174 Hi Dinesh!
I'm not an expert and I failed to generate good synthesizer and vocoder models, so please don't take anything I say as fact. You should be able to use the pretrained encoder and fine-tune it on your Indian-accent dataset (it most likely won't require much fine-tuning, and may not require any). Once the encoder is producing tight, easily distinguishable clusters for each speaker, you can move on to the synthesizer. The most important thing from what I've read about the synthesizer/vocoder is to have clean audio, meaning you don't want background noise. You'll also want quite a bit of training data; this is usually the hardest part.
I never thought about skipping the encoder and synthesizer and jumping straight to the vocoder using the pre-trained models. You can try it and see how it performs; it would be interesting if it works and produces high-quality speech. Hopefully you have plenty of GPUs available and lots of time, as training and running experiments takes quite a while.
@sberryman - thank you again for all the help and responses in this thread. Really nice of you to take the time.
I've read through a good portion of https://puu.sh/DHgBg.pdf to try to understand how all this works.
It does appear that the encoder creates the embedding, the synthesizer uses this to build the spectrogram and the vocoder outputs the waveform.
It occurs to me that these processes are sequential and linked. Would it be possible to start with your heavily trained encoder, and then hook it up to arbitrary datasets for the synthesizer and vocoder?
I.e., can I start the process with your pretrained encoder and then move on to the synth and vocoder after?
My goal is to produce multispeaker (single speaker is honestly OK) English with no accent at all. It seems like that should be relatively simple, but I keep running into issues combining pretrained models (size/scale mismatches, etc.).
I've also looked at NVIDIA's Mellotron, but when I started working on getting that project to run, I had some Python version mismatches which made me afraid I might never get the Corentin project to run again if I messed with it :)
@sberryman Hi, you trained the encoder module for the speaker verification task. Have you benchmarked your model on any dataset? If you have, could you share your benchmark results and the dataset used for benchmarking? I have benchmarked the pre-trained model on the VoxCeleb1 dataset and the results are not looking good: I am getting an EER of 8%.
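For anyone wanting to compare EER numbers, a common way to compute the metric from pairwise similarity scores and same/different-speaker labels is sketched below (a generic recipe, not necessarily the procedure used above):

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate: labels are 1 for same-speaker pairs, 0 otherwise."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where FPR ~= FNR
    return float((fpr[idx] + fnr[idx]) / 2)
```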
Dear all, I've downloaded the models from @sberryman and adapted the hyperparameters accordingly. I created a few examples with them. I observe the following: 1) the sound quality is pretty good (clearly understandable, no bleeps or blops, etc.); 2) the voice does not resemble the reference embedding, it's like a 'generic' voice.
I wonder why that is. Did anybody else experience this? Thanks!
Encoder: trained 1.56M steps (20 days with a single GPU) with a batch size of 64
Synthesizer: trained 256k steps (1 week with 4 GPUs) with a batch size of 144
Vocoder: trained 428k steps (4 days with a single GPU) with a batch size of 100
I am trying to squeeze just a little more quality out of Corentin's pretrained models by continuing to train the vocoder while leaving the other models unchanged. This also seems like a reasonable place to start as I still have much to learn. Has anyone else tried this?
My GPU only has 4 GB, so I reduced the batch size from 100 to 50 to make it fit. I am otherwise using default parameters and the same training set as in the wiki. Loss is slowly but steadily decreasing, from 3.682 to 3.677 after 10 epochs. I'll continue training and see if the results are noticeably better.
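For reference, that is a one-line change in the vocoder hyperparameters (assuming the variable is called voc_batch_size in vocoder/hparams.py; check your version):

```python
# vocoder/hparams.py
voc_batch_size = 50  # reduced from the default 100 to fit a 4 GB GPU
```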
Hi @blue-fish, I think the vocoder is actually the strongest part; the synthesizer is what makes or breaks the model. If you look at the mfccs, you will notice that they are quite weird sometimes, for example they contain large pauses. If you want to improve the model, train a new synthesizer and possibly a new encoder. I would suggest using Mozilla's TTS as a baseline, as the code here is outdated. Also, use LibriTTS.
I've added another 600k steps to the pretrained vocoder. Loss started at 3.682 and is currently at 3.647. Though I hear an improvement in the samples produced during training, voice cloning results are unchanged. Is there a procedure to benchmark performance?
Hey @blue-fish, do you plan to share your models, and if so, could I get them? Even if they are not finished training, I'd be curious to hear the difference. Thanks. P.S. I am unaware of a benchmark procedure.
Here are some samples @LordBaaa , can you hear the difference? I also provide a download link for the in-work model. No changes to hparams are needed to use it.
Samples: wavs.zip Model: https://www.dropbox.com/s/2skjbec4d67q3zo/vocoder_1159k.pt?dl=0
@blue-fish awesome, thanks! It's subtle, but yes, I can hear a difference. Listening to both the 428k and 1159k versions, I feel like I hear a slight amount of "background noise" in the 428k one, like when someone leaves their mic on continuous transmission and there is a little bit of ambient noise. I hear it particularly on the male voice; the improvement makes it "cleaner". When the male voice stops in the 428k sample there is an audio pop/drop, and his noisiness is most noticeable on his last few words. In the 1159k sample the pop is gone, it is more continuous, and the background noise is reduced or absent. Again, as I say, it's very subtle, but it is better.
Thanks for the feedback @LordBaaa . I generated that sample five times on the 428k model trying to get that pop to go away, before I became convinced that it was a feature of the model.
Hello @sberryman! Could you provide pretrained weights from https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/126#issuecomment-532400349 for Mixed version?
@blue-fish The wavs that you shared sound good! Are the wavs just the output of the vocoder, or end-to-end results, i.e. using the encoder to predict the embedding and then the tacotron and vocoder models to synthesize?
@Liujingxiu23 They are end-to-end results where I replicate the audio samples of the SV2TTS paper: https://google.github.io/tacotron/publications/speaker_adaptation/
I use the reference audio from VCTK p240 and p260 to create the embedding and generate synthesized samples #0 and #1 using tacotron and the vocoder model.
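For readers who want to reproduce this kind of end-to-end run, the pipeline roughly follows demo_cli.py. A sketch from memory (paths, text, and exact signatures may differ between versions and forks):

```python
from pathlib import Path
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Load the three models (paths are placeholders).
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
synthesizer = Synthesizer(Path("synthesizer/saved_models/logs-pretrained/taco_pretrained"))
vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

# Embed a reference utterance, synthesize a mel spectrogram, then vocode it.
ref_wav = encoder.preprocess_wav("p240_reference.wav")  # placeholder reference audio
embed = encoder.embed_utterance(ref_wav)
specs = synthesizer.synthesize_spectrograms(["Some text to clone."], [embed])
generated_wav = vocoder.infer_waveform(specs[0])
```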
@Oktai15 I thought I had posted the links to the encoder for the mixed version. The tacotron and vocoder weights I trained are useless, but the encoder is quite good. https://www.dropbox.com/s/xl2wr13nza10850/encoder.zip?dl=0
Thanks for publishing the code and basic training instructions!
Environment
Datasets: (9,063 speakers)
I'm working on adding TEDLIUM_release-3, which would add 1,925 new speakers, and potentially SLR68, which would add 1,017 Chinese speakers but would require some cleanup as there is a lot of silence in the audio files.
Hyper Parameters: Left all parameters untouched.
Encoder training:
39,300 steps:
115,900 steps: (almost exactly 24 hours of training)
Typical step
Questions