CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training from scratch #126

Status: Closed (sberryman closed this issue 4 years ago)

sberryman commented 5 years ago

Thanks for publishing the code and basic training instructions!

Environment

Datasets: (9,063 speakers)

I'm working on adding TEDLIUM_release-3, which would add 1,925 new speakers, and potentially SLR68, which would add 1,017 Chinese speakers but would require some cleanup as there is a lot of silence in the audio files.

Hyperparameters: left all parameters untouched.

Encoder training:

39,300 steps: [training plot]

115,900 steps (almost exactly 24 hours of training): [training plot]

Typical step

Step 115950   Loss: 0.9941   EER: 0.0717   Step time:  mean:   889ms  std:  1320ms

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:  449ms   std: 1317ms
  Data to cuda (10/10):                            mean:    3ms   std:    0ms
  Forward pass (10/10):                            mean:    8ms   std:    2ms
  Loss (10/10):                                    mean:   67ms   std:    7ms
  Backward pass (10/10):                           mean:  237ms   std:   26ms
  Parameter update (10/10):                        mean:  118ms   std:    3ms
  Extras (visualizations, saving) (10/10):         mean:    6ms   std:   18ms

Questions

  1. Will adding an additional ~2,900 speakers make much of a difference for the encoder?
    1. Will adding the remaining LibriTTS datasets (train-clean-100, train-clean-360, dev-clean, dev-other) with 1,221 speakers have any adverse effects on training the synthesizer and vocoder?
  2. Does using different languages in the encoder help or hurt?
  3. Does my encoder training thus far look okay? It appears it will take roughly 7 days to train the encoder up to 846,000 steps.
  4. Can I train the encoder at 16,000 Hz while training the synthesizer and vocoder at 24,000 Hz? Or do I need to restart and train the encoder on 24,000 Hz mel spectrograms? (A short resampling sketch follows this list.)
  5. I've downloaded the source videos for TEDLIUM-3, so I can extract audio at up to 44,100 Hz, allowing me to expand the synthesizer and vocoder training dataset to TEDLIUM + LibriTTS at 24,000 Hz.
  6. Based on other issues I've read, it appears you would like to use fatchord's taco1 implementation. Would you advise going that route vs. NVIDIA's taco2 PyTorch implementation?
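
For reference on questions 4 and 5, a minimal resampling sketch (assuming librosa and soundfile are installed; the file paths are placeholders):

    # Resample a 44.1 kHz source extract to the two target rates.
    import librosa
    import soundfile as sf

    src_path = "tedlium_talk.wav"  # hypothetical 44.1 kHz extract

    # 16 kHz copy for the encoder
    wav_16k, _ = librosa.load(src_path, sr=16000)
    sf.write("tedlium_talk_16k.wav", wav_16k, 16000)

    # 24 kHz copy for a LibriTTS-style synthesizer/vocoder pipeline
    wav_24k, _ = librosa.load(src_path, sr=24000)
    sf.write("tedlium_talk_24k.wav", wav_24k, 24000)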
railsloes commented 5 years ago

@sberryman I'm training a synthesizer on LibriSpeech with the unmodified code from scratch, and after 10k steps the loss is around 10. Yours seems to be around 0.8, is that right?

sberryman commented 5 years ago

Hi @railsloes,

Based on the graphs above you are correct that the loss was around 0.8 by 10k steps. I used LibriTTS not the LibriSpeech dataset CorentinJ trained on. The difference being that the audio sample rate was 24 kHz in LibriTTS (along with a few other differences.)

I was NOT able to obtain alignment using the LibriTTS dataset while training the synthesizer.

railsloes commented 5 years ago

@sberryman Thank you very much!! It was my fault. I was training a modified version with a bug in the gradients.

geekboood commented 5 years ago

Hi guys, I'm trying to train the encoder on a Mandarin dataset and have run into a problem. Could you take a look at this? https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/192

WenjianDing commented 4 years ago

@shawwn - Have you tried the models yet? I was just doing some testing and every voice I tried to clone sounded the same. Wondering if you experienced the same? (They all sounded robotic and female)

My assumption is the synthesizer and vocoder didn't train properly as I'm able to cluster voices using the encoder.

@sberryman Hi, did you solve the problem where your vocoder model cloned all voices as female? I have run into the same problem.

sberryman commented 4 years ago

@WenjianDing I have not attempted to train the vocoder and synthesizer again so I have not solved the problem with all the voices sounding the same/female. I'm more focused on the encoder for speaker diarization.

natravedrova commented 4 years ago

@shawwn I've uploaded the models to my dropbox. The vocoder is still training and will be for another 24-48 hours. Please share whatever you end up making with them!

Encoder

https://www.dropbox.com/s/xl2wr13nza10850/encoder.zip?dl=0

Synthesizer (Tacotron)

https://www.dropbox.com/s/t7qk0aecpps7842/tacotron.zip?dl=0

Vocoder

https://www.dropbox.com/s/bgzeaid0nuh7val/vocoder.zip?dl=0

@sberryman thanks a lot for the models. Could you share the respective parameter settings as well? I mean the following three files: encoder/params_model.py, synthesizer/hparams.py, vocoder/hparams.py. You mentioned some of the parameters in the thread, but it's not clear which of them have to be applied when using these models.

Liujingxiu23 commented 4 years ago

@sberryman Can you share some images generated by the tool "Resemblyzer", like the following? I downloaded the pretrained model offered by CorentinJ and fine-tuned it with a Chinese corpus (5,000 speakers) with lr=0.00001, but the embedding does not seem very good even though the loss has dropped to 0.005. Can you share your results for reference?
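
For context, a minimal sketch of how such similarity plots can be produced with Resemblyzer (file paths are placeholders):

    # Embed a few utterances and plot their pairwise similarity.
    from pathlib import Path
    import numpy as np
    import matplotlib.pyplot as plt
    from resemblyzer import VoiceEncoder, preprocess_wav

    wav_paths = [Path("spk1_a.wav"), Path("spk1_b.wav"), Path("spk2_a.wav")]
    encoder = VoiceEncoder()
    embeds = np.array([encoder.embed_utterance(preprocess_wav(p)) for p in wav_paths])

    # Embeddings are L2-normalized, so the inner product is the cosine similarity.
    similarity = np.inner(embeds, embeds)
    plt.imshow(similarity, vmin=0, vmax=1, cmap="viridis")
    plt.colorbar()
    plt.title("Utterance similarity")
    plt.show()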

sberryman commented 4 years ago

@Liujingxiu23 Apologies for the delayed response, I was out of town, but the plots look okay to me. How many additional steps did you fine-tune? It looks like it could be trained longer. Also, take a look at this issue, as I made a ton of comments and posted a lot of plots: https://github.com/resemble-ai/Resemblyzer/issues/13

Liujingxiu23 commented 4 years ago

@sberryman Thank you very much for your reply. The similarity image may have been somewhat wrong. I fine-tuned the model again with a Chinese corpus (9,000 speakers) and tested at different steps. Though the similarity values get better as the steps increase, progress is really slow.

Utterance
  Same      - Median: 0.757 (1.5675M)  0.768 (1.8075M)  0.712 (2.07M)  0.701 (2.3625M)
  Different - Median: 0.916 (1.5675M)  0.910 (1.8075M)  0.921 (2.07M)  0.920 (2.3625M)

Speaker
  Same      - Median: 0.823 (1.5675M)  0.832 (1.8075M)  0.784 (2.07M)  0.770 (2.3625M)
  Different - Median: 0.979 (1.5675M)  0.980 (1.8075M)  0.976 (2.07M)  0.976 (2.3625M)

I guess training from scratch may be a better choice.

You have had so much discussion at https://github.com/resemble-ai/Resemblyzer/issues/13. I am trying to train a new model like yours; the loss and EER decrease much faster. Thank you so much!

By the way, since the SV2TTS paper uses 18k speakers and you have more speakers than that, I guess you may get a good encoder? Have you gotten any good end-to-end results, I mean synthesized wavs of unseen speakers?
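
For reference, a minimal sketch of how such same-speaker / different-speaker medians can be computed from utterance embeddings (the grouping depends on your evaluation setup; embeddings are assumed to be L2-normalized):

    # Median same-speaker vs different-speaker similarity for one checkpoint.
    # `embeds` maps a speaker id to a list of that speaker's utterance embeddings.
    import itertools
    import numpy as np

    def median_similarities(embeds):
        same, different = [], []
        for utts in embeds.values():
            for a, b in itertools.combinations(utts, 2):
                same.append(np.inner(a, b))
        for spk_a, spk_b in itertools.combinations(embeds, 2):
            for a in embeds[spk_a]:
                for b in embeds[spk_b]:
                    different.append(np.inner(a, b))
        return np.median(same), np.median(different)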

sberryman commented 4 years ago

@Liujingxiu23 So glad to hear that the Resemblyzer thread has helped you! @CorentinJ has been incredibly helpful answering my questions.

There have been quite a few people asking for a Chinese embedding, if you are able to post a link to your trained model I'm sure it would be helpful to quite a few people.

I have been experimenting with a completely new model for the embedding and am making a lot of progress. I've been using quite a few languages in the same model (including Chinese from http://openslr.org/82/, but that is only 1,000 speakers). Right now I'm training on 37,606 speakers, of which a little more than 50% are English. Is the 9,000-speaker Chinese dataset you are using available for download? I'm always trying to add more speakers from different languages.

I have been focused on the speaker embedding, not the full pipeline; I've only attempted to train the vocoder and synthesizer once and didn't have the best success.

Liujingxiu23 commented 4 years ago

@sberryman About 2,600 speakers can be downloaded from http://openslr.org/resources.php; you can use the keyword "Chinese" to find them, for example SLR38 and SLR68. I am sorry I cannot share the other datasets or the model. I did not use SLR82; I can't remember the exact reason, maybe the wavs in that dataset have loud background music.

Training the encoder model is so time consuming that I couldn't wait, and I trained a synthesizer model (when the loss was about 0.01), but the result is not good. Maybe I should wait a few more days.

sberryman commented 4 years ago

@Liujingxiu23 I've already included SLR68 in my training dataset as well as SLR82. You are correct on SLR82, the audio has a lot of background noise (music, sound effects, other people talking, etc.) From what I remember based on conversations with CorentinJ and the paper, background audio is not bad for the encoder training. In fact it helps the encoder as it learns to focus on the spoken audio.

I completely agree; I've probably spent well over 90 days on various experiments related to encoder training. Right now I'm focusing on adding random noise to the speakers to make a more robust encoder. Training has been very tricky, though.
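
The augmentation itself is simple; a minimal sketch of mixing noise into an utterance at a target SNR (white noise here for brevity, and the SNR value is just an example):

    import numpy as np

    def add_noise(wav, snr_db=15.0, rng=None):
        # Mix random noise into `wav` so the result has roughly `snr_db` dB SNR.
        rng = rng or np.random.default_rng()
        noise = rng.standard_normal(len(wav)).astype(wav.dtype)
        wav_power = np.mean(wav ** 2) + 1e-12
        noise_power = np.mean(noise ** 2) + 1e-12
        # Scale so that 10 * log10(wav_power / scaled_noise_power) == snr_db.
        scale = np.sqrt(wav_power / (noise_power * 10 ** (snr_db / 10)))
        return wav + scale * noise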

Liujingxiu23 commented 4 years ago

@sberryman Have you tried other features or other versions of SV?

For the features, the SV paper says they use "40-dimension log-mel-filterbank energies as the features for each frame". This may differ from the features we use now; I cannot judge how much influence that difference has.

For the SV implementation, I am trying to run https://github.com/Janghyun1230/Speaker_Verification/. I am trying different learning rates now. However, the learning is also very slow; it seems hopeless and not helpful to me right now.
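
For reference, 40-dimensional log-mel filterbank energies can be computed with librosa along these lines (the 25 ms window / 10 ms hop and the file path are assumptions for illustration):

    import librosa
    import numpy as np

    wav, sr = librosa.load("utterance.wav", sr=16000)  # placeholder path
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=400, hop_length=160, n_mels=40)
    log_mel = np.log(mel + 1e-6)  # shape: (40, n_frames)
    frames = log_mel.T            # one 40-dim feature vector per frame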

sberryman commented 4 years ago

@Liujingxiu23 I have not tried adjusting the mel spectrogram features. Personally, I have a feeling that using spectrogram features as input to the model can be avoided... at least for my use case.

My motivations for the speaker encoder are not in line with replicating voices; I'm more interested in using the embedding for speaker diarization.

Right now I'm training a completely new model on English using 22,553 speakers.

sheelpatel commented 4 years ago

Quick question: if I want this speech generation to work well for a specific person, do I need to train it on a bunch of their annotated speech? Or can I just use the pretrained weights with a bunch of their recordings? Or am I looking for a completely different network architecture?

sheelpatel commented 4 years ago

@sberryman I'd be interested in hearing more samples.

In my experience, target=16000 overlap=800 produces high quality pop-free audio. I used it to make Dr. Kleiner sing: https://www.reddit.com/r/HalfLife/comments/d2rzf0/deepfaked_dr_kleiner_sings_i_am_the_very_model_of/

How did you train this exactly? I am interested in doing something similar.

LordBaaa commented 4 years ago

@sberryman I downloaded your models but I get this error when trying to load them. I tried using your fork but it still doesn't load. Any recommendations? Thanks

model_hidden_size = 256, model_embedding_size = 256: [error screenshot]

model_hidden_size = 768, model_embedding_size = 768: [error screenshot]

sberryman commented 4 years ago

@LordBaaa , did you adjust /encoder/params_model.py? It is complaining about a mismatch of dimensions from what the model expects and what is in the weights file.
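
One way to check what the checkpoint expects is to print its stored weight shapes (assuming it keeps the weights under a "model_state" key, as the checkpoints in this thread appear to; the path is a placeholder):

    import torch

    checkpoint = torch.load("encoder/saved_models/pretrained.pt", map_location="cpu")
    for name, tensor in checkpoint["model_state"].items():
        print(name, tuple(tensor.shape))
    # The final linear layer's shape reveals the hidden size (input dim) and the
    # embedding size (output dim) that params_model.py must be set to.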

LordBaaa commented 4 years ago

Well, like I said, I changed model_hidden_size & model_embedding_size in encoder/params_model.py to 768, like you had mentioned in the past (I read through all the old comments). Should I change anything else? What am I missing?

LordBaaa commented 4 years ago

@sberryman (I didn't put @ at the front last time, so I figured maybe you didn't see my response.) Also, I have been working on training a synth model. I notice the synth and vocoder are the least trained of the pretrained models. I have been gathering and preprocessing datasets, and I want to add in some other languages from Common Voice too. Sometimes I need to stop the preprocessing, but when I do it has to start over from scratch next time. Is there a function I don't know about to pause/save progress, or do I need to look at adding one? Thanks

sberryman commented 4 years ago

@LordBaaa I apologize, I did miss your last response. As far as testing the encoder, that should be all you would need to change. I've never used any of CorentinJ's GUI features so I'm not sure how my encoder model will work there.

I did train a synthesizer and vocoder model based on my encoder weights but it didn't work out very well. My focus has been on the encoder only, unfortunately I am not much help on the other two components.

I've also used the entire Common Voice dataset as part of training (every language available), which is a great resource. I think I've managed to compile various datasets to get my unique speaker count up to just over 39,000.

I also had a lot of conversations with CorentinJ over on his Resemblyzer repository for Resemble AI. https://github.com/resemble-ai/Resemblyzer/issues/13

There is a feature to "resume" the pre-processing step. If you look in the encoder_preprocess.py file you'll see a command line argument called --skip_existing which will skip over files which have already been processed. https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder_preprocess.py#L37-L39
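
For example, resuming an interrupted run looks something like this (the dataset root is a placeholder):

    python encoder_preprocess.py <datasets_root> --skip_existing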

LordBaaa commented 4 years ago

@sberryman Thanks, yeah, --skip_existing is in the synthesizer preprocess as well. Functionally, though, I think it does the work and then checks whether the output exists. I was looking at how some of the code worked (I didn't get super deep, so I may be wrong), but mainly I say this because it will continue to use CPU on, say, file 1/900 even though I know it already exists. Plus it would still be nice to be able to pause it, as opposed to having to wait or completely close the window.

sberryman commented 4 years ago

@LordBaaa The skip-existing code works; I used it many, MANY times. Look at these two lines: they immediately precede the preprocess_wav call which processes each source file. https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder/preprocess.py#L93-L94

There is a ton of room for improving the experience, but this is research code. And quite frankly, it is one of the easiest to read and best documented research projects I've come across.

bmccallister commented 4 years ago

This is an amazing effort by you guys. Thank you for all the assistance.

I am trying to get the latest models from sberryman working but get the following output:

Arguments:
    enc_model_fpath:  encoder/saved_models/pretrained.pt
    syn_model_dir:    synthesizer/saved_models/logs-pretrained
    voc_model_fpath:  vocoder/saved_models/pretrained/pretrained.pt
    low_mem:          False
    no_sound:         False

Running a test of your configuration...

Found 1 GPUs available. Using GPU 0 (GeForce GTX 980M) of compute capability 5.2 with 4.2Gb total memory.

Preparing the encoder, the synthesizer and the vocoder...
Loaded encoder "pretrained.pt" trained to step 2152001
Found synthesizer "pretrained" trained to step 324000
Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at vocoder/saved_models/pretrained/pretrained.pt
Traceback (most recent call last):
  File "demo_cli.py", line 63, in <module>
    vocoder.load_model(args.voc_model_fpath)
  File "/home/lucidz/Documents/projects/sberryman/Real-Time-Voice-Cloning/vocoder/inference.py", line 31, in load_model
    _model.load_state_dict(checkpoint['model_state'])
  File "/home/lucidz/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 839, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for WaveRNN:
    size mismatch for upsample.up_layers.5.weight: copying a param with shape torch.Size([1, 1, 1, 25]) from checkpoint, the shape in current model is torch.Size([1, 1, 1, 17]).

Any thoughts on this?

LordBaaa commented 4 years ago

@bmccallister Yeah I don’t know at this point. If you see my issue it’s very similar but with different torch.Size number problem. This is basically what I have been facing

bmccallister commented 4 years ago

@bmccallister Yeah I don’t know at this point. If you see my issue it’s very similar but with different torch.Size number problem. This is basically what I have been facing

It appears to me that the size of the model that was used in the synthesizer does not match against what is expected in the vocoder.

I had to make assumptions with the vocoder as there was no checkpoint file in the tacotron pretrained zip file, so I copied one I had in order to get past that part of the script.

Have you or anyone else been able to build any other models and publish them anywhere?

I'll also say that I searched every parameter I could for the 25 value, to see if I could change it to 17, and nothing worked. This is why I think it must be an issue with the relationship to one of his other supplied pretrained models.
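
For what it's worth, if the WaveRNN upsample convolutions use a kernel width of 2*factor + 1 (which appears to be the case in this fatchord-style implementation, though I haven't confirmed it), the numbers line up with a mismatch in the upsample factors rather than a stray 25 in the hparams:

    # Relating the reported kernel widths to vocoder upsample factors (assumes
    # each upsample conv has kernel width 2*factor + 1).
    checkpoint_factor = (25 - 1) // 2  # 12 -> factors like (5, 5, 12), hop 5*5*12 = 300
    current_factor = (17 - 1) // 2     # 8  -> default (5, 5, 8),       hop 5*5*8  = 200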

LordBaaa commented 4 years ago

@bmccallister #257. I haven't tried them myself though.

sberryman commented 4 years ago

I don’t think I ever released my synthesizer or vocoder models which were trained on my encoder. They were so poor that I trashed them. Maybe I did and forgot but I wouldn’t recommend using them if I did.

natravedrova commented 4 years ago

@LordBaaa, @bmccallister this is what I changed when trying all three trained models Shaun shared in the thread. The resulting quality was poor indeed. Please try changing the models' settings as follows:

encoder/params_model.py

model_hidden_size = 256
model_embedding_size = 768

synthesizer/hparams.py

speaker_embedding_size=768

vocoder/hparams.py
Just add these lines at the end of the file

n_fft=2048
hop_size=300
win_size=1200
sample_rate=24000
speaker_embedding_size=768
voc_upsample_factors=(5, 5, 12)

bmccallister commented 4 years ago

@LordBaaa, @bmccallister this is what I changed when trying all three trained models Shaun shared in the thread. Resulting quality was poor indeed. Please try to change models' settings as follows:

encoder/params_model.py

model_hidden_size = 256
model_embedding_size = 768

synthesizer/hparams.py

speaker_embedding_size=768

vocoder/hparams.py
Just add these lines at the end of the file

n_fft=2048
hop_size=300
win_size=1200
sample_rate=24000
speaker_embedding_size=768
voc_upsample_factors=(5, 5, 12)

Thank you for this info! So you said that after these parameter changes the quality was bad? Worse than the pt models provided by Corentin?

Did you have any success finding a model that worked better than Corentin's?

I’m sure we will figure this out!! :)

bmccallister commented 4 years ago

I don’t think I ever released my synthesizer or vocoder models which were trained on my encoder. They were so poor that I trashed them. Maybe I did and forgot but I wouldn’t recommend using them if I did.

Thank you for your response!

So should we not bother using the other models you provided up-thread? (It looks like you deleted the comment you had with the links, but natravedrova quoted you, so your links can still be found up-thread.)

Are the encoder, synthesizer, and vocoder not linked, and do the models not need to be trained sequentially?

May I ask, sberryman, what your highest level of success has been, and whether you have tips for repeating it? Or perhaps we could start a repo to host the pretrained models we have all worked on?

sberryman commented 4 years ago

@bmccallister I was extremely happy with the encoder model I trained. Although, if I were to retrain a new model from scratch, I would use 256 as the embedding dimension and keep 768 hidden units. I would also have replaced the ReLU activation with Tanh, as Corentin mentioned in this thread or the one on Resemblyzer.
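
For clarity, a rough sketch of that variant; it mirrors the structure of the repo's SpeakerEncoder but is illustrative, not a drop-in replacement:

    import torch
    import torch.nn as nn

    class EncoderSketch(nn.Module):
        def __init__(self, mel_n_channels=40, hidden_size=768, embedding_size=256, num_layers=3):
            super().__init__()
            self.lstm = nn.LSTM(mel_n_channels, hidden_size, num_layers, batch_first=True)
            self.linear = nn.Linear(hidden_size, embedding_size)
            self.activation = nn.Tanh()  # swapped in for the original ReLU

        def forward(self, mels):  # mels: (batch, frames, mel_n_channels)
            _, (hidden, _) = self.lstm(mels)
            embeds = self.activation(self.linear(hidden[-1]))
            # L2-normalize so embeddings lie on the unit hypersphere.
            return embeds / (embeds.norm(dim=1, keepdim=True) + 1e-5)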

They are linked. If you make any changes to the encoder you need to re-train everything downstream.

Since my focus was never to recreate a voice I never spent much time on the synthesizer or vocoder. If I were to attempt multispeaker synthesis right now, I would be using mellotron from nvidia as my base. https://github.com/NVIDIA/mellotron

natravedrova commented 4 years ago

@LordBaaa, @bmccallister this is what I changed when trying all three trained models Shaun shared in the thread. The resulting quality was poor indeed. Please try changing the models' settings as follows:

encoder/params_model.py

model_hidden_size = 256
model_embedding_size = 768

synthesizer/hparams.py

speaker_embedding_size=768

vocoder/hparams.py
Just add these lines at the end of the file

n_fft=2048
hop_size=300
win_size=1200
sample_rate=24000
speaker_embedding_size=768
voc_upsample_factors=(5, 5, 12)

Thank you for this info! So you said that after these parameter changes the quality was bad? Worse than the pt models provided by Corentin?

It was worse than the default pt models. All voices sounded very similar, there was no difference between male and female voices. Though there is a chance that I did something wrong.

Did you have any success finding a model that worked better than Corentin's?

Unfortunately not.

gdineshk6174 commented 4 years ago

@sberryman Hello, my name is Dinesh. I plan to generate English audio but with an Indian accent, so I started training the model from scratch, beginning with the encoder. The encoder is doing well, but I'm stuck on the synthesizer as I don't have time-aligned transcripts of the audio files. So I thought I could download the pretrained synthesizer and pretrained vocoder and generate audio. It did generate audio from a sample voice, but it still has an American accent. On reading CorentinJ's thesis more carefully, I came to understand that the WaveNet is responsible for the naturalness of the generated voice. So now I'm planning to train only the vocoder on mel spectrograms generated from the downloaded pretrained synthesizer. Do you think this will work? And if it does, how should I proceed? I would really appreciate any insight on how to tackle this problem.

sberryman commented 4 years ago

@gdineshk6174 Hi Dinesh!

I'm not an expert and I failed to produce a good synthesizer and vocoder model, so please don't take anything I say as fact. You should be able to use the pretrained encoder and fine-tune it on your Indian-accent dataset (it most likely won't require much fine-tuning, and may not require any). Once the encoder is producing tight, easily distinguishable clusters for each speaker, you can move on to the synthesizer. The most important thing for the synthesizer/vocoder, from what I've read, is to have clean audio, meaning no background noise. You'll also want quite a bit of training data; this is usually the hardest part.

I never thought about skipping the encoder and synthesizer and jumping straight to the vocoder using the pre-trained models. You can try it and see how it performs, would be interesting if it works and produces high quality speech. Hopefully you have plenty of GPUs available and lots of time, training and running experiments takes quite a bit of time.

bmccallister commented 4 years ago

@sberryman Thank you again for all the help and responses in this thread. Really nice of you to take the time.

I've read through a good portion of https://puu.sh/DHgBg.pdf to try to understand how all this works.

It does appear that the encoder creates the embedding, the synthesizer uses this to build the spectrogram and the vocoder outputs the waveform.

It occurs to me that these processes are sequential and linked. Would it be possible to start with your heavily trained encoder, and then hook up arbitrary datasets for the synthesizer and vocoder?

I.e., can I start the process with your pretrained encoder and then move on to the synth and vocoder after?

My goal is to produce multispeaker (single speaker is honestly okay) English with no accent at all. It seems like that should be relatively simple, but I keep running into issues combining pretrained models (size/scale mismatches, etc.).

I've also looked at NVIDIA's Mellotron, but when I started working on getting that project to run, I had some Python mismatches which made me afraid I might never get the Corentin project to run again if I messed with it :)

sumuk commented 4 years ago

@sberryman Hi, you trained the encoder module for the speaker verification task. Have you benchmarked your model on any dataset? If you have, could you share your benchmark results and the dataset used for benchmarking? I have benchmarked the pre-trained model on the VoxCeleb1 dataset and the results are not looking good: I am getting an EER of 8%.
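
For reference, a minimal sketch of computing EER from verification trial scores (cosine similarities plus same/different labels), using sklearn:

    import numpy as np
    from sklearn.metrics import roc_curve

    def compute_eer(labels, scores):
        # labels: 1 for same-speaker trials, 0 otherwise; scores: similarity values.
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where FPR ~= FNR
        return (fpr[idx] + fnr[idx]) / 2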

mueller91 commented 4 years ago

@shawwn I've uploaded the models to my dropbox. The vocoder is still training and will be for another 24-48 hours. Please share whatever you end up making with them!

Encoder

https://www.dropbox.com/s/xl2wr13nza10850/encoder.zip?dl=0

Synthesizer (Tacotron)

https://www.dropbox.com/s/t7qk0aecpps7842/tacotron.zip?dl=0

Vocoder

https://www.dropbox.com/s/bgzeaid0nuh7val/vocoder.zip?dl=0

Dear all, I've downloaded the models from @sberryman and adapted the hyperparameters accordingly. I created a few examples with them and observe the following: 1) the sound quality is pretty good (clearly understandable, no bleeps or blops, etc.); 2) the voice does not resemble the reference embedding; it's like a 'generic' voice.

I wonder why that is. Did anybody else experience this? Thanks!

ghost commented 4 years ago

Encoder: trained 1.56M steps (20 days with a single GPU) with a batch size of 64
Synthesizer: trained 256k steps (1 week with 4 GPUs) with a batch size of 144
Vocoder: trained 428k steps (4 days with a single GPU) with a batch size of 100

I am trying to squeeze just a little more quality out of Corentin's pretrained models by continuing to train the vocoder while leaving the other models unchanged. This also seems like a reasonable place to start as I still have much to learn. Has anyone else tried this?

My GPU only has 4gb so I reduced the batch size from 100 to 50 to make it fit. I am otherwise using default parameters and the same training set as in the wiki. Loss is slowly but steadily decreasing, from 3.682 to 3.677 after 10 epochs. I'll continue the training and see if results are noticeably better.
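
For anyone replicating this, the only change was the vocoder batch size in vocoder/hparams.py (the parameter name below is assumed from the fatchord-style hparams; double-check against your copy):

    voc_batch_size = 50  # reduced from the default of 100 to fit a 4 GB GPU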

mueller91 commented 4 years ago

Hi @blue-fish, I think the vocoder is actually the strongest part; the synthesizer is what makes or breaks the model. If you look at the MFCCs, you will notice that they are quite weird sometimes; for example, they contain large pauses. If you want to improve the model, train a new synthesizer and possibly a new encoder. I would suggest using Mozilla's TTS as a baseline, as the code here is outdated. Also, use LibriTTS.

ghost commented 4 years ago

I've added another 600k steps to the pretrained vocoder. Loss started at 3.682 and is currently at 3.647. Though I hear an improvement in the samples produced during training, voice cloning results are unchanged. Is there a procedure to benchmark performance?

LordBaaa commented 4 years ago

Hey @blue-fish, do you plan to share your models, and if so, could I get them? Even if they are not finished training, I'd be curious to hear the difference. Thanks. P.S. I am unaware of a benchmark procedure.

ghost commented 4 years ago

Here are some samples @LordBaaa , can you hear the difference? I also provide a download link for the in-work model. No changes to hparams are needed to use it.

Samples: wavs.zip Model: https://www.dropbox.com/s/2skjbec4d67q3zo/vocoder_1159k.pt?dl=0

LordBaaa commented 4 years ago

@blue-fish Awesome, thanks! It's subtle, but yes, I can hear a difference. Listening to 428 vs 1159, I feel like I hear a slight amount of "background noise", like when someone leaves their mic on continuous transmission and there is a little bit of ambient noise. I hear it particularly on the male voice. It seems the improvement makes it "cleaner". When the male voice stops in 428 there is an audio pop/drop; his noisiness, I think, is most notable on his last few words. In 1159 the pop is gone, it is more continuous, and the background noise is reduced or gone. Again, like I say, very subtle, but it is better.

ghost commented 4 years ago

Thanks for the feedback @LordBaaa . I generated that sample five times on the 428k model trying to get that pop to go away, before I became convinced that it was a feature of the model.

Oktai15 commented 4 years ago

Hello @sberryman! Could you provide pretrained weights from https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/126#issuecomment-532400349 for Mixed version?

Liujingxiu23 commented 4 years ago

@blue-fish The wavs that you shared sound good! Are they just the output of the vocoder, or end-to-end results, i.e. using the encoder to predict the embedding and then the tacotron and vocoder models to synthesize?

ghost commented 4 years ago

@Liujingxiu23 They are end-to-end results where I replicate the audio samples of the SV2TTS paper: https://google.github.io/tacotron/publications/speaker_adaptation/

I use the reference audio from VCTK p240 and p260 to create the embedding and generate synthesized samples #0 and #1 using tacotron and the vocoder model.
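
Roughly, this follows the same calls as demo_cli.py; a minimal sketch (paths and text are placeholders, and the synthesizer directory layout is assumed):

    from pathlib import Path
    from encoder import inference as encoder
    from synthesizer.inference import Synthesizer
    from vocoder import inference as vocoder

    encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
    synthesizer = Synthesizer(Path("synthesizer/saved_models/logs-pretrained/taco_pretrained"))
    vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

    # Reference wav -> speaker embedding -> mel spectrogram -> waveform.
    embed = encoder.embed_utterance(encoder.preprocess_wav(Path("p240_reference.wav")))
    specs = synthesizer.synthesize_spectrograms(["This is a synthesized sample."], [embed])
    generated_wav = vocoder.infer_waveform(specs[0])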

sberryman commented 4 years ago

@Oktai15 I thought I posted the links to the encoder for the mixed version. The tacotron and vocoder weights that I trained are useless, but the encoder is quite good. https://www.dropbox.com/s/xl2wr13nza10850/encoder.zip?dl=0