CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

tacotron2 and waveglow #878

Closed · ireneb612 closed this issue 2 years ago

ireneb612 commented 3 years ago

Why not substitute the current models with more recent versions? Both Tacotron2 and WaveGlow are implemented in PyTorch.

ghost commented 3 years ago

There is no reason the models can't be substituted with more recent ones. Just need someone to work on it.

e0xextazy commented 3 years ago

There is no reason the models can't be substituted with more recent ones. Just need someone to work on it.

If you can tell me how to do it in detail, I can try it.

ghost commented 3 years ago

I will provide guidance to those who can work independently and will contribute the results as open source.

First task is to get our toolbox to use NVIDIA's pretrained Tacotron2 in place of our synthesizer. The result of this step will be a single voice TTS since the NVIDIA model does not support voice cloning.

By the way, this is very similar to how I started working on #472.
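
For reference, here is a minimal sketch of that first task, using the entry-point names from NVIDIA's published torch.hub examples (a CUDA device is assumed; the exact names and keyword arguments should be checked against the current hub page):

```python
import torch

# Load NVIDIA's pretrained Tacotron2 and WaveGlow from torch.hub
# (entry points as in NVIDIA's published examples; CUDA assumed).
tacotron2 = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                           "nvidia_tacotron2", model_math="fp32")
tacotron2 = tacotron2.to("cuda").eval()

waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                          "nvidia_waveglow", model_math="fp32")
waveglow = waveglow.remove_weightnorm(waveglow).to("cuda").eval()

# Text preprocessing utilities shipped alongside the hub models
utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tts_utils")
sequences, lengths = utils.prepare_input_sequence(["Clone a voice in five seconds."])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # single-voice mel prediction
    audio = waveglow.infer(mel)                      # 22050 Hz waveform
```

Wiring the resulting mel/audio into the toolbox UI in place of the existing synthesizer is the integration work being asked for here.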

e0xextazy commented 3 years ago

How should the architecture of the Tacotron 2 be modified to be able to clone a voice?

ghost commented 3 years ago

We'll get to that when the above task is complete.

e0xextazy commented 3 years ago

Okay, I will write back tomorrow.

e0xextazy commented 3 years ago

It's just that I already tried to implement Tacotron 2 myself, and it is training now, but it seems to me that I did something wrong, because my alignments around 40k steps look something like this: [three screenshots of attention alignments]. I took Nvidia's implementation of Tacotron 2 and added the speaker vector in front of the attention layer. For training I use LibriTTS 100 + 360, Hi-Fi TTS, and VCTK.

ghost commented 3 years ago

It's just that I already tried to implement Tacotron 2 myself

I was not aware of this. If you have made progress on integrating Tacotron2 into this repo, share your code and we can start from there.

The approach I suggest is to first try the pretrained models in the toolbox UI to make sure everything is integrated properly. Once that is working, it is a good foundation for remaining work.

e0xextazy commented 3 years ago

I didn't exactly use this repository and didn't integrate into it, but I was inspired by it; I have been studying your repository for a very long time. Here is how I implemented everything: in Tacotron 2, the input to the attention layer has dimension 512. I concatenate the speaker embedding vector of dimension 256 to it, and after that I apply a linear layer 768 -> 512. I cannot share the complete code because of my company's policy.

ghost commented 3 years ago

In Tacotron 2, the input to the attention layer has dimension 512. I concatenate the speaker embedding vector of dimension 256 to it, and after that I apply a linear layer 768 -> 512.

Try it without the "linear layer 768 -> 512". That introduces an information bottleneck which makes it difficult to learn attention.
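
A minimal sketch of that suggestion (hypothetical helper, not the exact code under discussion): broadcast the 256-dim speaker embedding across time and concatenate it to the 512-dim encoder outputs, passing the 768-dim result straight to attention instead of projecting it back down to 512:

```python
import torch

def add_speaker_embedding(encoder_outputs, speaker_embedding):
    """Concatenate a per-utterance speaker embedding to every encoder frame.

    encoder_outputs:   (batch, time, 512)
    speaker_embedding: (batch, 256)
    returns:           (batch, time, 768) -- no 768 -> 512 projection
    """
    batch, time, _ = encoder_outputs.shape
    spk = speaker_embedding.unsqueeze(1).expand(batch, time, -1)
    return torch.cat((encoder_outputs, spk), dim=-1)
```

Done this way, the attention and decoder layers that expect a 512-dim memory need their input sizes widened to 768.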

ghost commented 3 years ago

@e0xextazy I am going to mark the previous set of posts as off-topic. If you are able to contribute to this repo, it would be appreciated. If not, I understand.

e0xextazy commented 3 years ago

Maybe we should create some kind of space where we can communicate? I think the most I can share is a model, weights, and a fairly detailed guide (without extending the code of this repository), but this still needs to be discussed with the company's management.

ghost commented 3 years ago

I propose that we use NVIDIA's Tacotron2 and train a voice cloning model from scratch on public datasets. Under these conditions, would your company need to be involved at all?

ghost commented 3 years ago

I have some exciting news to share: Nvidia's Tacotron2 has been integrated into my repo at https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/878_tacotron2

It is almost done; it just needs some modifications in the decoder to account for the speaker embedding. @e0xextazy If you are able, would you please share the changes you made to save me some time?

Both inference and training have been tested and work, but require CUDA for now. I am currently training a single-speaker model to make sure it can learn attention. After that, I will work on a pretrained model using LibriSpeech.

ghost commented 3 years ago

Training outputs at 7500 steps. It has learned attention, but the mel prediction is not as good as Tacotron1 at a similar stage. Perhaps the implementation of the reduction factor can be improved.

Tacotron2, r=7, steps=7500

[attention and predicted-mel plot: attention_step_7500_sample_1]

Tacotron1, r=2, steps=7500

[attention and predicted-mel plot]

ghost commented 3 years ago

If you look carefully at the predicted spectrogram for hparams.n_frames_per_step=7 above, you'll see some artifacts. I don't have an explanation for this, but it does not happen with hparams.n_frames_per_step=1 once a sufficient number of steps have been trained.
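
For context, a rough illustration of what the reduction factor does (hypothetical helper, not code from either repo): with hparams.n_frames_per_step = r, each decoder step emits r mel frames at once, targets are padded to a multiple of r, and the flat decoder output is reshaped back into a spectrogram. The coarser per-step prediction at r=7 is one plausible source of the artifacts described above.

```python
import torch

def decoder_outputs_to_mel(decoder_outputs, n_mel_channels=80, r=7):
    """Unfold decoder outputs of shape (batch, steps, n_mel_channels * r)
    into a mel spectrogram of shape (batch, n_mel_channels, steps * r)."""
    batch = decoder_outputs.size(0)
    mel = decoder_outputs.reshape(batch, -1, n_mel_channels)  # (batch, steps * r, n_mels)
    return mel.transpose(1, 2)
```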

ghost commented 3 years ago

@e0xextazy My implementation of SV2TTS in Nvidia Taco2: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/f9ad04373ffc0309295eb36ca73f8c27143cf2e6

https://github.com/blue-fish/Real-Time-Voice-Cloning/blob/f9ad04373ffc0309295eb36ca73f8c27143cf2e6/synthesizer/models/tacotron2/model.py#L548-L575
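
For readers without the fork checked out, here is a hedged sketch of one common way to make the decoder account for the speaker embedding (the linked commit is the authoritative version; the class and dimensions below are illustrative): concatenate the embedding to the attention-RNN input at every decoder step, alongside the prenet output and attention context, and widen that RNN's input size to match.

```python
import torch
import torch.nn as nn

class SpeakerConditionedAttentionRNN(nn.Module):
    """Attention RNN whose input also carries a fixed speaker embedding."""

    def __init__(self, prenet_dim=256, context_dim=512, speaker_dim=256, rnn_dim=1024):
        super().__init__()
        self.cell = nn.LSTMCell(prenet_dim + context_dim + speaker_dim, rnn_dim)

    def forward(self, prenet_out, attention_context, speaker_embedding, state):
        # Every decoder step sees the same utterance-level speaker embedding.
        x = torch.cat((prenet_out, attention_context, speaker_embedding), dim=-1)
        return self.cell(x, state)
```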

e0xextazy commented 3 years ago

I have some exciting news to share: Nvidia's Tacotron2 has been integrated into my repo at https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/878_tacotron2

It is almost done; it just needs some modifications in the decoder to account for the speaker embedding. @e0xextazy If you are able, would you please share the changes you made to save me some time?

Both inference and training have been tested and work, but require CUDA for now. I am currently training a single-speaker model to make sure it can learn attention. After that, I will work on a pretrained model using LibriSpeech.

What do you want me to provide you with? I do not understand the request.

e0xextazy commented 3 years ago

Dear blue-fish, could you give more advice on preparing data for training? Maybe some datasets could be combined for better results, and so on. Can you tell me briefly what has been done and what else needs to be done, in your opinion?

e0xextazy commented 3 years ago

Both inference and training have been tested and work, but require CUDA for now. I am currently training a single-speaker model to make sure it can learn attention. After that, I will work on a pretrained model using LibriSpeech.

One question that interests me is why you are going to use LibriSpeech and not LibriTTS. Are they different in some way?

ghost commented 3 years ago

What do you want me to provide you with? I do not understand the request.

At the time, I wanted your code changes to add the speaker embedding. But after I finished checking the model with a single-speaker dataset, I went ahead and made the code change myself in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/878#issuecomment-968587719.

One question that interests me is why you are going to use LibriSpeech and not LibriTTS. Are they different in some way?

LibriTTS is the better dataset, because the utterances are shorter and the transcripts contain punctuation. However, the punctuation does make it a little harder for the model to learn. I have already done a lot of experiments with LibriSpeech, so I am starting with that one to benchmark the model against others I have developed.

Dear blue-fish, could you give more advice on preparing data for training? Maybe some datasets could be combined for better results, and so on. Can you tell me briefly what has been done and what else needs to be done, in your opinion?

I am still trying to get a good TTS based on a single dataset: matching the results in the Tacotron 1/2 and SV2TTS papers. There is still a quality difference between our synthesizer and the one that Google demonstrated 3 years ago.

e0xextazy commented 3 years ago

Then I'll try to start LibriTTS training tomorrow, and I can keep you informed if you're interested. P.S. I want to train at a 22050 Hz sample rate, though.

e0xextazy commented 2 years ago

@blue-fish Can I contact you outside of Github?

ghost commented 2 years ago

@e0xextazy Please contact me here (on Github). Keep in mind that I only provide assistance for open-source projects, and only as my time and interest allow.

e0xextazy commented 2 years ago

This is what I have at the moment. I am also attaching the training hyperparameters (hparams.txt). Is there any way you could comment on the learning process? I can't attach the archive as a file (some internal error on GitHub), so I'm sharing a Google Drive link where you can download it: 59500_steps.tar.gz. My mel-spectrograms look very bad; what could this be due to?

ghost commented 2 years ago

My mel-spectrograms look very bad; what could this be due to?

I have been comparing my LibriSpeech training plots to Taco2 trained with the old TensorFlow repo, and the quality is the same. Compared to Taco1, the training mels look worse, but Taco2 inference quality is still good.

e0xextazy commented 2 years ago

How long do you think the model needs to be trained, for example on a LibriTTS dataset consisting of 460 hours of audio? Perhaps there is a certain number of steps, or a certain value of the loss function, at which the output will be good. What else do you think about the LibriTTS data processing? It might be worth changing the current pipeline. If my results are good, I can share the model weights for a 22050 Hz sampling rate.

I am trying to train a synthesizer for further use with the HiFi-GAN vocoder.