There is no reason the models can't be substituted with more recent ones. Just need someone to work on it.
> There is no reason the models can't be substituted with more recent ones. Just need someone to work on it.
If you can tell me how to do it in detail, I can try it.
I will provide guidance to those who can work independently and will contribute the results as open source.
First task is to get our toolbox to use NVIDIA's pretrained Tacotron2 in place of our synthesizer. The result of this step will be a single voice TTS since the NVIDIA model does not support voice cloning.
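For reference, a minimal sketch of what this first step could look like, assuming the torch.hub entry points published in NVIDIA's DeepLearningExamples (`nvidia_tacotron2`, `nvidia_waveglow`, `nvidia_tts_utils`); the actual wiring into the toolbox UI will look different:

```python
import torch

# Single-voice TTS with NVIDIA's pretrained models (no voice cloning yet).
# Assumes the torch.hub entry points from NVIDIA/DeepLearningExamples.
tacotron2 = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                           "nvidia_tacotron2", model_math="fp32")
waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                          "nvidia_waveglow", model_math="fp32")
utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tts_utils")

tacotron2 = tacotron2.to("cuda").eval()
waveglow = waveglow.remove_weightnorm(waveglow).to("cuda").eval()

text = "Hello, this is a test of the pretrained Tacotron 2."
sequences, lengths = utils.prepare_input_sequence([text])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> mel spectrogram
    audio = waveglow.infer(mel)                      # mel -> waveform at 22050 Hz
```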
By the way, this is very similar to how I started working on #472.
How should the architecture of the Tacotron 2 be modified to be able to clone a voice?
We'll get to that when the above task is complete.
Okay, I will write tomorrow.
I already tried to implement Tacotron 2 myself, and it is training now, but it seems to me that I did something wrong, because my alignments around 40k steps look something like this: [attached alignment plot]. I took NVIDIA's implementation of Tacotron 2 and added the speaker embedding vector in front of the attention layer. For training I use LibriTTS (train-clean-100 + train-clean-360), Hi-Fi TTS, and VCTK.
> I already tried to implement Tacotron 2 myself
I was not aware of this. If you have made progress on integrating Tacotron2 into this repo, share your code and we can start from there.
The approach I suggest is to first try the pretrained models in the toolbox UI to make sure everything is integrated properly. Once that is working, it is a good foundation for remaining work.
I didn't exactly use this repository and didn't integrate into it, but I was inspired by it; I've been researching your repository for a very long time. Here is how I implemented everything: in Tacotron 2, the input to the attention layer has dimension 512. I concatenate the speaker embedding vector of dimension 256 to it, and after that I apply a linear layer 768 -> 512. I cannot share the complete code because of my company's policy.
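To make that concrete, here is a rough PyTorch sketch of the described change; the module and variable names are illustrative, not the actual code:

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Concatenate a 256-dim speaker embedding to the 512-dim encoder outputs
    feeding the attention layer, then project 768 -> 512."""
    def __init__(self, encoder_dim=512, speaker_dim=256):
        super().__init__()
        self.project = nn.Linear(encoder_dim + speaker_dim, encoder_dim)

    def forward(self, encoder_outputs, speaker_embedding):
        # encoder_outputs: (batch, time, 512); speaker_embedding: (batch, 256)
        tiled = speaker_embedding.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
        combined = torch.cat([encoder_outputs, tiled], dim=-1)  # (batch, time, 768)
        return self.project(combined)                           # (batch, time, 512)
```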
> In Tacotron 2, the input to the attention layer has dimension 512. I concatenate the speaker embedding vector of dimension 256 to it, and after that I apply a linear layer 768 -> 512.
Try it without the "linear layer 768 -> 512". That is introducing an information bottleneck which is making it difficult to learn attention.
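Dropping the projection means the decoder attends over the full 768-dim memory, so the layers that consume it have to be widened instead. A sketch of the affected sizes, assuming hyperparameter names similar to NVIDIA's implementation:

```python
# Keep the concatenated memory at 768 and widen the consumers instead of
# projecting back to 512 (names mirror NVIDIA's hparams, but verify locally).
encoder_embedding_dim = 512
speaker_embedding_dim = 256
memory_dim = encoder_embedding_dim + speaker_embedding_dim  # 768

# Modules whose input sizes depend on the memory width:
#   attention memory layer: Linear(memory_dim, attention_dim)
#   attention RNN input:    prenet_dim + memory_dim
#   decoder RNN input:      attention_rnn_dim + memory_dim
#   output projection in:   decoder_rnn_dim + memory_dim
```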
@e0xextazy I am going to mark the previous set of posts as off-topic. If you are able to contribute to this repo, it would be appreciated. If not, I understand.
Maybe we should create some kind of space where we can communicate? I think the most I can share is a model, weights, and a fairly detailed guide (without extending the code of this repository), but this still needs to be discussed with the company's management.
I propose that we use NVIDIA's Tacotron2 and train a voice cloning model from scratch on public datasets. Under these conditions, would your company need to be involved at all?
I have some exciting news to share, Nvidia's Tacotron2 has been integrated into my repo: https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/878_tacotron2
It is almost done, it just needs some modifications in the decoder to account for the speaker embedding. @e0xextazy If you are able, would you please share the changes you made to save me some time?
Both inference and training have been tested and work, but require CUDA for now. I am currently training a single-speaker model to make sure it can learn attention. After that, I will work on a pretrained model using LibriSpeech.
Training outputs at 7500 steps. It has learned attention, but the mel prediction is not as good as Tacotron 1 at a similar stage. Perhaps the implementation of the reduction factor can be improved.
If you look carefully at the predicted spectrogram for `hparams.n_frames_per_step=7` above, you'll see some artifacts in the spectrogram. I don't have an explanation for this. But it does not happen with `hparams.n_frames_per_step=1`, after a sufficient number of steps have been trained.
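For context, the reduction factor works roughly like this (a simplified sketch, not the exact code in the branch): with `n_frames_per_step = r`, each decoder step emits `r * n_mel_channels` values that get reshaped back into r consecutive frames, so a single bad decoder step shows up as a block of r frames, which could explain periodic artifacts at r=7 that disappear at r=1.

```python
import torch

n_mel_channels = 80
r = 7                                   # hparams.n_frames_per_step
batch, decoder_steps = 16, 100

# Each decoder step predicts r frames at once:
decoder_outputs = torch.randn(batch, decoder_steps, n_mel_channels * r)

# Reshape back into an ordinary mel sequence of decoder_steps * r frames:
mels = decoder_outputs.view(batch, decoder_steps * r, n_mel_channels).transpose(1, 2)
print(mels.shape)  # torch.Size([16, 80, 700])
```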
@e0xextazy My implementation of SV2TTS in Nvidia Taco2: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/f9ad04373ffc0309295eb36ca73f8c27143cf2e6
> I have some exciting news to share, Nvidia's Tacotron2 has been integrated into my repo: https://github.com/blue-fish/Real-Time-Voice-Cloning/tree/878_tacotron2
> It is almost done, it just needs some modifications in the decoder to account for the speaker embedding. @e0xextazy If you are able, would you please share the changes you made to save me some time?
> Both inference and training have been tested and work, but require CUDA for now. I am currently training a single-speaker model to make sure it can learn attention. After that, I will work on a pretrained model using LibriSpeech.
What do you want me to provide you with? I do not understand the request.
Dear blue-fish, could you give more advice on preparing data for training? Maybe some datasets can be combined for better results, and so on. Can you tell me briefly what has been done and what else needs to be done, in your opinion?
> Both inference and training have been tested and work, but require CUDA for now. I am currently training a single-speaker model to make sure it can learn attention. After that, I will work on a pretrained model using LibriSpeech.
One question that interests me: why are you going to use LibriSpeech and not LibriTTS? Are they different in some way?
> What do you want me to provide you with? I do not understand the request.
At the time, I wanted your code changes to add the speaker embedding. But after I finished checking the model with a single-speaker dataset, I went ahead and made the code change myself in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/878#issuecomment-968587719.
> One question that interests me: why are you going to use LibriSpeech and not LibriTTS? Are they different in some way?
LibriTTS is the better dataset because the utterances are shorter and the transcripts contain punctuation. However, the punctuation does make it a little harder for the model to learn. I have already done a lot of experiments with LibriSpeech, so I start with that one to benchmark the model against others I have developed.
> Dear blue-fish, could you give more advice on preparing data for training? Maybe some datasets can be combined for better results, and so on. Can you tell me briefly what has been done and what else needs to be done, in your opinion?
I am still trying to get a good TTS based on a single dataset: matching the results in the Tacotron 1/2 and SV2TTS papers. There is still a quality difference between our synthesizer and the one that Google demonstrated 3 years ago.
Then I'll try to start a LibriTTS training run tomorrow, and I can keep you informed if you're interested. P.S. I want to train at a 22050 Hz sample rate, though.
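For reference, the audio settings commonly paired with 22050 Hz in NVIDIA's Tacotron2 and most HiFi-GAN configs are below; treat them as a starting point to check against whichever vocoder the spectrograms will be fed to, not the exact values used here.

```python
# Typical 22050 Hz mel settings (NVIDIA Tacotron2 defaults, shared by common
# HiFi-GAN configs); the synthesizer and vocoder must agree on all of these.
sampling_rate  = 22050
filter_length  = 1024   # FFT size
hop_length     = 256    # ~11.6 ms per frame
win_length     = 1024
n_mel_channels = 80
mel_fmin       = 0.0
mel_fmax       = 8000.0
```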
@blue-fish Can I contact you outside of Github?
@e0xextazy Please contact me here (on Github). Keep in mind that I only provide assistance for open-source projects, and only as my time and interest allow.
This is what I have at the moment. I am also attaching the training hyperparameters: hparams.txt. Is there any way you could comment on the learning process? I can't attach the archive as a file (some internal error on GitHub), so I'm sharing a Google Drive link where you can download it: 59500_steps.tar.gz. My mel-spectrograms look very bad, what could this be due to?
> My mel-spectrograms look very bad, what could this be due to?
I have been comparing my LibriSpeech training plots to Taco2 trained with the old TensorFlow repo, and the quality is the same. Compared to Taco1, the training mels look worse, but Taco2 inference quality is still good.
How long do you think a model needs to be trained, for example on a LibriTTS dataset consisting of 460 hours of audio data? Perhaps there is a certain number of steps, or some value of the loss function, at which the output will be good. What else do you think about LibriTTS data processing? It might be worth changing the current pipeline. If my results are good, I can share the model weights for a 22050 Hz sampling rate.
I am trying to train a synthesizer for further use with the HiFi-GAN vocoder.
Why not substitute the current models with more recent ones? Both Tacotron2 and WaveGlow are implemented in PyTorch.