CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

How to use other vocoders #1035

Closed Adibian closed 2 years ago

Adibian commented 2 years ago

Hi, thanks for this great project. I trained the synthesizer on my data, used the pretrained vocoder, and the result was good. Now I want to use other vocoders. I cloned the HiFi-GAN project and tried to run it with the mel-spectrogram obtained from the synthesizer:

import torch

# synthesizer is RTVC's Synthesizer; generator and MAX_WAV_VALUE come from HiFi-GAN's inference code
texts = ['my test here']
embeds = [embed] * len(texts)
specs = synthesizer.synthesize_spectrograms(texts, embeds)

with torch.no_grad():
    x = specs[0]                          # mel-spectrogram from the synthesizer
    x = torch.FloatTensor(x).to(device)
    y_g_hat = generator(x)                # HiFi-GAN generator
    audio = y_g_hat.squeeze()
    audio = audio * MAX_WAV_VALUE
    audio = audio.cpu().numpy().astype('int16')

But I got this error:

RuntimeError: Expected 3-dimensional input for 3-dimensional weight [512, 80, 7], but got 2-dimensional input of size [80, 527] instead

Then I tried to use WaveGlow and got the same error:

RuntimeError: Expected 3-dimensional input for 3-dimensional weight [80, 80, 1024], but got 2-dimensional input of size [80, 527] instead

Why is the result of the synthesizer two-dimensional? I think the synthesizer should return a mel-spectrogram, and these vocoders take mel-spectrograms as input, so using them should be possible. But how? Thanks for any suggestion.

Adibian commented 2 years ago

I figured out what I should do to fix this error: x = x.unsqueeze(0). But the synthesized speech is very noisy! You can listen to the result here. I would appreciate it if anyone with experience using HiFi-GAN could explain how to run it with the output of this project.
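For reference, here is a minimal sketch of the inference loop with the missing batch dimension added (variable names follow the snippet above; generator and MAX_WAV_VALUE are assumed to come from HiFi-GAN's inference code):

import torch

with torch.no_grad():
    x = torch.FloatTensor(specs[0]).to(device)  # [n_mels, frames], e.g. [80, 527]
    x = x.unsqueeze(0)                           # add a batch dimension -> [1, 80, 527], as Conv1d expects
    y_g_hat = generator(x)                       # HiFi-GAN output: [1, 1, samples]
    audio = y_g_hat.squeeze()
    audio = (audio * MAX_WAV_VALUE).cpu().numpy().astype('int16')

Note that this only fixes the shape error; as explained below, the remaining noise comes from a mel scaling mismatch, not from the tensor shape.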

raccoonML commented 2 years ago

Your synthesizer is predicting spectrograms with a different scaling than the hifigan model expects. To fix this, you will need to retrain your model with properly scaled data. Replace your synthesizer/audio.py with this file and preprocess the data again.
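As I understand it, RTVC's synthesizer/audio.py produces dB-scaled mels normalized symmetrically to [-4, 4] (max_abs_value = 4, min_level_db = -100, ref_level_db = 20 in the default hparams), while jik876/hifi-gan trains on the natural log of mel amplitudes. A rough, illustrative conversion between the two would look like the sketch below; it is not a substitute for retraining, since the two projects also differ in STFT and mel-filterbank settings:

import numpy as np

def rtvc_mel_to_hifigan_mel(mel, max_abs_value=4.0, min_level_db=-100, ref_level_db=20):
    """Illustrative only: map an RTVC-normalized mel to HiFi-GAN-style log mels."""
    # undo RTVC's symmetric [-max_abs_value, max_abs_value] normalization back to decibels
    db = (np.clip(mel, -max_abs_value, max_abs_value) + max_abs_value) \
         * (-min_level_db) / (2 * max_abs_value) + min_level_db
    db = db + ref_level_db
    # decibels -> linear amplitude -> natural log (HiFi-GAN's spectral normalization)
    amp = np.power(10.0, db / 20.0)
    return np.log(np.clip(amp, 1e-5, None))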

Adibian commented 2 years ago

Thanks for your quick answer. Which parameters should change? Is it possible to train HiFi-GAN with these parameters instead of retraining the synthesizer?

Another question: are there any standard parameters for this task? If I retrain the synthesizer with the new parameters, which vocoders can I use (besides the HiFi-GAN you mentioned)? Do I have to train the synthesizer for each vocoder specifically?

raccoonML commented 2 years ago

Is it possible to train HiFi-GAN with these parameters instead of retraining the synthesizer?

Yes. Here's a pretrained model for testing: https://github.com/raccoonML/hifigan-demo/releases/tag/MLRTVC-v1 No training code provided, unfortunately.
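For anyone testing that checkpoint, loading it presumably follows the upstream jik876/hifi-gan inference conventions. A minimal sketch (the config and checkpoint filenames below are placeholders, not necessarily the names used in the release):

import json
import torch
from env import AttrDict      # from the jik876/hifi-gan repo
from models import Generator  # from the jik876/hifi-gan repo

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

with open('config.json') as f:                            # config shipped with the checkpoint
    h = AttrDict(json.load(f))

generator = Generator(h).to(device)
state = torch.load('generator.pt', map_location=device)   # placeholder checkpoint filename
generator.load_state_dict(state['generator'])
generator.eval()
generator.remove_weight_norm()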

If I retrain the synthesizer with the new parameters, which vocoders can I use (besides the HiFi-GAN you mentioned)?

Waveglow should work.

Adibian commented 2 years ago

Your synthesizer is predicting spectrograms with a different scaling than the hifigan model expects. To fix this, you will need to retrain your model with properly scaled data. Replace your synthesizer/audio.py with this file and preprocess the data again.

To retrain the synthesizer, do I just replace audio.py with the new file? Are any changes needed in hparams.py or any other configuration in this project?

raccoonML commented 2 years ago

No updates to hparams.py are required. With the new audio.py, some settings no longer have an effect, like min_level_db and ref_level_db.
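For context, the HiFi-GAN-style mel pipeline is just the natural log of the mel amplitudes, which is why those dB constants drop out. A minimal sketch of such a melspectrogram function, assuming RTVC's default STFT settings (16 kHz audio, n_fft 800, hop 200, win 800, 80 mel bands); the actual replacement audio.py may differ:

import librosa
import numpy as np

def melspectrogram(wav, sr=16000, n_fft=800, hop_length=200, win_length=800, n_mels=80):
    # magnitude STFT -> mel filterbank -> natural log; min_level_db / ref_level_db play no role
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length,
                                         win_length=win_length, n_mels=n_mels, power=1.0)
    return np.log(np.clip(mel, 1e-5, None))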

Adibian commented 2 years ago

@raccoonML I retrained the synthesizer after replacing audio.py. After ~100k steps (~35 epochs), the result is better but still noisy (this file is a sample result). The loss is about 0.64 and does not seem to change much beyond this point. What do you think the problem is?

manuel3265 commented 2 years ago

Hello @raccoonML, sorry to bother you. I wanted to know if you found a solution for this. I hope you can answer me, thank you.

raccoonML commented 2 years ago

@manuel3265 Can you be more specific? If you're referring to this, it's not something I can help with, since problems like that are often particular to the dataset used. For those types of problems, switch to LibriSpeech or LibriTTS train-clean-100/360 and see if the problem goes away. Since those datasets are known to work, this will help you determine whether the problem is in your code or in your training data.

manuel3265 commented 2 years ago

@raccoonML Sorry for not being specific; I was referring to the use of HiFi-GAN with this repository. Were you able to implement it, or did you manage to implement some other vocoder?

raccoonML commented 2 years ago

I have an open issue to integrate hifigan with the MLRTVC fork. It's on hold for a few reasons: 1) lackluster results with the hifigan pretrained models (here), 2) more cleanup is required before I'd feel comfortable releasing the code, and 3) my supporters have me working on voice conversion instead of TTS.

If this is something you want to try, I would suggest integrating Nvidia tacotron2 with hifigan since those repos use the same mel scaling. Another possibility is to train a hifigan model on RTVC mels like this example.
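A rough sketch of that Tacotron2 + HiFi-GAN pairing, assuming NVIDIA's published torch.hub entry points ('nvidia_tacotron2', 'nvidia_tts_utils') are still available and reusing a HiFi-GAN generator loaded as in the earlier snippet. Tacotron2 already emits natural-log mels, so no rescaling is needed:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# NVIDIA Tacotron 2 and its text utilities via torch.hub (entry point names may change)
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp32')
tacotron2 = tacotron2.to(device).eval()
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')

sequences, lengths = utils.prepare_input_sequence(["my test here"])
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # [1, 80, frames], already log-mel scale
    audio = generator(mel).squeeze()                 # generator: HiFi-GAN model from the earlier sketch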

Edit: Adding to the list of reasons: 4) perceived lack of interest in continued development of RTVC or MLRTVC.