DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0
1.43k stars · 160 forks

Can I improve the generated output naturalness? #59

Closed Ca-ressemble-a-du-fake closed 1 year ago

Ca-ressemble-a-du-fake commented 1 year ago

Hi,

I have been playing around with Toucan TTS for some time and it is really easy to use and training is fast. I finetuned the provided Meta pretrained model with an 8-hour dataset and the result is not as good as I was expecting. So I wonder if I could make it even better, or if you could help me spot where the "problem" lies in the generated audio:

Here are the waveforms (top: Coqui VITS, a 260k-step model trained from scratch; bottom: Toucan FastSpeech2, 200k steps finetuned from the Meta model): [image: ToucanVsCoquiWaveforms]

The associated spectrograms: [image: ToucanVsCoquiSpectrograms]

And the audios :

This is from Coqui VITS; I find it crystal clear: https://user-images.githubusercontent.com/91517923/202889766-0c2ad9ad-2ec2-4376-9abc-17a008e58364.mp4

This is from FastSpeech2. It sounds like an old tape recording; the voice seems to shiver (I don't know if that's the right term!): https://user-images.githubusercontent.com/91517923/202889734-3a02486d-3785-4e83-8365-614c6ac0f64f.mp4

Both generated audio clips were compressed to mp4 so they could be posted, but they are pretty close to what the wavs sound like (to my ear there is no difference).

So how can I make the Toucan FastSpeech2 model sound better? Should I train it for more steps, or is it on the contrary over-trained / over-fitted? Or would the only way be to implement VITS in Toucan (I don't think that is straightforward to do)?

Thank you in advance for helping me improve the results!

Flux9665 commented 1 year ago

I think the problem is mostly the vocoder model. FastSpeech produces a spectrogram, and that spectrogram is then transformed into a wave using a vocoder model called Avocodo. VITS produces a wave directly. The approach with individual modules that we use here is better suited for low-resource scenarios and offers more controllability, but the quality is overall a bit worse. My goal at the moment is to build very, very multilingual TTS, so I'm trading in some quality for the other benefits. I plan to revisit the vocoder training / architecture and try to improve it, but for now I think your model is pretty close to as good as it can get with this architecture. I'm pretty busy at the moment with teaching because it's the middle of the semester, but once I have some more time I'll revisit the vocoder, try to scale up the parameter counts, and reduce the amount of upsampling that it does to make the task easier (48kHz might be overkill, 24kHz would probably be enough).
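To make the contrast concrete, here is a minimal sketch of the two-stage interface versus the single-stage one. The toy modules below are stand-ins for illustration only, not the actual Toucan or VITS classes:

```python
# Toy stand-ins for the two-stage pipeline: text -> spectrogram -> wave.
# Not the IMS-Toucan API, just an illustration of the interfaces involved.
import torch

class ToySpectrogramTTS(torch.nn.Module):
    """Stands in for FastSpeech2: phoneme IDs -> mel spectrogram frames."""
    def __init__(self, vocab_size=100, mel_dim=80):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, 256)
        self.proj = torch.nn.Linear(256, mel_dim)

    def forward(self, phoneme_ids):
        return self.proj(self.embed(phoneme_ids))  # (T, 80) mel frames

class ToyVocoder(torch.nn.Module):
    """Stands in for Avocodo: mel spectrogram -> waveform samples."""
    def __init__(self, mel_dim=80, hop_length=256):
        super().__init__()
        self.upsample = torch.nn.Linear(mel_dim, hop_length)

    def forward(self, mel):
        return self.upsample(mel).flatten()  # (T * hop_length,) samples

phoneme_ids = torch.randint(0, 100, (50,))   # 50 phoneme IDs
mel = ToySpectrogramTTS()(phoneme_ids)       # stage 1: text -> spectrogram
wave = ToyVocoder()(mel)                     # stage 2: spectrogram -> wave
# A VITS-style model maps phoneme_ids to wave in one network, trading the
# modularity and controllability described above for end-to-end quality.
```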

Ca-ressemble-a-du-fake commented 1 year ago

Got it, thanks for taking the time to answer! Teaching first!

tomschelsen commented 1 year ago

+1 for 24kHz instead of 48kHz as the final sample rate; for speech there shouldn't be an audible difference.

Flux9665 commented 1 year ago

There might be a very slight audible difference, I think. At 24kHz the Nyquist frequency is 12kHz, so there might be some aliasing in the audible range. For human speech recorded at 24kHz vs. recorded at 48kHz there might be a small but audible difference. With all of the flaws of the vocoded speech, however, I think there will be fewer flaws with the reduced complexity, even though it lowers the theoretical upper bound for quality.
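To put numbers on the sample-rate point, here is a small sketch (assuming torchaudio is installed; this is not part of the Toucan codebase) showing the Nyquist limits and a 48kHz-to-24kHz resample:

```python
# Nyquist frequency is half the sample rate: at 24 kHz, nothing above 12 kHz
# can be represented, while 48 kHz preserves content up to 24 kHz.
import torch
import torchaudio.functional as F

sr_in, sr_out = 48_000, 24_000
wave_48k = torch.randn(1, sr_in)  # one second of dummy audio at 48 kHz
wave_24k = F.resample(wave_48k, orig_freq=sr_in, new_freq=sr_out)

print(f"Nyquist at {sr_in} Hz: {sr_in // 2} Hz")    # 24000 Hz
print(f"Nyquist at {sr_out} Hz: {sr_out // 2} Hz")  # 12000 Hz
print(wave_24k.shape)                               # torch.Size([1, 24000])
```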

I tried this in a branch made for testing, where I also test another TTS architecture that should in theory sound a bit better. It's highly unstable, full of bugs, and for the TTS itself the code isn't even complete, but it's slowly being worked on :)

iamkhalidbashir commented 1 year ago

Would fine-tuning the vocoder improve quality?

Flux9665 commented 1 year ago

From what I have heard so far, finetuning the vocoder on a certain speaker does not yield much improvement; however, finetuning the vocoder on the spectrograms produced by the TTS rather than on the spectrograms of ground-truth speech can be very beneficial.

I haven't implemented such an end-to-end finetuning component yet, but I plan to add it at some point. Essentially, just going from the text to the spectrogram and then further to the wave, and continuing training with the GAN setup. I'm not sure whether freezing the spectrogram generator or updating it fully end-to-end would be better.
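As a rough sketch of that idea (a hypothetical helper, not the actual training code in this repository), the finetuning loop would look something like the following, with the full GAN objective replaced by a simple L1 wave loss for brevity:

```python
# Hedged sketch: continue training the vocoder on spectrograms predicted by
# the (frozen) text-to-mel model instead of ground-truth spectrograms.
import torch

def finetune_vocoder_on_tts_output(tts_model, vocoder, dataset, steps=1000):
    """dataset yields (phoneme_ids, ground_truth_wave) pairs; names are illustrative."""
    tts_model.eval()  # the spectrogram generator stays frozen here
    opt = torch.optim.Adam(vocoder.parameters(), lr=1e-4)
    data = iter(dataset)
    for _ in range(steps):
        phoneme_ids, wave_gt = next(data)
        with torch.no_grad():
            mel_pred = tts_model(phoneme_ids)   # TTS output, not ground-truth mel
        wave_pred = vocoder(mel_pred)
        # The real setup would keep the GAN objective (discriminators, feature
        # matching, spectral losses); an L1 loss on the wave stands in for it
        # here, assuming wave_gt is at least as long as wave_pred.
        loss = torch.nn.functional.l1_loss(wave_pred, wave_gt[: wave_pred.shape[-1]])
        opt.zero_grad()
        loss.backward()
        opt.step()
```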

iamkhalidbashir commented 1 year ago

"finetuning the vocoder on a certain speaker does not yield much improvement"

I just tried this and didn't notice much improvement.

"however finetuning the vocoder on the spectrograms produced by the TTS rather than the spectrograms of ground truth speech can be very beneficial"

So that means I have to do something like this:

  1. Fine-tune a custom voice (small dataset) on the Meta model
  2. Use the fine-tuned text-to-mel model to generate spectrograms
  3. Fine-tune the vocoder with the spectrograms generated in step 2

Is this the right order?

Flux9665 commented 1 year ago

Yes, though having exactly matching spectrograms and waves for point 3 is not so simple. I started implementing this in an experimental branch. I was sick for a while, so the next version is delayed by a few weeks, but I hope I can get this functionality done relatively soon. I heard from colleagues that this end-to-end finetuning actually has a much bigger impact than I expected. The upcoming release will focus a lot on quality.

Ca-ressemble-a-du-fake commented 1 year ago

@Flux9665 Glad to hear you recovered and are fit again! Looking forward to testing the upcoming version when it is ready (for now I am experimenting a lot with Coqui YourTTS, so I will be able to better understand Toucan when it's ready). Keep up the good work 😊

Flux9665 commented 1 year ago

Today's release makes all of the experimental changes stable, which should bring some improvements in quality. The biggest change is that the PostNet part of the spectrogram synthesis is now a normalizing flow, as in the PortaSpeech architecture. The harmonics in the spectrogram definitely go up to much higher frequencies than they did before. Also, the vocoder now uses 24kHz instead of 48kHz, which produces far fewer artifacts. The variance of the prosody is still mostly pretty flat and there are some more things I want to try, so the next release might bring further improvements with respect to naturalness. But for now, all experimental changes are stable and new models are provided.
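For intuition about the PostNet change, here is a minimal sketch of the generic building block of a normalizing flow, a single affine coupling layer that invertibly refines a coarse spectrogram. This is only an illustration of the concept, not PortaSpeech's actual PostNet:

```python
# One affine coupling layer: half of the mel channels are transformed with a
# scale and shift predicted from the other half, so the mapping stays invertible.
import torch

class AffineCoupling(torch.nn.Module):
    def __init__(self, mel_dim=80, hidden=256):
        super().__init__()
        half = mel_dim // 2
        self.net = torch.nn.Sequential(
            torch.nn.Linear(half, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, half * 2),
        )

    def forward(self, mel):  # mel: (T, 80)
        a, b = mel.chunk(2, dim=-1)
        log_scale, shift = self.net(a).chunk(2, dim=-1)
        return torch.cat([a, b * torch.exp(log_scale) + shift], dim=-1)

    def inverse(self, z):    # exact inverse of forward
        a, b = z.chunk(2, dim=-1)
        log_scale, shift = self.net(a).chunk(2, dim=-1)
        return torch.cat([a, (b - shift) * torch.exp(-log_scale)], dim=-1)

coarse_mel = torch.randn(120, 80)        # 120 frames of a coarse spectrogram
refined = AffineCoupling()(coarse_mel)   # one invertible refinement step
```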