Closed · Ca-ressemble-a-du-fake closed this 1 year ago
I think the problem is mostly the vocoder model. FastSpeech produces a spectrogram, and the spectrogram is then transformed into a wave using a vocoder model called Avocodo; VITS produces a wave directly. The approach with individual modules that we use here is better suited for low-resource scenarios and offers more controllability, but the overall quality is a bit worse. My goal at the moment is to build very, very multilingual TTS, so I'm trading some quality for those other benefits. I plan to revisit the vocoder training / architecture and try to improve it, but for now I think your model is pretty close to as good as it can get with this architecture. I'm pretty busy at the moment with teaching because it's the middle of the semester, but once I have some more time I'll revisit the vocoder, try to scale up the parameter counts, and reduce the amount of upsampling it does to make the task easier (48kHz might be overkill; 24kHz would probably be enough).
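To make the contrast concrete, here is a minimal sketch of the two pipeline styles described above. All function and model names here are placeholders for illustration, not the actual IMS Toucan or VITS API:

```python
# Sketch of the two TTS pipeline styles (placeholder names, not a real API).

def two_stage_tts(text, acoustic_model, vocoder):
    """FastSpeech-style: text -> spectrogram -> wave (e.g. via Avocodo).

    The explicit spectrogram is what makes the system controllable and
    modular, but each stage contributes its own errors to the output."""
    spectrogram = acoustic_model(text)  # controllable intermediate
    return vocoder(spectrogram)

def end_to_end_tts(text, model):
    """VITS-style: text -> wave directly, no explicit intermediate."""
    return model(text)
```

The two-stage version lets you swap or retrain the vocoder independently, which is exactly what makes the vocoder-only improvements discussed here possible.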
Got it, thanks for taking the time to answer! Teaching first!
+1 for 24kHz instead of 48kHz as the final sample rate; for speech there shouldn't be an audible difference.
There might be a very slight audible difference, I think. For 24kHz the Nyquist frequency is at 12kHz, so there might be some aliasing in the audible range. For human speech recorded at 24kHz vs. 48kHz there might be a small but audible difference. Given all of the flaws of the vocoded speech, however, I think there will be fewer flaws with the reduced complexity, even though it lowers the theoretical upper bound for quality.
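The arithmetic behind the Nyquist argument is simple enough to spell out: a given sample rate can only represent frequencies up to half of itself.

```python
# Nyquist frequency: the highest frequency a sample rate can represent
# is half that rate.

def nyquist_hz(sample_rate_hz):
    return sample_rate_hz / 2

nyquist_hz(48_000)  # 24000.0 -- comfortably covers the ~20 kHz hearing range
nyquist_hz(24_000)  # 12000.0 -- content above 12 kHz cannot be represented
```

So dropping from 48kHz to 24kHz caps the representable band at 12kHz, which is the band where the slight audible difference mentioned above could come from.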
I tried this in a branch made for testing, where I'm also testing another TTS architecture that should in theory sound a bit better. It's highly unstable and full of bugs, and the TTS code itself isn't even complete, but it's slowly being worked on :)
Would fine-tuning the vocoder improve quality?
From what I have heard so far, finetuning the vocoder on a certain speaker does not yield much improvement; however, finetuning the vocoder on the spectrograms produced by the TTS rather than on the spectrograms of ground-truth speech can be very beneficial.
I haven't implemented such an end-to-end finetuning component yet, but I plan to add it at some point. Essentially just going from the text to the spectrogram, then further to the wave, and continuing training with the GAN setup. I'm not sure whether freezing the spectrogram generator or updating it fully end to end would be better.
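The end-to-end finetuning idea above can be sketched as a single training step. This is a toy illustration with placeholder callables, not the real IMS Toucan training code; the `freeze_tts` flag stands for the open question of whether to freeze the spectrogram generator:

```python
# Hypothetical end-to-end finetuning step (placeholder objects, not the
# actual training loop): text -> spectrogram -> wave, with the loss driven
# by a GAN-style discriminator against the ground-truth wave.

def finetune_step(text, wave_gt, tts, vocoder, discriminator, freeze_tts=True):
    spec_pred = tts(text)                      # text -> predicted spectrogram
    wave_pred = vocoder(spec_pred)             # predicted spectrogram -> wave
    loss = discriminator(wave_pred, wave_gt)   # adversarial / matching loss
    # Either only the vocoder learns (TTS frozen), or the whole stack does.
    updated_modules = ["vocoder"] if freeze_tts else ["tts", "vocoder"]
    return loss, updated_modules
```

The key point is that the vocoder now sees the TTS's own imperfect spectrograms during training, so it can learn to compensate for the TTS's systematic errors.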
> finetuning the vocoder on a certain speaker does not yield much improvement

I just tried this and didn't notice much improvement.

> however finetuning the vocoder on the spectrograms produced by the TTS rather than the spectrograms of ground truth speech can be very beneficial.
So it means I have to do something like this:
Is this the right order?
Yes. Having exactly matching spectrograms and waves for point 3 is not so simple, though. I started implementing this in an experimental branch. I was sick for a while, so the next version is delayed by a few weeks, but I hope I can get this functionality done relatively soon. I heard from colleagues that this end-to-end finetuning actually has a much bigger impact than I expected. The upcoming release will focus a lot on quality.
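To see why "exactly matching spectrograms and waves" is the hard part, here is a rough sketch (all names are hypothetical): the vocoder needs pairs of (TTS-predicted spectrogram, ground-truth wave), and the predicted spectrogram must be frame-aligned with the wave, e.g. by forcing the TTS to use the ground-truth durations instead of its own predicted ones.

```python
# Sketch of building an aligned finetuning pair (placeholder names).
# `hop_length` is the number of wave samples per spectrogram frame.

def make_finetuning_pair(text, wave_gt, durations_gt, tts, hop_length=256):
    # Force the ground-truth timing so the predicted spectrogram lines
    # up frame-for-frame with the ground-truth wave.
    spec_pred = tts(text, durations=durations_gt)
    expected_samples = len(spec_pred) * hop_length
    assert abs(len(wave_gt) - expected_samples) < hop_length, \
        "predicted spectrogram and ground-truth wave are not frame-aligned"
    return spec_pred, wave_gt
```

Without forcing the durations, the TTS would predict its own timing and the wave and spectrogram would drift apart, making the GAN losses meaningless.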
@Flux9665 Glad to hear you recovered and are fit again! Looking forward to testing the upcoming version when it is ready (in the meantime I am experimenting a lot with Coqui YourTTS, so I will be able to understand Toucan better when it's ready). Keep up the good work 😊
Today's release makes all of the experimental changes stable, which should bring some improvements in quality. The biggest change is that the PostNet part of the spectrogram synthesis is now a normalizing flow, as in the PortaSpeech architecture. The harmonics in the spectrogram definitely go up to much higher frequencies than they did before. Also, the vocoder now uses 24kHz instead of 48kHz, which produces far fewer artifacts. The variance of the prosody is mostly still pretty flat and there are some more things I want to try, so the next release might bring further improvements in naturalness. But for now, all experimental changes are stable and new models are provided.
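The "PostNet as a normalizing flow" change can be pictured with a single affine coupling layer. This is a toy sketch, not the PortaSpeech implementation: half of the channels pass through unchanged, the other half gets an affine transform, and the whole mapping stays exactly invertible (in a real coupling layer, the scale and shift would be predicted from the untouched half by a small network; here they are constants for brevity).

```python
import math

# Toy affine coupling layer, the basic building block of a normalizing
# flow: xa passes through unchanged, xb is scaled and shifted, and the
# transform can be undone exactly.

def coupling_forward(xa, xb, log_scale, shift):
    return xa, [v * math.exp(log_scale) + shift for v in xb]

def coupling_inverse(ya, yb, log_scale, shift):
    # Exact inverse of coupling_forward.
    return ya, [(v - shift) * math.exp(-log_scale) for v in yb]
```

Invertibility is what lets a flow-based PostNet be trained by maximum likelihood while still being usable as a simple refinement step at inference time.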
Hi,
I have been playing around with Toucan TTS for some time; it is really easy to use and training is fast. I finetuned the provided Meta pretrained model with an 8-hour dataset, and the result is not as good as I was expecting. So I wonder if I could make it even better, or if you could help me spot where the "problem" lies in the generated audio:
Here are the waveforms (top: Coqui VITS trained from scratch for 260k steps; bottom: Toucan FastSpeech2 trained for 200k steps from the Meta model):
The associated spectrograms:
And the audios:
This is from Coqui VITS; I find it crystal clear: https://user-images.githubusercontent.com/91517923/202889766-0c2ad9ad-2ec2-4376-9abc-17a008e58364.mp4
This is from FastSpeech2. It sounds like an old tape recording; the voice is sort of quavering (I don't know if that's the right term!) https://user-images.githubusercontent.com/91517923/202889734-3a02486d-3785-4e83-8365-614c6ac0f64f.mp4
Both generated audios have been compressed to mp4 to be able to post them, but they are pretty close to what the wav files sound like (to my ear there is no difference).
So how can I make the Toucan FastSpeech2 model sound better? Should I train it for more steps, or is it on the contrary over-trained / over-fitted? Or would the only way be to implement VITS in Toucan (I don't think that is straightforward to do)?
Thank you in advance for helping me improve the results!