DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

Why is BigVGAN better than Avocodo? #143

Closed by Ca-ressemble-a-du-fake 3 months ago

Ca-ressemble-a-du-fake commented 1 year ago

Hi,

I fine-tuned the Toucan Meta model on 1k, on a reduced dataset, to understand the difference between Avocodo and BigVGAN.

Here are the spectrograms:

image

Apart from the 12 kHz area, which is a bit larger for the BigVGAN version, I can barely see any differences (I may not be looking in the right place). Where are the improvements?

By the way, why is there a dark strip around 12 kHz?

Thanks in advance for your explanations!

Flux9665 commented 1 year ago

You cannot see the difference in the spectrogram because a spectrogram does not contain the phase information, which is what the vocoder tries to reconstruct. Since the input to the vocoder is already a spectrogram, the spectrogram of the output will ideally just be the same as the input again.
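To make the point about missing phase concrete, here is a minimal sketch in plain Python. It is a single-frame illustration with a naive DFT, not a full spectrogram pipeline and not code from IMS-Toucan; the test signals and all names (`dft`, `magnitudes`, `phase_shifted`) are made up for the example. Two waveforms that differ only in the phase of one component have the same magnitude spectrum even though the samples themselves differ.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (fine for short illustrative signals)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def magnitudes(signal):
    """Magnitude spectrum: this is the part a spectrogram keeps per frame."""
    return [abs(c) for c in dft(signal)]


n = 64  # hypothetical frame length

# Two sinusoidal components at bins 3 and 7.
original = [
    math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 7 * t / n)
    for t in range(n)
]

# Same components, but the second one starts at a different phase.
phase_shifted = [
    math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 7 * t / n + 1.3)
    for t in range(n)
]

# The waveforms are clearly different sample by sample ...
max_waveform_diff = max(abs(a - b) for a, b in zip(original, phase_shifted))

# ... but their magnitude spectra agree up to floating point error.
max_magnitude_diff = max(
    abs(a - b) for a, b in zip(magnitudes(original), magnitudes(phase_shifted))
)

print(f"max waveform difference:  {max_waveform_diff:.3f}")
print(f"max magnitude difference: {max_magnitude_diff:.2e}")
```

A magnitude-only plot therefore cannot show how well a vocoder got the phase right, which is why listening or a waveform-level comparison tends to be more informative than comparing spectrogram images.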

The improvement lies in the generator of BigVGAN, which has mechanisms built in to avoid aliasing during the upsampling process.

The area at 12 kHz is due to the signal being sampled at 24 kHz. Anything above the Nyquist frequency (half the sampling rate, i.e. 12 kHz in this case) in a spectrogram is due to a problem called imaging, which is something we want to avoid.
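The imaging effect can be reproduced on a toy signal. The sketch below is an illustration in plain Python under stated assumptions, not BigVGAN's or Avocodo's actual upsampling code: it upsamples a single tone by a factor of 2 through zero insertion, which creates a mirror image of the tone above the old Nyquist frequency, and then applies linear interpolation as a very crude low-pass filter to show the image shrinking. The frame length, the tone bin, and all names are hypothetical.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the non-negative-frequency DFT bins (0 .. n/2)."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


n = 64        # hypothetical number of samples at the low sampling rate
tone_bin = 5  # the tone sits well below the low-rate Nyquist bin (n / 2)
low_rate = [math.sin(2 * math.pi * tone_bin * t / n) for t in range(n)]

# Naive 2x upsampling: insert a zero after every sample.
zero_stuffed = []
for sample in low_rate:
    zero_stuffed.extend([sample, 0.0])

# Linear interpolation, written as a circular convolution with [0.5, 1, 0.5].
# This acts as a (weak) low-pass filter on the zero-stuffed signal.
m = len(zero_stuffed)
interpolated = [
    0.5 * zero_stuffed[(i - 1) % m] + zero_stuffed[i] + 0.5 * zero_stuffed[(i + 1) % m]
    for i in range(m)
]

# In the 2x signal, the old Nyquist frequency lands on bin n / 2,
# and the image of the tone appears mirrored around it at bin n - tone_bin.
old_nyquist_bin = n // 2
image_bin = n - tone_bin

naive = dft_magnitudes(zero_stuffed)
filtered = dft_magnitudes(interpolated)

print(f"zero insertion: tone {naive[tone_bin]:.2f}, image {naive[image_bin]:.2f}")
print(f"interpolated:   tone {filtered[tone_bin]:.2f}, image {filtered[image_bin]:.2f}")
```

With zero insertion alone the image is as strong as the tone itself. The interpolation filter suppresses most of it but not all, which is the general reason upsampling stages need a proper low-pass filter if content above the original Nyquist frequency is to be avoided.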

Ca-ressemble-a-du-fake commented 1 year ago

Thanks for your explanations. I tried simply filtering out everything above 12 kHz, but it did not sound better.