COLA == Training Instability?

gemelo-ai / vocos

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

https://gemelo-ai.github.io/vocos/

MIT License


COLA == Training Instability? #51

Status: Open. zaptrem opened this issue 5 months ago.

zaptrem commented 5 months ago

I'm training a Vocos decoder for my DAC autoencoder. When I set hop length = 256 and n_fft = 1024 in the iSTFT head, the discriminators quickly win (within 1,000 steps). This doesn't happen when I set n_fft to 512, 768, or 1026. Do you know why this happens, and would using n_fft = 1026 affect quality? I don't fully understand the COLA property.
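A window w with hop R satisfies COLA (constant overlap-add) when the shifted copies sum to the same constant at every sample n, i.e. sum over m of w[n - m*R] does not depend on n. The sketch below checks this numerically for the configurations in the question. It assumes the iSTFT head uses a periodic Hann window whose length equals n_fft (periodic is the default of `torch.hann_window`). The helpers `hann_periodic`, `ola_envelope` and `is_cola` are hypothetical illustrations, not part of Vocos or PyTorch. Setting `power=2` checks the summed squared window instead, which is the envelope `torch.istft` divides by; its NOLA requirement only asks that this sum be nonzero, not constant.

```python
import math


def hann_periodic(n: int) -> list[float]:
    """Periodic Hann window: w[j] = 0.5 - 0.5*cos(2*pi*j/n), j = 0..n-1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * j / n) for j in range(n)]


def ola_envelope(window: list[float], hop: int, power: int = 1) -> list[float]:
    """Steady-state overlap-add of window**power shifted by multiples of hop.

    Away from the signal edges, sum_m w[n - m*hop] depends only on n mod hop,
    so one value per phase r = 0..hop-1 describes the whole envelope.
    """
    env = [0.0] * hop
    for j, w in enumerate(window):
        env[j % hop] += w ** power  # sample j lands on phase j mod hop
    return env


def is_cola(window: list[float], hop: int, power: int = 1,
            rtol: float = 1e-9) -> bool:
    """True if the overlap-add envelope is constant up to a relative tolerance."""
    env = ola_envelope(window, hop, power)
    mean = sum(env) / len(env)
    return (max(env) - min(env)) <= rtol * abs(mean)


# Compare the n_fft values from the question at hop length 256.
for n_fft in (512, 768, 1024, 1026):
    w = hann_periodic(n_fft)
    print(n_fft, is_cola(w, 256), is_cola(w, 256, power=2))
```

With hop 256, the plain check passes for n_fft = 512, 768 and 1024, because a periodic Hann window is COLA whenever the hop equals n_fft/k for an integer k >= 2, and it fails for 1026, which is not a multiple of the hop. The squared window needs k >= 3, so 512 fails that stricter check while its envelope stays positive.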