I'm training a Vocos decoder for my DAC autoencoder. When I set hop length = 256 and n_fft = 1024 in the iSTFT head the discriminators quickly win within 1000 steps. However, this doesn't happen when I set n_fft = 512, 768, or 1026. Do you know why this is happening and whether using 1026 would affect quality? I don't completely understand the COLA property.
I'm training a Vocos decoder for my DAC autoencoder. When I set hop length = 256 and n_fft = 1024 in the iSTFT head the discriminators quickly win within 1000 steps. However, this doesn't happen when I set n_fft = 512, 768, or 1026. Do you know why this is happening and whether using 1026 would affect quality? I don't completely understand the COLA property.