auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

AutoVC on large-scale data? #26

Open iyah4888 opened 4 years ago

iyah4888 commented 4 years ago

Hi @auspicious3000, thanks for sharing your research code. I've spent a lot of time getting the training code to work (mostly because of input hyperparameter issues that others are also struggling with). I'm currently working with the VoxCeleb2 dataset (nearly 6000 speakers, about 1M utterances). I cannot get the model to train with MSE loss, but with L1 loss I can get the following auto-encoding reconstruction.

[Original] (image)
[Voice converted with another speaker embedding] (image)

The problem is that while the network learns auto-encoding, at test time it does not generalize to voice conversion. It just auto-encodes and does nothing else. The pair above is a voice conversion example, yet the fundamental frequency in the two mel-spectrograms looks very similar.

Could you share your experience or any comments? I'd appreciate it.

auspicious3000 commented 4 years ago

For a different dataset, you need to retune the bottleneck. Also, feel free to try different encoder and decoder architectures; the paper proposes a framework rather than specific architectures. VoxCeleb2 is not very clean. For example, if the channel effects and background noise differ across recordings, you need to disentangle them by conditioning on that information. Otherwise, the model will not learn the disentangled representations needed for conversion. I suggest you start with a clean dataset such as VCTK.
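
To make "retune the bottleneck" a bit more concrete, here is a minimal sketch of a temporal bottleneck: keep one frame out of every `freq`, then repeat each kept frame to restore the original length. This is a simplified illustration in plain Python, not the repository's implementation; `downsample`, `upsample`, and the toy `frames` are hypothetical names and data chosen for the example. The other knob, the channel dimension of each code, is not shown here.

```python
def downsample(frames, freq):
    """Keep one frame out of every `freq` frames (the temporal bottleneck)."""
    return frames[::freq]


def upsample(codes, freq, length):
    """Repeat each kept frame `freq` times and trim to the original length."""
    repeated = [code for code in codes for _ in range(freq)]
    return repeated[:length]


# Toy input: 8 frames with 2 feature dimensions each (made-up values).
frames = [[float(t), float(t) * 0.5] for t in range(8)]

codes = downsample(frames, freq=4)
restored = upsample(codes, freq=4, length=len(frames))

print(len(frames), "frames ->", len(codes), "codes ->", len(restored), "frames")
```

A larger `freq` keeps fewer codes and so passes less information through the bottleneck; a smaller `freq` passes more, which can make plain auto-encoding easier to learn than conversion.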

light1726 commented 4 years ago

Thanks for sharing. From my experience, the temporal resolution of the bottleneck feature (determined by the mel-spectrogram hop length and the down-sampling frequency) seems to be important for the encoder to disentangle. When I extracted mel-spectrograms with a hop length of 250, a down-sampling frequency of 32 gave better conversion performance than a down-sampling frequency of 16. Currently, I extract mel-spectrograms with a hop length of 200 and have increased the down-sampling frequency to 40, but the conversion performance is still worse than with a hop length of 250 and a frequency of 32.
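
For reference, the amount of audio covered by one bottleneck code can be worked out from the hop length and the down-sampling frequency. The sketch below does that arithmetic for the three settings mentioned above. The 16 kHz sample rate is an assumption (it is not stated in this thread), and `bottleneck_span` is a hypothetical helper name.

```python
def bottleneck_span(hop_length, freq, sample_rate=16000):
    """Return (samples, seconds) of audio covered by one bottleneck code.

    sample_rate defaults to 16 kHz, which is an assumption for illustration.
    """
    samples = hop_length * freq
    return samples, samples / sample_rate


for hop, freq in [(250, 16), (250, 32), (200, 40)]:
    samples, seconds = bottleneck_span(hop, freq)
    print(f"hop={hop} freq={freq}: {samples} samples ({seconds:.2f} s) per code")
```

Under that assumption, hop 250 with freq 32 and hop 200 with freq 40 both cover 8000 samples per code, so the span per code alone would not account for the difference between those two settings.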