auspicious3000 / SpeechSplit

Unsupervised Speech Decomposition Via Triple Information Bottleneck
http://arxiv.org/abs/2004.11284
MIT License
636 stars 92 forks

How to fix the vibrato result? #46

Closed CYT823 closed 3 years ago

CYT823 commented 3 years ago

Hi everyone,

I was trying to train my own Generator model; however, I found that the result always carries vibrato.

Datasets: VCTK + LibriSpeech clean-100 + LibriSpeech clean-360 (with no data augmentation). Instead of a one-hot speaker ID, I used a speaker embedding. The validation loss is 47.18.

Here is my result. The intonation and naturalness sound okay, but the voice sounds like a man/woman speaking in front of a fan, with the microphone three steps away from the speaker.

Could anyone give me some advice or suggestions that might fix this kind of issue? Should I change the datasets, or is data augmentation all I need? Thanks in advance.

yenebeb commented 3 years ago

Heya,

This looks like a problem with the bottleneck dimensions (dim necks) of the encoders. The model is quite sensitive to them. From the paper: "Conversely, if the converted speech is of very poor quality, it implies that both the rhythm code and the content code are too narrow. Try increasing them simultaneously."

You should look up the current bottleneck dimensions for the encoders in hparams. If I'm not wrong, they are:

- `dim_neck` for the content encoder
- `dim_neck_2` for the rhythm encoder
- `dim_neck_3` for the pitch encoder
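To make the suggestion concrete, here is a minimal sketch of the kind of change to make. The attribute names (`dim_neck`, `dim_neck_2`, `dim_neck_3`) come from the repo's hparams; the numeric values and the `widen` helper are illustrative, not the repo's defaults or API:

```python
# Sketch: widening the content and rhythm bottlenecks together,
# as the paper's tuning advice (section B.4) suggests for poor quality.
# NOTE: the numbers below are placeholders; check your own hparams file.

class HParams:
    """Minimal stand-in for the repo's hyperparameter container."""
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

hparams = HParams(
    dim_neck=8,     # content bottleneck (illustrative value)
    dim_neck_2=1,   # rhythm bottleneck (illustrative value)
    dim_neck_3=32,  # pitch bottleneck (illustrative value)
)

def widen_content_and_rhythm(hp, factor=2):
    """Increase the content and rhythm bottlenecks simultaneously.
    The pitch bottleneck is left alone, since the quoted advice
    targets the rhythm and content codes specifically."""
    hp.dim_neck *= factor
    hp.dim_neck_2 *= factor
    return hp

widen_content_and_rhythm(hparams)
print(hparams.dim_neck, hparams.dim_neck_2, hparams.dim_neck_3)  # 16 2 32
```

After a change like this you would retrain and re-check the output samples; the bottlenecks trade off disentanglement against reconstruction quality, so widen in small steps.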

During training you should be able to find spectrogram outputs in your run/samples folder. It's pretty important to check these and compare them with figure 8 of the paper. There's also a section in the paper on how to tune the bottleneck dimensions (B.4); I'd highly recommend you check that out as well :)!
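For the comparison step, something like the sketch below can put the ground-truth and generated spectrograms side by side, similar to the paper's figure 8. It assumes the samples are available as NumPy arrays of shape (frames, mel_bins); the paths, the `compare_spectrograms` name, and the loading format are assumptions you'd adjust to whatever your run actually writes out:

```python
# Hedged sketch: plot a ground-truth mel-spectrogram above a generated one
# so artifacts (e.g. vibrato-like ripples in the harmonics) are easy to spot.
import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt
import numpy as np

def compare_spectrograms(gt_path, gen_path, out_path="comparison.png"):
    """Load two (frames, mel_bins) arrays and save a stacked plot."""
    ground_truth = np.load(gt_path)
    generated = np.load(gen_path)
    fig, axes = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
    for ax, mel, title in zip(axes,
                              (ground_truth, generated),
                              ("ground truth", "generated")):
        # transpose so time runs along x and mel bins along y
        ax.imshow(mel.T, origin="lower", aspect="auto")
        ax.set_title(title)
        ax.set_ylabel("mel bin")
    axes[-1].set_xlabel("frame")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
    return out_path
```

Regular, periodic wobble in the harmonic bands of the generated plot (absent from the ground truth) is the visual signature of the vibrato artifact described above.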

CYT823 commented 3 years ago

Thanks for your recommendation @yenebeb :)