auspicious3000 / SpeechSplit

Unsupervised Speech Decomposition Via Triple Information Bottleneck
http://arxiv.org/abs/2004.11284
MIT License
636 stars 92 forks

Tuning bottlenecks according to Appendix B.4 #40

Closed vishal16babu closed 3 years ago

vishal16babu commented 3 years ago

Although the tuning process described is very intuitive, there seems to be no theoretical guarantee that the same bottleneck sizes will work for all speakers. Deciding the bottleneck sizes directly from the speech itself (without going through the manual tuning process) seems like a research problem in its own right.

Practically speaking, though, a single set of bottleneck sizes might work well for most cases. Is that true of the sizes used in this repo? Has anyone tried the same sizes on a different dataset? Since training takes a long time, repeating the tuning process for every new speaker or dataset could make the approach very impractical.

@auspicious3000 any insights or help is very much appreciated

auspicious3000 commented 3 years ago

The bottleneck sizes provided in the paper are a good starting point. Training usually takes less than 24 hours. As a research project, our main purpose is to make sizable progress towards unsupervised speaking style transfer, provide insights, and hopefully inspire other researchers in this area.
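For readers landing here: the Appendix B.4 procedure amounts to shrinking each bottleneck until the converted speech just stops leaking the attribute you want removed. A minimal sketch of that search is below. The names (`capacity`, `tune_bottleneck`, the candidate `(dim_neck, freq)` pairs, and the pass/fail callable) are illustrative assumptions, not the repo's actual API; in practice the check is a full train-and-listen cycle.

```python
# Hypothetical sketch of the Appendix B.4 tuning loop (not the repo's code):
# try candidate bottleneck configurations in order of increasing capacity
# and keep the smallest one that still converts acceptably.

def capacity(dim_neck, freq, seq_len=128):
    """Rough per-utterance code capacity: channels times time steps kept
    after downsampling by `freq`. `seq_len` is an assumed frame count."""
    return dim_neck * (seq_len // freq)

def tune_bottleneck(candidates, passes_conversion_test):
    """Return the smallest-capacity (dim_neck, freq) pair that passes.

    candidates: list of (dim_neck, freq) pairs to sweep.
    passes_conversion_test: callable standing in for the expensive
    train-then-listen judgment described in Appendix B.4.
    """
    ordered = sorted(candidates, key=lambda c: capacity(*c))
    for dim_neck, freq in ordered:
        if passes_conversion_test(dim_neck, freq):
            return dim_neck, freq
    return None  # every candidate was too tight to reconstruct well

if __name__ == "__main__":
    # Stand-in check: pretend any configuration with capacity >= 128
    # reconstructs acceptably. Real use replaces this with training runs.
    ok = lambda d, f: capacity(d, f) >= 128
    print(tune_bottleneck([(32, 8), (8, 8), (1, 8), (8, 16)], ok))  # (8, 8)
```

The point of sweeping from smallest capacity upward is that the tightest passing bottleneck is the one that disentangles: anything larger risks leaking the source attribute, anything smaller degrades reconstruction.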

vishal16babu commented 3 years ago

Thanks @auspicious3000 , I will give it a try.