Open PiotrDabkowski opened 2 years ago
Thanks for sharing your experience!
I'm also suffering from poor voice conversion results and trying to figure out why. The results sound quite reasonable, but the quality is not on par with the authors'. I'll share some samples here that I regarded as bad.
For me the results in the paper are a bit weird, tbh. I was able to get a high-quality restoration of the speaker's identity based on the perturbed layer-12 features alone, which means there is still significant identity leakage through the "Linguistic" layer.
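A quick way to quantify this kind of leakage is a speaker-classification probe on the features: if even a trivial classifier beats chance by a wide margin, identity is leaking through. Below is a minimal nearest-centroid sketch in NumPy; the function name and the per-utterance mean-pooling assumption are mine, not anything from NANSY or the paper.

```python
import numpy as np

def speaker_probe_accuracy(train_feats, train_spk, test_feats, test_spk):
    """Nearest-centroid speaker probe.

    train_feats/test_feats: (num_utterances, feat_dim) arrays, e.g.
    perturbed layer-12 outputs mean-pooled per utterance (assumption).
    High accuracy on features that should be speaker-independent
    indicates identity leakage.
    """
    speakers = np.unique(train_spk)
    # one centroid per speaker in feature space
    centroids = np.stack(
        [train_feats[train_spk == s].mean(axis=0) for s in speakers]
    )
    # assign each test utterance to the nearest centroid
    dists = np.linalg.norm(
        test_feats[:, None, :] - centroids[None, :, :], axis=-1
    )
    preds = speakers[np.argmin(dists, axis=1)]
    return float((preds == test_spk).mean())
```

Compare the result against chance (1 / number of speakers); anything far above chance on the "Linguistic" features suggests the perturbation isn't removing identity.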
One similar but distinct issue I noticed was speaker identity leakage through the pitch feature. I think it might come from insufficient perturbation, but I'm not sure, since my implementation differs slightly from the paper.
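One common mitigation (not necessarily what the paper does) is to normalize log-F0 per utterance or per speaker, so the pitch feature carries only the relative contour and not the speaker's absolute register. A sketch, assuming the usual convention that unvoiced frames are marked with F0 = 0:

```python
import numpy as np

def normalize_log_f0(f0, eps=1e-8):
    """Zero-mean, unit-variance log-F0 over voiced frames.

    f0: per-frame F0 in Hz, with 0.0 marking unvoiced frames.
    Returns the normalized contour; unvoiced frames stay 0.0.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    out = np.zeros_like(f0)
    if voiced.any():
        log_f0 = np.log(f0[voiced])
        # strip the speaker's mean register and range
        out[voiced] = (log_f0 - log_f0.mean()) / (log_f0.std() + eps)
    return out
```

Per-speaker statistics (rather than per-utterance) are a design choice worth trying if utterances are short; short clips give noisy mean/std estimates.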
For Yingram I also used the FFT approach, but it would be better to write a custom kernel instead; otherwise the windowing is a bit broken...
Hi, first thanks for the great implementation!
Compared to the results you shared in sample.zip, have you been able to improve the synthesis quality after fixing the issue in https://github.com/dhchoi99/NANSY/issues/3?
@JeromeNi Fixing that issue didn't give much improvement in voice conversion quality. Training for longer than the paper specifies helped improve quality, but it is still lower than the paper's.
Hey, really nice work!
I also have my own private NANSY implementation. It seems to work, at least the reconstruction is solid, but the voice conversion results were pretty poor, worse than the original paper's samples (not sure whether those were cherry-picked). The speaker similarity was not that good, and I achieved better results with a different method.
Do you have some Voice Conversion samples?