Closed by Coice 2 years ago
@coice
Which vocoder do you feel has the best overall quality (ignoring inference speed)?
If you do not care about inference speed then WaveGrad.
Does adding a speaker embedding improve overall synthesis quality when using a multi-speaker model?
I didn't find any clear improvement from adding a speaker embedding to the vocoder, but many of my speakers have very little data, so you may get different results. (The UMAP projection showed that speakers clustered by the microphone/recording environment used, so the vocoder is definitely using the embedding to at least some extent.)
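For readers unfamiliar with the idea, one common way to condition a vocoder on a speaker is to repeat a fixed-size speaker embedding across every mel frame and concatenate it to the mel channels before the first layer. The sketch below illustrates only that step in plain Python. The function name `condition_mel_on_speaker`, the shapes, and the concatenation scheme are assumptions for illustration, not necessarily the setup described in this reply.

```python
from typing import List

Matrix = List[List[float]]  # layout: [channels][frames]


def condition_mel_on_speaker(mel: Matrix, speaker_embedding: List[float]) -> Matrix:
    """Append a time-broadcast speaker embedding to a mel spectrogram.

    mel: n_mels rows, each holding n_frames values.
    speaker_embedding: one vector per utterance (hypothetical size).
    Returns n_mels + len(speaker_embedding) rows of n_frames values.
    """
    if not mel or not mel[0]:
        raise ValueError("mel must contain at least one channel and one frame")
    n_frames = len(mel[0])
    if any(len(row) != n_frames for row in mel):
        raise ValueError("all mel channels must have the same number of frames")

    # Copy the mel rows so the caller's data is not mutated.
    conditioned = [list(row) for row in mel]
    # Each embedding dimension becomes a channel that is constant over time.
    for value in speaker_embedding:
        conditioned.append([value] * n_frames)
    return conditioned


mel = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy data: 2 mel bins, 3 frames
speaker = [1.0, -1.0]                      # toy 2-dimensional embedding
out = condition_mel_on_speaker(mel, speaker)
print(len(out), len(out[0]))
```

In a real model the concatenated tensor would feed the vocoder's input convolution, whose input channel count has to grow by the embedding size.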
@CookiePPP
Thanks for responding!
I have personally been using MelGAN/HiFi-GAN in most of my experiments, but the quality is still much lower than desired (evaluated using teacher-forced mels). I will try WaveGrad and compare.
Have you tried Fre-GAN? They report near ground truth quality. My Fre-GAN results on mels from the fine-tuned model sounded metallic, but on real mels the quality was great, better I would say than HiFi-GAN. I might revisit that as well and double-check my params.
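A gap like this (clean on real mels, metallic on model mels) is often attributed to a mismatch between the mels the vocoder was trained on and the ones it receives at inference. As a rough diagnostic, and purely as an assumption rather than something measured in this thread, the mean absolute difference between a ground-truth mel and the model-predicted mel for the same utterance can be tracked per mel bin. `mel_l1_per_bin` is a hypothetical helper name, and both inputs are assumed to be time-aligned with the same shape.

```python
from typing import List


def mel_l1_per_bin(real: List[List[float]], predicted: List[List[float]]) -> List[float]:
    """Mean absolute difference per mel bin between two [n_mels][n_frames] mels."""
    if len(real) != len(predicted):
        raise ValueError("both mels must have the same number of bins")
    result = []
    for real_row, pred_row in zip(real, predicted):
        if not real_row or len(real_row) != len(pred_row):
            raise ValueError("each bin needs the same non-zero number of frames")
        # Average the absolute error over time for this mel bin.
        total = sum(abs(r - p) for r, p in zip(real_row, pred_row))
        result.append(total / len(real_row))
    return result


real_mel = [[0.0, 1.0], [2.0, 2.0]]       # toy ground-truth mel: 2 bins, 2 frames
predicted_mel = [[0.5, 0.5], [2.0, 1.0]]  # toy model-predicted mel
per_bin_error = mel_l1_per_bin(real_mel, predicted_mel)
print(per_bin_error)
```

Bins with a consistently large error show where the predicted mels depart most from real ones.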
Have you tried Fre-GAN?
No.
better I would say than HiFi-GAN
My best HiFi-GAN was almost indistinguishable from ground truth so I haven't spent a lot of GPU time looking into alternatives.
@CookiePPP
Do you happen to have any audio samples you can share of your highest quality synthesis from your TTS engine?
Also do you know of any groups, discord, etc, for discussing this subject?
Again, thanks for your time!
Do you happen to have any audio samples you can share of your highest quality synthesis from your TTS engine?
Sorry, no. After 14 months I don't remember the exact location of the audio samples I referenced.
Also do you know of any groups, discord, etc, for discussing this subject?
I know plenty of discords, but none where developers/people-that-can-write-code make up the majority of people. You can probably find better discords using Google haha. I don't really go looking for discords unless they have interesting people running them.
Fair enough, thank you for your time!
Hello!
You seem to have done quite a bit of vocoder comparisons. I have two questions based on your own personal experience.
Which vocoder do you feel has the best overall quality (ignoring inference speed) when fine-tuned on mels (such as those from Tacotron2)?
Does adding a speaker embedding improve overall synthesis quality when using a multi-speaker model?
Thank you for your time!