Open kurianbenoy-sentient opened 9 months ago
Benchmarking different TTS models is challenging, since we don't really have an objective metric for audio quality. What we can do is provide samples so people can compare the different models. But I'm not sure whether that is what you are looking for?
On my RTX 4090 I did some basic tests in terms of memory usage, and the quality was about the same as Bark, so maybe that'll help a little.
https://github.com/collabora/WhisperSpeech/issues/68#issuecomment-1917828974
If memory serves, the "tiny" WhisperSpeech model was a little faster than even the smallest Bark model, but overall they were very comparable in terms of quality of recognizing words to speak and the voices themselves are so close to Bark it's hard to distinguish. So if WhisperSpeech continues to progress like I think it will, I see it surpassing Bark and the other options out there.
Also, I tested Coqui and a few others over the weekend and all of their voices are inferior so...I see the best open source ones being Bark and WhisperSpeech. Not referring to proprietary ones of course (Hey Siri!).
I tried multiple models and voices with this and none produced as high quality as Bark or WhisperSpeech, but many were much, much, much faster...but again, you'll get an electronic-sounding, computer-sounding...etc. voice.
I was looking at two aspects mainly:
It looks like the only way is to compare voice samples and then judge which one is better. Even the standard metric for TTS, MOS (mean opinion score), is a score that humans assign based on audio quality.
Manual listening tests with MOS seem to be the only reliable metric right now. It could be an interesting community project to make a leaderboard for TTS models with crowdsourced scoring.
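To make the leaderboard idea a bit more concrete, here is a minimal sketch of how crowdsourced ratings could be aggregated into a MOS per model. Everything in it is an assumption for illustration: the `mos_table` function name, the `(model_name, score)` input shape, the 1 to 5 rating scale, and the normal-approximation 95% interval. It is not taken from WhisperSpeech or any existing leaderboard.

```python
import math
import statistics
from collections import defaultdict


def mos_table(ratings):
    """Aggregate listener ratings into a mean opinion score per model.

    ratings: iterable of (model_name, score) pairs, score on a 1-5 scale
             (hypothetical input format, chosen for this sketch).
    Returns a dict: model_name -> (mos, ci95_half_width, n_ratings).
    """
    by_model = defaultdict(list)
    for model, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range for {model}: {score}")
        by_model[model].append(score)

    table = {}
    for model, scores in by_model.items():
        n = len(scores)
        mos = statistics.fmean(scores)
        # Normal-approximation 95% interval; undefined with a single rating.
        if n > 1:
            half_width = 1.96 * statistics.stdev(scores) / math.sqrt(n)
        else:
            half_width = float("nan")
        table[model] = (mos, half_width, n)
    return table


if __name__ == "__main__":
    # Made-up ratings, only to show the call shape.
    demo = [("model_a", 4), ("model_a", 5), ("model_a", 3),
            ("model_b", 2), ("model_b", 2)]
    for name, (mos, half_width, n) in sorted(mos_table(demo).items()):
        print(f"{name}: MOS {mos:.2f} +/- {half_width:.2f} (n={n})")
```

Reporting the interval next to the mean matters here: with only a handful of listeners per model, two MOS values that look different can easily overlap.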
Yeah, and it'd be hard though because audio is much more subjective...
The voice cloning seems subjective to a certain extent, but I suppose you could try to prove it by examining the spectrograms or wavegrams of the audio to see if they have similar structures...But still I think it's partially subjective.
Best approach IMHO: run a simple survey asking people about the voice cloning aspect and about the quality of the non-cloned voices. Speed should be measurable as long as apples-to-apples comparisons are done (e.g. using the same beam size, quantization level, etc.).
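For the speed part, one common way to get comparable numbers is the real-time factor (seconds of compute per second of generated audio). Below is a rough sketch of a timing harness. The `synthesize` callable is hypothetical: it stands in for whatever wrapper you write around each model, and it is assumed to return the duration of the generated audio in seconds and to block until the audio is actually ready (on a GPU that usually means synchronizing before returning).

```python
import statistics
import time


def real_time_factor(synthesize, text, runs=5, warmup=1):
    """Median real-time factor (compute seconds / audio seconds) for one model.

    synthesize: hypothetical callable, text -> audio duration in seconds.
                It must block until the audio is fully generated.
    Values below 1.0 mean faster than real time.
    """
    # Warm-up calls absorb one-off costs (model load, caches, compilation).
    for _ in range(warmup):
        synthesize(text)

    factors = []
    for _ in range(runs):
        start = time.perf_counter()
        audio_seconds = synthesize(text)
        elapsed = time.perf_counter() - start
        if audio_seconds <= 0:
            raise ValueError("synthesize() must return a positive duration")
        factors.append(elapsed / audio_seconds)
    # Median is less sensitive to a single slow run than the mean.
    return statistics.median(factors)


if __name__ == "__main__":
    def fake_model(text):
        # Stand-in for a real TTS call: pretends to produce 1 s of audio.
        time.sleep(0.01)
        return 1.0

    print(f"RTF: {real_time_factor(fake_model, 'Hello world'):.3f}")
```

To keep the comparison fair, pass the same text to every model and hold settings such as beam size and quantization level fixed across runs, as suggested above.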
Congratulations on the new fast-small t2s model. Here are the updated benchmarks!
I was trying to understand the performance of WhisperSpeech in TTS and voice cloning.
Are there any results available, such as benchmarks or a paper, that compare the performance of the WhisperSpeech project with other projects like OpenVoice and Spear-TTS?
Thanks for creating this awesome library. I really liked this project compared to OpenVoice in my initial analysis :)