What is the performance of WhisperSpeech?

kurianbenoy-sentient commented 9 months ago

I was trying to understand the performance of WhisperSpeech in TTS and Voice cloning.

Are there any results available, such as benchmarks or a paper, that compare the performance of WhisperSpeech with other projects like OpenVoice and Spear-TTS?

Thanks for creating this awesome library. I really liked this project compared to OpenVoice in my initial analysis :)

zoq commented 9 months ago

Benchmarking different TTS models is challenging, since we don't really have a metric to measure audio quality. What we can do is provide samples to compare the different models, but I'm not sure this is what you are looking for?

BBC-Esq commented 9 months ago

On my RTX 4090 I did some basic tests in terms of memory usage, and the quality was about the same as Bark, so maybe that'll help a little.

https://github.com/collabora/WhisperSpeech/issues/68#issuecomment-1917828974

If memory serves, the "tiny" WhisperSpeech model was a little faster than even the smallest Bark model, but overall they were very comparable in terms of quality of recognizing words to speak and the voices themselves are so close to Bark it's hard to distinguish. So if WhisperSpeech continues to progress like I think it will, I see it surpassing Bark and the other options out there.

Also, I tested Coqui and a few others over the weekend and all of their voices are inferior, so...I see the best open source ones being Bark and WhisperSpeech. Not referring to proprietary ones of course (Hey Siri!).

I tried multiple models and voices with this and none produced as high quality as Bark or WhisperSpeech, but many were much, much, much faster...but again, you'll get an electronic-sounding, computer-sounding...etc. voice.

https://github.com/coqui-ai/TTS

kurianbenoy-sentient commented 9 months ago

I was looking at two aspects mainly:

Overall TTS quality (compared to Coqui TTS, Bark, OpenVoice, etc.)
Voice cloning quality (compared between OpenVoice and this project)

It looks like the only way is to compare voice samples and judge which one is better. Even the usual metric for TTS, MOS (mean opinion score), is a score that humans assign based on audio quality.

jpc commented 9 months ago

Manual listening tests with MOS seem to be the only reliable metric right now. Could be an interesting community project to make a leaderboard for TTS models with crowdsourced scoring.
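For anyone who wants to try the crowdsourced scoring idea, here is a minimal sketch of how the ratings could be aggregated. It assumes listeners rate each sample on the usual 1-5 scale and uses a normal-approximation 95% confidence interval. The function name `mos_with_ci`, the model names, and the ratings are all made up for illustration; nothing here is part of WhisperSpeech.

```python
import math
from statistics import mean, stdev


def mos_with_ci(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `ratings` is a list of listener scores on the 1-5 scale.
    The interval uses a normal approximation, which is rough for small samples.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = mean(ratings)
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings, invented purely for illustration (not real results).
ratings = {
    "model_a": [4, 5, 4, 3, 4],
    "model_b": [3, 3, 4, 2, 3],
}

# Sort by MOS, highest first, to get a simple leaderboard.
leaderboard = sorted(
    ((name, *mos_with_ci(scores)) for name, scores in ratings.items()),
    key=lambda row: row[1],
    reverse=True,
)

for name, mos, half_width in leaderboard:
    print(f"{name}: MOS {mos:.2f} +/- {half_width:.2f}")
```

With only a handful of ratings per model the intervals overlap heavily, so a real leaderboard would need many more listeners before the ranking means anything.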

BBC-Esq commented 9 months ago

Yeah, and it'd be hard though because audio is much more subjective...

The voice cloning seems subjective to a certain extent, but I suppose you could try to prove it by examining the spectrograms or wavegrams of the audio to see if they have similar structures...But still I think it's partially subjective.

Best approach IMHO: run a simple survey asking people about the voice cloning aspect and the quality of the non-cloned voices. Speed should be measurable as long as apples-to-apples comparisons are done (e.g. using the same beam size, quantization level, etc.).
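For the speed part, a rough sketch of an apples-to-apples timing harness could look like the following. It measures the real-time factor (seconds of compute per second of generated audio) and takes the median over several runs. The `synthesize` callable is a hypothetical wrapper you would write around each model so that it takes text and returns a sequence of audio samples; it is not an actual WhisperSpeech or Bark API.

```python
import time
from statistics import median


def real_time_factor(synthesize, text, sample_rate, warmup=1, runs=5):
    """Median wall-clock seconds spent per second of generated audio.

    `synthesize` is a hypothetical callable: text -> sequence of samples.
    Values below 1.0 mean faster than real time.
    """
    for _ in range(warmup):
        synthesize(text)  # first calls often include model loading or caching
    factors = []
    for _ in range(runs):
        start = time.perf_counter()
        samples = synthesize(text)
        elapsed = time.perf_counter() - start
        audio_seconds = len(samples) / sample_rate
        factors.append(elapsed / audio_seconds)
    return median(factors)


# Stand-in "model" so the sketch runs on its own: it sleeps briefly and
# returns one second of silence at 24 kHz.
def fake_synthesize(text):
    time.sleep(0.01)
    return [0.0] * 24000


rtf = real_time_factor(fake_synthesize, "Hello world", sample_rate=24000)
print(f"real-time factor: {rtf:.3f}")
```

To keep the comparison fair, every model should get the same text, the same hardware, and the same settings, and on a GPU the wrapper has to block until the audio is actually ready, otherwise the timer stops too early.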

BBC-Esq commented 6 months ago

Congratulations on the new fast-small t2s model. Here are the updated benchmarks!
