coqui-ai / TTS

πŸΈπŸ’¬ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Feature request] Using local whisper transcription with word time stamps to remove tts hallucinations #3315

Closed DrewThomasson closed 9 months ago

DrewThomasson commented 11 months ago

πŸš€ Feature Description

Currently, all the transformer-based TTS models I've run into suffer from hallucinations, especially at the end of the generated audio, even with XTTS v2. I was wondering if there is any planned way to remove at least the hallucinations that appear at the end. For instance, the input text could be "hey, bob" and the output audio will be "hey, bob. Other!"

Solution

You could use a cleanup method built on this whisper repo: https://github.com/linto-ai/whisper-timestamped. Generate a transcription with timestamps for each word in the generated output audio, then compare that timestamped transcription against the words in the input text given to the TTS model.
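
A minimal sketch of such a cleanup pass, assuming `whisper-timestamped` (`pip install whisper-timestamped`) and `soundfile` are installed. The function names `trim_hallucinated_tail` and `normalize` are illustrative, not part of any existing API:

```python
# Sketch: trim audio after the last word that matches the input text,
# using whisper-timestamped's per-word timestamps.
import re

import soundfile as sf
import whisper_timestamped as whisper


def normalize(word: str) -> str:
    """Lowercase and strip punctuation so 'Bob!' matches 'bob'."""
    return re.sub(r"[^\w']", "", word).lower()


def trim_hallucinated_tail(wav_path: str, input_text: str, out_path: str,
                           pad_s: float = 0.15) -> None:
    model = whisper.load_model("base")
    audio = whisper.load_audio(wav_path)
    result = whisper.transcribe(model, audio, language="en")

    # Flatten the per-segment word timestamps into one list.
    words = [w for seg in result["segments"] for w in seg["words"]]
    expected = [normalize(w) for w in input_text.split()]

    # Walk the transcript in step with the input text; anything spoken
    # after the last expected word is treated as a hallucinated tail.
    end_time = None
    idx = 0
    for w in words:
        if idx < len(expected) and normalize(w["text"]) == expected[idx]:
            idx += 1
            end_time = w["end"]

    if end_time is None:
        return  # nothing matched; leave the audio untouched

    data, sr = sf.read(wav_path)
    data = data[: int((end_time + pad_s) * sr)]
    sf.write(out_path, data, sr)
```

The small `pad_s` padding after the last matched word is a guess to avoid clipping trailing consonants; the greedy word matching is deliberately forgiving of minor ASR errors, at the cost of occasionally keeping a stray word.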

Additional context

I'm planning to build and use a method like that for my own project, and was just wondering if anything like that was in the works. Thanks!

cmp-nct commented 11 months ago

It definitely needs a feedback loop. I tested xtts-v2 and it just speaks gibberish on and on after the sentence is done. But you don't need timestamps; you need real-time feedback.

xtts-v2 has two major issues here:
1) hallucination, often at the end
2) cutting the last word short - basically the opposite of the hallucination problem

It needs some alignment between output tokens and input tokens. Maybe it can also be fixed with fine-tuning.
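
As a rough illustration of the feedback-loop idea, a retry wrapper could catch both failure modes (extra words appended, last word dropped). This is only a sketch: `synthesize_with_check` is a hypothetical name, `transcribe_words` is an assumed helper returning the normalized word list from whisper-timestamped as in the sketch above, and `TTS.api.TTS.tts_to_file` is Coqui's standard synthesis call:

```python
# Sketch: synthesize, transcribe, and retry when the transcript either
# gains words at the end (hallucination) or loses the final word (truncation).
from TTS.api import TTS

MAX_RETRIES = 3


def synthesize_with_check(tts: TTS, text: str, out_path: str, **kwargs) -> bool:
    expected = [normalize(w) for w in text.split()]
    for _ in range(MAX_RETRIES):
        tts.tts_to_file(text=text, file_path=out_path, **kwargs)
        spoken = transcribe_words(out_path)  # assumed helper, see above

        # Crude heuristics: more words than the input suggests a hallucinated
        # tail; a mismatched or missing final word suggests truncation.
        hallucinated = len(spoken) > len(expected)
        truncated = not spoken or spoken[-1] != expected[-1]
        if not (hallucinated or truncated):
            return True
    return False  # every attempt failed the check; keep the last output
```

Since XTTS v2 samples during generation, a retry can produce a different (and hopefully clean) output, but this is a workaround; the alignment or fine-tuning fixes mentioned above would address the root cause.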

stale[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.