Shaunwei / RealChar

🎙️🤖Create, Customize and Talk to your AI Character/Companion in Realtime (All in One Codebase!). Have a natural seamless conversation with AI everywhere (mobile, web and terminal) using LLM OpenAI GPT3.5/4, Anthropic Claude2, Chroma Vector DB, Whisper Speech2Text, ElevenLabs Text2Speech🎙️🤖
https://RealChar.ai/
MIT License
5.96k stars 729 forks

Reduce VAD latency. #430

Closed hksfang closed 1 year ago

hksfang commented 1 year ago

Use Whisper to transcribe interim speech chunks so that transcription happens while the user is still speaking. Test video: https://drive.google.com/file/d/19I73GiIcz3Rkj6zb2KuTD93GLXFFd3y8/view?usp=sharing
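For readers following along, here is a minimal sketch of the idea, not the code in this PR. It assumes audio arrives as frames already labelled speech or silence by some VAD. The `Frame` shape, the `gap_frames` threshold, and the `transcribe` callable (standing in for a Whisper call) are all hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# (audio bytes, is_speech flag from a VAD); a hypothetical shape for this sketch
Frame = Tuple[bytes, bool]


def transcribe_while_speaking(
    frames: Iterable[Frame],
    transcribe: Callable[[bytes], str],  # stand-in for a Whisper call (assumption)
    gap_frames: int = 3,                 # silence frames that close a chunk (assumption)
) -> str:
    """Transcribe each interim chunk as soon as a silence gap closes it."""
    parts: List[str] = []
    buffer = bytearray()
    silence = 0
    for audio, is_speech in frames:
        if is_speech:
            buffer.extend(audio)
            silence = 0
            continue
        silence += 1
        # A long enough gap ends the interim chunk: transcribe it now
        # instead of waiting for the end of the whole utterance.
        if silence == gap_frames and buffer:
            parts.append(transcribe(bytes(buffer)))
            buffer.clear()
    # Only the trailing chunk is left to transcribe after speech ends.
    if buffer:
        parts.append(transcribe(bytes(buffer)))
    return " ".join(parts)
```

With this structure, the work remaining when the speaker stops is limited to the audio after the last gap, which is where the latency saving would come from.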

Shaunwei commented 1 year ago

Thanks for the PR, @hksfang. Quick question: comparing with and without the change, how much has it reduced latency?

hksfang commented 1 year ago

The result depends on the speaker's speaking pattern. For longer speech with more 'gaps' in between, this PR reduces latency considerably; for shorter speech, or speech with few or no gaps, it doesn't do much.

I found it difficult to benchmark this change, but I can provide an example. Using this audio file as an example: before this PR, transcription after the end of speech took 5.5s; after this PR, only 2s. That's a reduction of roughly 64% in transcription latency ((5.5 - 2) / 5.5).
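To illustrate why the gain depends on gaps, here is a toy model rather than a benchmark. It assumes transcription cost is proportional to audio length, and that every chunk except the last has already been transcribed by the time the speaker stops. The function name and the real-time factor `rtf` are hypothetical.

```python
from typing import List


def post_speech_latency(chunk_seconds: List[float], rtf: float, interim: bool) -> float:
    """Seconds of transcription work remaining once the speaker stops.

    rtf is a hypothetical real-time factor: compute seconds per second of audio.
    """
    if not chunk_seconds:
        return 0.0
    # Without interim transcription the whole utterance is processed at the end;
    # with it, only the chunk after the last gap is still pending.
    pending = chunk_seconds[-1] if interim else sum(chunk_seconds)
    return pending * rtf


gappy = [4.0, 3.0, 2.0]  # long speech split by two gaps
continuous = [9.0]       # same total length, no gaps

print(post_speech_latency(gappy, 0.5, interim=False))      # 4.5
print(post_speech_latency(gappy, 0.5, interim=True))       # 1.0
print(post_speech_latency(continuous, 0.5, interim=True))  # 4.5
```

Under these assumptions, speech with gaps leaves only the short final chunk pending, while continuous speech sees no improvement because the single chunk is the whole utterance.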

Shaunwei commented 1 year ago

Wow, this is super awesome. Thanks for making the change. LGTM.