ccappetta / bidirectional_streaming_ai_voice

Python scripts to handle a two way voice conversation with Anthropic Claude, using ElevenLabs, Faster-Whisper, and Pygame.
MIT License

Reduce latency #10

Closed danx0r closed 5 months ago

danx0r commented 5 months ago

Thought it made sense to collect thoughts about latency.

ccappetta did some profiling:

- Spacebar press
- User audio file is generated and faster-whisper transcribes it to text (~4.5 seconds)
- Transcribed user text is sent to Anthropic and the first LLM response token is streamed back (~3.5 seconds)
- Enough LLM tokens are streamed back to generate a ~150 character chunk and send that text off to ElevenLabs (~1.25 seconds)
- ElevenLabs generates and returns the audio file (~1.25 seconds)
- Pygame playback begins (~0.1 seconds)
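Summing those stages: 4.5 + 3.5 + 1.25 + 1.25 + 0.1 ≈ 10.6 seconds from spacebar press to the start of playback, which is the budget we're trying to shrink.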

IIUC, presently the whisper model doesn't start transcribing until user input is finished (user hits spacebar). Transcription is atomic: all of the user input is transcribed and then returned as a single text string.

I also did some profiling; my results seem consistent with CC's. For transcription of input, there seems to be a fixed overhead of ~4 seconds even if user input is very short (2 words). A 60-word input took ~7 seconds. We can model this as t = 4 + w / 20, where t is seconds taken and w is the number of words.
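Plugging in: a 2-word input gives 4 + 2/20 ≈ 4.1 seconds (essentially just the fixed overhead), and a 60-word input gives 4 + 60/20 = 7 seconds, matching the measurements above.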

I'm seeing 1.5-2.0 sec response time from Anthropic

Pipeline STT: The right approach seems to be to start transcribing user input speech asynchronously so it runs in parallel with recording. CC mentioned this project: https://github.com/KoljaB/RealtimeSTT Another candidate (this one uses whisper): https://github.com/ufal/whisper_streaming
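To make the idea concrete, here's a minimal sketch of recording and transcribing in parallel. It assumes faster-whisper plus sounddevice for capture (the capture library is my assumption, not what the repo currently uses), and it naively transcribes fixed 5-second windows as they arrive rather than the smarter overlapping-buffer logic that RealtimeSTT and whisper_streaming implement. The point is just that when the user hits the spacebar, only the last partial window is left to process.

```python
# Sketch: transcribe while recording, so most STT work is done before spacebar.
# Assumes faster-whisper + sounddevice; fixed 5-second windows, no overlap.
import threading
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000
WINDOW = 5 * SAMPLE_RATE          # transcribe every ~5 seconds of new audio

model = WhisperModel("small", device="cpu", compute_type="int8")
buffer = np.zeros(0, dtype=np.float32)
lock = threading.Lock()
stop = threading.Event()
pieces = []                        # transcribed text chunks, in order

def mic_callback(indata, frames, time_info, status):
    global buffer
    with lock:
        buffer = np.concatenate([buffer, indata[:, 0].copy()])

def transcriber():
    done = 0                       # number of samples already transcribed
    while True:
        with lock:
            available = len(buffer) - done
            finished = stop.is_set()
            window = None
            if available >= WINDOW or (finished and available > 0):
                window = buffer[done:done + available]
                done += len(window)
        if window is not None:
            segments, _ = model.transcribe(window, beam_size=1, vad_filter=True)
            pieces.append(" ".join(s.text.strip() for s in segments))
            print("partial:", pieces[-1])
        elif finished:
            break                  # nothing left and recording has stopped
        else:
            sd.sleep(200)          # wait for more audio to accumulate

worker = threading.Thread(target=transcriber)
with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                    callback=mic_callback):
    worker.start()
    input("Recording... press Enter (stand-in for the spacebar) to stop: ")
    stop.set()
worker.join()
print("full transcript:", " ".join(pieces))
```

The obvious cost is that each window is transcribed without the context of the previous one, which is exactly the problem the two projects above solve properly.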

I also see considerable variability with elevenlabs response time, from about 1 second up to 3-4 seconds. It doesn't seem to correlate with the length of the response. I think this occasional delay was apparent in the last youtube chat.

Personally, I'm not stuck on ElevenLabs -- TTS has been around a long time, and I'm sure there are mature, optimized, open-source alternatives. I would prefer less "character" in my voices; it's OK for it to be a tad robotic. The emotion ElevenLabs voices express is algorithmically derived by their proprietary models. That's probably great for casual users to have fun with, but I would be fine with a very "neutral" voice like HAL 9000. Just my 2c.

danx0r commented 5 months ago

Did some testing of TTS alternatives (Text-to-speech, ie what 11labs does). Thought I would leave my impressions here for posterity.

First: Strangely, I found it difficult to find robust, open-source, offline (ie not an API to some server) speech generation tools using state-of-the-art algorithms. There were older programs like festival and eSpeak, in case you want your voice to sound like the love child of T-Pain and Stephen Hawking.

I did find 3 candidates that I think deserve consideration, should we need an alternative to ElevenLabs. Unfortunately, none has realtime streaming built in (except WhisperSpeech, which has other problems). However, most of them chunk large inputs by sentence, and it shouldn't be too difficult to work asynchronously (ie pipeline) at sentence granularity; there's a sketch of that after the list below.

coqui-ai -- This seems to be the most popular; looks like they forked Mozilla TTS (I couldn't get the original to build, last commit was 3 years ago; coqui has recent commits). The source is open but the models (at least some) have non-commercial licensing. Unclear how far they are with streaming; it's mentioned in their issues as a desired feature.

OpenVoice -- seems legit Open Source for code, and can create a model on as little as 20 seconds of audio (this avoids the licensing issues with downloaded models). The code is simple and the documentation looks good. This would probably be my first choice as a starting point.

WhisperSpeech -- they say they are "reversing whisper", whatever that means. Code is recent; vocal quality is good. However, it needs a GPU to run and is still slow (looks to be about 1X realtime), and it has the feel of a research project. I include it primarily because it has streaming capability, so maybe there are some insights in their code.
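None of the three streams out of the box, but since they all chunk by sentence anyway, a sentence-granularity pipeline gets most of the latency win: synthesize sentence N+1 while sentence N is playing. A minimal sketch, with `synthesize()` as a hypothetical stand-in for whichever backend we pick, and pygame for playback since the repo already uses it:

```python
# Sketch of sentence-granularity TTS pipelining: synthesis of the next sentence
# overlaps with playback of the current one. `synthesize()` is a hypothetical
# stand-in for whatever backend we choose (coqui, OpenVoice, ElevenLabs, ...).
import queue
import re
import threading
import pygame

def synthesize(sentence: str) -> str:
    """Hypothetical TTS call: returns the path of a WAV file for `sentence`."""
    raise NotImplementedError

def split_sentences(text: str):
    # Crude splitter; good enough to chunk LLM output for a TTS queue.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def speak(text: str):
    audio_q: "queue.Queue[str | None]" = queue.Queue(maxsize=2)

    def producer():
        for sentence in split_sentences(text):
            audio_q.put(synthesize(sentence))   # runs ahead of playback
        audio_q.put(None)                        # sentinel: no more audio

    threading.Thread(target=producer, daemon=True).start()

    pygame.mixer.init()
    while (wav_path := audio_q.get()) is not None:
        sound = pygame.mixer.Sound(wav_path)
        channel = sound.play()
        while channel.get_busy():               # wait for this sentence to finish
            pygame.time.wait(50)
```

The bounded queue (maxsize=2) keeps synthesis only a couple of sentences ahead, so a long LLM response doesn't pile up unplayed audio files.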

danx0r commented 5 months ago

(STT) Looking at whisper_streaming. Seems to work well with a GPU, too slow without one. Interesting, since my machine has no GPU and yet seems to transcribe the input audio quite quickly (around 20 words per second once it gets going).

I suspect it is a matter of setting the correct parameters for whisper.
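For reference, these are the usual faster-whisper knobs that control CPU transcription speed; the specific values below are illustrative assumptions, not what the repo currently ships:

```python
# Illustrative faster-whisper settings that trade a little accuracy for CPU speed;
# these values are assumptions, not the project's current configuration.
from faster_whisper import WhisperModel

model = WhisperModel(
    "small",              # smaller model; the 3X-realtime result below used "small"
    device="cpu",
    compute_type="int8",  # int8 quantization is much faster than float32 on CPU
)

segments, info = model.transcribe(
    "user_input.wav",
    language="en",        # skipping language detection saves a pass over the audio
    beam_size=1,          # greedy decoding instead of beam search
    vad_filter=True,      # skip silent stretches entirely
)
print(" ".join(segment.text.strip() for segment in segments))
```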

Here's example output (running on gpu in cloud with simulated realtime); the columns are emission time, segment start, and segment end, all in milliseconds, followed by the transcribed text:

3312.5257 920 2860  Testing 1, 2, 3.
6395.0167 3620 4240  This is
7391.6059 4240 5860  whisper streaming.
11935.2713 6720 8220  We're going to see if it can
12940.6056 8220 11240  break up the sentences and chunk up the input.
15221.4479 12240 12900  Testing.
18936.8753 14080 16740  So now we have to put a little more audio in
25236.3958 16740 23460  here. And a little more. And then just a little more. And then let's have a really long sentence that kind of
25786.7033 23460 24220  runs on for a
26810.9887 24220 25360  while and never seems to
27810.7953 25380 26580  get to its point.
29938.0305 27360 28500  And then we'll have a short
31027.3826 28500 28980  sentence.
32028.8050 29380 30560  And then we'll have some silence.

danx0r commented 5 months ago

OK, changed to the "small" model and transcription is about 3X realtime on my non-gpu machine. The buffering logic seems to work; for a 30 second test file, latency was 3-5 seconds. Meaning, when the user hits the spacebar there should be at most ~5 seconds of audio left to transcribe (under 2 seconds of work at 3X), and the ~4-second upfront cost has already been paid.

danx0r commented 5 months ago

whisper_streaming: Tested async capture & transcribe with microphone input; it worked pretty well. The example code is implemented as an HTTP service, which might be convenient in the case where you want to run the server in the cloud (GPU support for example).

https://github.com/ufal/whisper_streaming?tab=readme-ov-file#server----real-time-from-mic
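For the record, here is the rough shape of that setup from the client side: capture the mic locally and stream raw PCM to the server, printing transcripts as they come back. The port, the 16 kHz / 16-bit mono format, and the line-based responses are assumptions on my part; the linked README has the actual invocation and wire format.

```python
# Sketch of piping local mic audio to a remote streaming-STT server and printing
# the transcripts it sends back. The server address/port, audio format, and
# line-based response protocol are assumptions; check the whisper_streaming README.
import socket
import threading
import numpy as np
import sounddevice as sd

HOST, PORT = "my-gpu-box.example.com", 43007   # hypothetical server address
SAMPLE_RATE = 16000

sock = socket.create_connection((HOST, PORT))

def print_transcripts():
    # Assumes the server streams back transcript lines as they stabilize.
    for line in sock.makefile("r", encoding="utf-8", errors="replace"):
        print("server:", line.rstrip())

threading.Thread(target=print_transcripts, daemon=True).start()

def mic_callback(indata, frames, time_info, status):
    # sounddevice gives float32 in [-1, 1]; convert to 16-bit PCM bytes.
    pcm16 = (indata[:, 0] * 32767).astype(np.int16)
    sock.sendall(pcm16.tobytes())

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32",
                    callback=mic_callback):
    input("Streaming mic to server; press Enter to stop.\n")

sock.close()
```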

ccappetta commented 5 months ago

this is great stuff. RVC is another open local one on my radar to check out - https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/docs/en/README.en.md

it's occurring to me that we may want to have the option to keep a bunch of different variations available, and I'm thinking different branches might be the way to do that.

e.g. I have a colleague who wanted to run it on a pretty lightweight windows laptop, so we whipped up a copy that uses the AssemblyAI API for the transcription instead of relying on his local GPU. Maybe the main branch is where I put whatever windows version I'm most recently running on the youtube conversations, and then we have separate branches for e.g. linux-friendly, mac-friendly, AssemblyAI transcription, RVC, whisper_streaming, etc. Thoughts?

danx0r commented 5 months ago

Funny you should mention RVC - I use it to do music! Here's something I'm working on (warning: partially ironic lounge music) https://www.youtube.com/watch?v=hqiqIcmcZ1M

TBH I've only used it to clone singing parts, I have no idea how to get it to generate speech from text. I'll look into it.

Hmm, I don't see a TTS option - it seems entirely devoted to cloning

ccappetta commented 5 months ago

big round of closing comments for my own organizational sanity, please shout to continue this convo!