ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Continuous Recognition Possible? #304

Open KTRosenberg opened 1 year ago

KTRosenberg commented 1 year ago

Has anyone tried extending this with the ability to go indefinitely beyond 30 seconds, continuously, in real time? It would also be useful to implement "hypothesis" results in the sense that Apple's SFSpeechRecognizer does, i.e. a callback fires every time a new word is recognized. I think this could be implemented with some buffering and segmenting on silence to avoid splitting words, and maybe a temporary buffer to hold the hypothesis, but otherwise I am not sure. The end goal isn't to get a final transcription, but rather to have a sliding window. This would be super helpful.
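
To make the desired behavior concrete, here is a rough sketch of the kind of interface I have in mind (everything here is made up for illustration; nothing like this exists in whisper.cpp):

```cpp
// Illustration only: a hypothetical callback-style wrapper around whisper.cpp.
// None of these types or methods exist in the repo; they just show the shape
// of the API being asked for.
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct Hypothesis {
    std::string text;  // transcription of the current sliding window
    bool        final; // true once a silence boundary "commits" the text
};

class ContinuousRecognizer {
public:
    // Fired every time the window is re-decoded, ideally around once per new word.
    void on_result(std::function<void(const Hypothesis &)> cb) { cb_ = std::move(cb); }

    // The app pushes 16 kHz mono float PCM; internally the recognizer would
    // buffer it, segment on silence, and re-run Whisper over the active window.
    void push_audio(const float * samples, size_t n) {
        window_.insert(window_.end(), samples, samples + n);
        if (cb_) cb_({/* text = */ "", /* final = */ false}); // decoding omitted in this sketch
    }

private:
    std::function<void(const Hypothesis &)> cb_;
    std::vector<float> window_;
};
```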

JohnnyOpcode commented 1 year ago

I think this sort of goes hand-in-hand with VAD (voice activity detection) as well. Make it loop and toss events when silence is detected. Nonetheless this is one of the coolest repos I am following atm and my kudos go to the brilliant craftsman who forged whisper.cpp. It opens many possibilities..

RndyP commented 1 year ago

Yes. See Discussion #211

KTRosenberg commented 1 year ago

Yes. See Discussion #211

I see the discussion, but not a full concrete example. Is there something more complete that has been pieced together?

ggerganov commented 1 year ago

@KTRosenberg It depends on how complete you want it to be.

The stream example is a good starting point and a prototype for real-time ASR. It also supports basic VAD as shown in #211. It's not a production-ready application, but is simple enough to be modified and extended in any way you like.

Today I also saw that someone made a basic electron app for real-time transcription that looks interesting: https://github.com/ggerganov/whisper.cpp/issues/137#issuecomment-1363301976

There is also a prototype for short voice command detection in the command example.

A lot of different bits and pieces are already available.

KTRosenberg commented 1 year ago

I see. The bit I’m a little unclear on is getting around the 30-second limit, but I guess those examples have to achieve that somehow. What I need is a way to buffer all the words recognized so far and not re-recognize the entire audio from the beginning, outside the current window, but again, it sounds like the linked examples do something like that.

I remember, however, that there was some bug with the default VAD that would cause an infinite pause.

How would you recommend that I combine these pieces?

KTRosenberg commented 1 year ago

Hello again. I'm still a little stuck on exactly what I need to change to get things working. To recap, I'm looking for something like Apple's SFSpeechRecognizer, which outputs a new result every time a new word is detected, but has a 60-second limitation. I'm trying to do the same thing with Whisper in real time, but use some sort of sliding window to get around the time limitation. I also hope this could work in real time with acceptable performance on an iOS device.

On that last note, the stream example uses a lot of SDL, which is a dependency I was hoping to avoid since I already use native iOS and macOS APIs (e.g. AVAudio). I wonder if there's a more native equivalent of that example that would be easier to try.

RndyP commented 1 year ago

Over in discussion #211 I described how I am doing real-time transcription. It works well. I should add that if I encounter more than 10 seconds without a gap, I go ahead and process the entire chunk, and leave about 300 ms of overlap for the next chunk. This way, it will never get close to 60 seconds before it processes.

What is your platform? Are you having trouble with the actual audio acquisition? I'm on Windows and use the wave API, and essentially double-buffer by allocating twenty 1-second buffers for Windows to use, and also keep my own ring buffer. In this manner I get an event fired by Windows every second; I grab the data, stuff the ring buffer, process it by creating an envelope (or you can use a VAD), and then look for a gap starting at the end of the buffer.

I process the data in a worker thread while the acquisition continues.

The end result is single words are processed with a delay of about 2 seconds (the events are every second), and Whisper takes about 0.8 seconds to process the shortest chunk. Longer sentences will be forced to process every 10 seconds, so the transcription does not get too far behind.
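
Roughly, the chunking policy described above could be sketched like this (the silence test and the hand-off to Whisper are placeholders here, not the actual implementation):

```cpp
// Sketch of the chunking policy: flush on a trailing gap, force a flush at
// ~10 s, and carry ~300 ms of overlap into the next chunk. The silence test
// and run_whisper_async() are placeholders for the real envelope/VAD and the
// worker-thread hand-off to whisper_full().
#include <cstddef>
#include <vector>

constexpr int    kSampleRate = 16000;                     // Whisper expects 16 kHz mono
constexpr size_t kMaxChunk   = 10 * kSampleRate;          // force processing at ~10 s
constexpr size_t kOverlap    = kSampleRate * 300 / 1000;  // ~300 ms kept for the next chunk

// Placeholder silence test: a plain energy threshold stands in for the envelope/VAD.
static bool is_silence(const float * s, size_t n) {
    double e = 0.0;
    for (size_t i = 0; i < n; ++i) e += s[i] * s[i];
    return n > 0 && e / n < 1e-4;
}

// Placeholder: the real code hands the chunk to Whisper on a worker thread
// so audio acquisition keeps running.
static void run_whisper_async(const float * /*samples*/, size_t /*n*/) {}

static std::vector<float> ring;   // audio accumulated since the last flush

// Called once per second with the newest block of samples (wave API on Windows,
// AVAudio on Apple platforms, SDL in the stream example, ...).
void on_audio_block(const float * samples, size_t n) {
    ring.insert(ring.end(), samples, samples + n);

    if (is_silence(ring.data() + ring.size() - n, n)) {
        // Gap at the end of the buffer: process the whole chunk and start fresh.
        run_whisper_async(ring.data(), ring.size());
        ring.clear();
    } else if (ring.size() >= kMaxChunk) {
        // No gap for ~10 s: process anyway so the transcript doesn't fall behind,
        // keeping ~300 ms of tail so a word split at the boundary isn't lost.
        run_whisper_async(ring.data(), ring.size());
        std::vector<float> tail(ring.end() - (std::ptrdiff_t) kOverlap, ring.end());
        ring.swap(tail);
    }
}
```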

KTRosenberg commented 1 year ago

Hi. So lower-level audio like this is definitely out of my area, so the issue for me is just “how to do this?” I’m focused on the interaction-techniques element of my project. Typically, I would just use the libraries I have. My platform is iPadOS / macOS, which use AVAudio and CoreAudio. The built-in system has a callback containing a float buffer, I think, but I typically just let the library do its thing. SDL probably uses this callback internally on macOS, iPadOS, and iOS.

The speech recognizer essentially operates at close to real time. <0.5 seconds after each word is spoken, it spits out an updated transcription. I would basically like to recreate this behavior using Whisper, if it’s capable of operating at that speed. It’s because Whisper is better at recognizing domain-specific words, from what I can tell. However, if it takes several seconds per word for Whisper to give user feedback, that won’t quite work.

But if it does work, that would be great.

Another behavior I’d like to introduce is that the user should be able to shut off and turn on recognition at any time.

In any case, I’m currently operating under some too-close-for-comfort deadlines and really doubt I could recreate something similar to what you’re describing quickly enough. This might be a bit of an ask, but is there any way to get a running demo of what I’m describing for the Apple platforms? Again, I’m not sure if it can operate as fast as I’m describing, since I think I need < 1 second to make it feel good, but I think < 2 seconds might still be good. This would be a major help.

For what it’s worth, this is the official example https://developer.apple.com/documentation/speech/recognizing_speech_in_live_audio?language=objc

However, this does not show the callback system, but it’s basically the same. The only difference is that this one times out automatically, which is what I want to avoid. I want to turn on/off at will.

RndyP commented 1 year ago

Maybe @ggerganov can chime in here, but it seems to me that what you are asking may be difficult. Whisper seems to have a fixed processing offset; by that I mean that the smallest possible chunk will take about, let's say, 0.8 seconds to process. This means that with many small 1-second chunks, the CPU loading is 80%. Longer chunks do not take proportionally more time, so there is an advantage to processing in 10-second or longer chunks. I am seeing about 4.5 seconds for 10-second chunks. Also, if you are trying to get real-time transcription at word granularity, you are going to have to overlap carefully.

KTRosenberg commented 1 year ago

Maybe @ggerganov can chime in here, but it seems to me that what you are asking may be difficult. Whisper seems to have a fixed processing offset; by that I mean that the smallest possible chunk will take about, let's say, 0.8 seconds to process. This means that with many small 1-second chunks, the CPU loading is 80%. Longer chunks do not take proportionally more time, so there is an advantage to processing in 10-second or longer chunks. I am seeing about 4.5 seconds for 10-second chunks. Also, if you are trying to get real-time transcription at word granularity, you are going to have to overlap carefully.

I might be describing things poorly. Maybe I should record a video of what I mean?

KTRosenberg commented 1 year ago

Here's a speed test of my existing speech recognition (using Apple's proprietary recognizer): Google Drive Link. It has some errors during the quicker hypothesis case, but that's expected before it magically finalizes and corrects words.

Currently since it's closed software, it's limited to 60 seconds and there's not much I can do about it. I have to do some weird buffering of the text. I don't currently have a way to bridge the gap in audio, so that can create awkward results.

RndyP commented 1 year ago

I viewed your video, and see what you mean. It responds very quickly. Again, I'm not sure whisper is going to deliver that smoothly because of what I'm calling the "offset" problem. @ggerganov 's video over at #211 works pretty well, better than mine, perhaps he can shed some light on this.

KTRosenberg commented 1 year ago

I viewed your video, and see what you mean. It responds very quickly. Again, I'm not sure whisper is going to deliver that smoothly because of what I'm calling the "offset" problem. @ggerganov 's video over at #211 works pretty well, better than mine, perhaps he can shed some light on this.

You're right -- that video shows something that seems to run faster.

This might be crazy, but what if there were two recognizers: 1) one runs continuously and waits for the user to hit a button to split it (still needs to overcome the 30-second limit); 2) another recognizer operates at the word level to create the appearance of continuous recognition. The second could even be the Apple recognizer, since it's more for the presentation.

I think an all-in one solution using Whisper would make more sense than trying to do something complex like that though.

Yes I would be interested in hearing from ggerganov.

On another note, even though this is about whisper, I wonder if in the meantime I can do something with SFSpeechRecognizer to make it behave better.

ggerganov commented 1 year ago

Whisper does not have an option to process word-by-word. It always takes a 30s audio buffer as input. If your hardware allows fast processing of the model (i.e. < 0.5 s per run), then you might be able to achieve what you are looking for.

For example, running the stream example with your audio file, using tiny.en at --step 500 on an MBP M1 Pro, gives the following result:

https://user-images.githubusercontent.com/1991296/216752726-b7441c0c-1cad-41e4-afdc-79eb2b698699.mp4

For a similar example for mobile, you can check the whisper.objc example. I am using the base model there, but you can simply switch it to tiny and get faster response time.

Another strategy you might try is to use the tiny model for real-time transcription, and then from time to time run a bigger model (base or small) on past audio for better refinement. Again, this is prototyped in the whisper.objc example.

But we don't have anything ready out-of-the-box. You'll have to do some tinkering, adjustment of the buffer sizes, splitting the audio properly, etc. It's difficult because the main application of Whisper is for long, non-real-time audio processing.
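
For orientation, the core of a --step style loop boils down to re-running whisper_full() on the most recent audio every few hundred milliseconds. Here is a condensed sketch (not the actual stream example; the audio capture is stubbed out with silence so it stays self-contained):

```cpp
// Condensed sketch of a --step style loop: re-decode the most recent audio on
// every step with whisper_full(). A real app would pull samples from SDL,
// AVAudio, etc. instead of the silent stub below.
#include "whisper.h"

#include <cstdio>
#include <string>
#include <vector>

// Placeholder capture: returns `ms` of silence instead of microphone audio.
static std::vector<float> get_audio(int ms) {
    return std::vector<float>(ms * WHISPER_SAMPLE_RATE / 1000, 0.0f);
}

int main() {
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-tiny.en.bin");
    if (!ctx) return 1;

    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.print_progress = false;
    params.print_realtime = false;
    params.single_segment = true;   // one segment per step, like the stream example
    params.no_context     = true;   // don't condition on text from previous steps
    params.language       = "en";

    const int length_ms = 5000;     // how much trailing audio to re-decode each step

    while (true) {
        // In a real app: sleep ~500 ms (--step 500), then grab the newest window.
        std::vector<float> pcm = get_audio(length_ms);

        if (whisper_full(ctx, params, pcm.data(), (int) pcm.size()) != 0) break;

        std::string text;
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            text += whisper_full_get_segment_text(ctx, i);
        }
        printf("\33[2K\r%s", text.c_str());   // overwrite the line: the current "hypothesis"
        fflush(stdout);
    }

    whisper_free(ctx);
    return 0;
}
```

The refinement idea is then mostly a matter of keeping the raw PCM around and re-running the same call with a base or small model over audio that has already been transcribed.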

david-rokeby commented 1 year ago

I have heavily customized stream.cpp for continuous real-time use. It is designed specifically to emit both completed phrase streams and in-progress phrase streams, separately. (The application is a bit unusual…) I dump audio and emit complete phrases on pauses, but I also count how many times in a row a segment’s text is the same. On the third identical repetition, I emit and dump audio up to that segment’s end timestamp. There are a few other tricks and subtleties to get it clean. For example, when I detect the third repetition, I walk forward from segment[0], looking for a confident period or question mark ending a qualified segment, and then emit all the consecutive segments that qualify (3 repeats) to form a complete phrase. I do not emit and dump segments that are not clearly leading up to a sentence / phrase end, because that messes up accuracy. Occasionally I fill the 30-second buffer and so do a forced emit/dump in those situations. I also check for matches between the last words of the previous segment and the first words of the next, especially after dumping audio, and try to prune these. It is very comfortably real-time on my M1 Max, even if I restrict it to 4 threads.
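
In very reduced form, the repetition-counting part could look something like this (a sketch using the real whisper.cpp segment accessors; emit_phrase() and drop_audio_until() are placeholders for application-specific plumbing, not the actual implementation):

```cpp
// Sketch of the "emit after three identical decodes" idea. Segment text and
// timestamps come from the real whisper.cpp accessors; emit_phrase() and
// drop_audio_until() stand in for the application-specific parts.
#include "whisper.h"

#include <cstdint>
#include <string>
#include <utility>
#include <vector>

static void emit_phrase(const std::string & text) { (void) text; }  // hand the phrase to the app
static void drop_audio_until(int64_t t1) { (void) t1; }             // discard PCM up to t1 (10 ms units)

static std::vector<std::string> prev_texts;  // segment text from the previous decode
static std::vector<int>         repeats;     // how many decodes in a row it was identical

// Call after every whisper_full() pass over the rolling audio buffer.
void update_segments(struct whisper_context * ctx) {
    const int n = whisper_full_n_segments(ctx);

    std::vector<std::string> texts(n);
    std::vector<int>         counts(n, 1);
    for (int i = 0; i < n; ++i) {
        texts[i] = whisper_full_get_segment_text(ctx, i);
        if (i < (int) prev_texts.size() && texts[i] == prev_texts[i]) {
            counts[i] = repeats[i] + 1;
        }
    }

    // Walk forward from segment 0 and find the last stable (3x identical)
    // segment that ends in a sentence-final mark; everything up to it is
    // treated as a complete phrase.
    int emit_upto = 0;
    for (int i = 0; i < n; ++i) {
        if (counts[i] < 3) break;
        const char last = texts[i].empty() ? ' ' : texts[i].back();
        if (last == '.' || last == '?' || last == '!') emit_upto = i + 1;
    }

    if (emit_upto > 0) {
        std::string phrase;
        for (int i = 0; i < emit_upto; ++i) phrase += texts[i];
        emit_phrase(phrase);
        drop_audio_until(whisper_full_get_segment_t1(ctx, emit_upto - 1));
        prev_texts.clear();
        repeats.clear();
        return;
    }

    prev_texts = std::move(texts);
    repeats    = std::move(counts);
}
```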

KTRosenberg commented 1 year ago

I have heavily customized stream.cpp for continuous real-time use. It is designed specifically to emit both completed phrase streams and in-progress phrase streams, separately. (The application is a bit unusual…) I dump audio and emit complete phrases on pauses, but I also count how many times in a row a segment’s text is the same. On the third identical repetition, I emit and dump audio up to that segment’s end timestamp. There are a few other tricks and subtleties to get it clean. For example, when I detect the third repetition, I walk forward from segment[0], looking for a confident period or question mark ending a qualified segment, and then emit all the consecutive segments that qualify (3 repeats) to form a complete phrase. I do not emit and dump segments that are not clearly leading up to a sentence / phrase end, because that messes up accuracy. Occasionally I fill the 30-second buffer and so do a forced emit/dump in those situations. I also check for matches between the last words of the previous segment and the first words of the next, especially after dumping audio, and try to prune these. It is very comfortably real-time on my M1 Max, even if I restrict it to 4 threads.

That sounds great! Is it something you'd be comfortable sharing? My computer happens to have an M1 Max. I also have an iPad Pro M2.

david-rokeby commented 1 year ago

I can share… it is a bit organic and tailored to a specific application, classed up in C++. I end up wrapping it in a node to include in a node-based set of tools for interaction with machine-learning models.


cerupcat commented 1 year ago

@david-rokeby I'd be interested in your implementation too.

I'm also looking to do real-time(ish) detection for livestream captions. So far, I haven't found a good solution for processing smaller chunks at an efficient/effective rate. @KTRosenberg did you ever get anywhere with this?

mrmachine commented 1 year ago

I can share… it is a bit organic and tailored to a specific application, classed up in C++. I end up wrapping it in a node to include in a node-based set of tools for interaction with machine-learning models.

Please do share 😁

Himnish commented 1 year ago

I can share… it is a bit organic and tailored to a specific application, classed up in C++. I end up wrapping it in a node to include in a node-based set of tools for interaction with machine-learning models.

I'd be interested in learning more about this implementation too, if you can share!

nikinov commented 1 year ago

Same here