Suggestions for improving real-time performance?

xoxfaby commented 1 year ago

My particular use case requires me to have access to the transcription of a fully uttered sentence as quickly as possible, as it is being uttered.

Are there any optimizations I can make to how I use faster-whisper that will benefit this?

Currently I detect the speech and, once it is finished, give the whole clip to faster_whisper to transcribe. This is a problem for long sentences, because I have to wait until the end of the sentence before the beginning gets processed. Is there a reasonable way to feed the beginning into faster_whisper before the rest of the sentence is done, without sacrificing accuracy?

My understanding of the model is limited, so I may be missing something obvious. For example, I don't know whether the model technically needs the whole utterance before it can begin processing, or whether it could do the work as the audio comes in. Perhaps that just isn't normally done because it isn't a concern when you load audio at the rate it can be read rather than the rate at which it is spoken.

Would love any input on this.

phineas-pta commented 1 year ago

Internally, audio is cut into 30 s chunks before being fed to the model.

So as a first try, you can send data every 30 s.
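
To make the 30 s chunking concrete, here is a rough sketch of the arithmetic. It assumes a simplified picture in which audio is processed in consecutive fixed 30 s windows and the last window is padded; the real decoder moves through the audio based on predicted timestamps, so treat the numbers as approximate. The `window_stats` helper is made up for illustration and is not part of faster-whisper.

```python
import math
from typing import Tuple

WINDOW_SECONDS = 30  # window length mentioned above


def window_stats(clip_seconds: float) -> Tuple[int, float]:
    """Return (number of 30 s windows, fraction of those windows that is padding).

    Simplifying assumption: fixed, non-overlapping windows, with the final
    window padded up to the full 30 s.
    """
    windows = max(1, math.ceil(clip_seconds / WINDOW_SECONDS))
    padded_seconds = windows * WINDOW_SECONDS
    return windows, (padded_seconds - clip_seconds) / padded_seconds


# A 3 s clip still occupies one full window, most of which is padding.
windows, padding = window_stats(3.0)
print(windows, round(padding, 2))
```

Under this simplified model, a 1-5 s clip fills only a small part of one window, so sending short clips more often should not be blocked by the chunk size itself.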

xoxfaby commented 1 year ago

I already send audio significantly more frequently than that, the clips are usually 1-5 seconds long.

(Sent by email in reply to Phan Tuấn Anh's comment of Tue, Jul 11, 2023, 15:54.)

phineas-pta commented 1 year ago

You can take inspiration from the various repos shared in openai/whisper#2.

jnhck commented 1 year ago

You do not necessarily need to wait for a sentence to finish before you start transcribing. You can start with a shorter sequence and send more audio as it becomes available. Accuracy will of course improve as the sentence becomes more complete, but you can get good results even before then (at least with the largest model).
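
A minimal sketch of that idea, assuming 16 kHz mono audio arriving in small pieces: keep a growing buffer and re-transcribe all of it each time enough new audio has arrived, so later results replace earlier ones. The `incremental_transcripts` function and its `min_new_seconds` parameter are hypothetical names for illustration, and the transcriber is passed in as a plain callable so the loop itself does not depend on any particular library.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

SAMPLE_RATE = 16_000  # assumption: 16 kHz mono float samples


def incremental_transcripts(
    pieces: Iterable[Sequence[float]],
    transcribe: Callable[[List[float]], str],
    min_new_seconds: float = 1.0,
) -> Iterator[str]:
    """Yield a fresh transcript of everything heard so far.

    The whole buffer is re-transcribed each time at least `min_new_seconds`
    of new audio has arrived, and once more at the end for any remainder.
    """
    buffer: List[float] = []
    pending = 0  # samples added since the last transcription
    for piece in pieces:
        buffer.extend(piece)
        pending += len(piece)
        if pending >= min_new_seconds * SAMPLE_RATE:
            pending = 0
            yield transcribe(list(buffer))  # copy, so the callee sees a stable snapshot
    if pending:
        yield transcribe(list(buffer))


# One possible way to plug in faster-whisper (not executed here):
#   import numpy as np
#   from faster_whisper import WhisperModel
#   model = WhisperModel("large-v2")
#   def transcribe(samples):
#       segments, _info = model.transcribe(np.asarray(samples, dtype=np.float32))
#       return "".join(segment.text for segment in segments)
```

Each yielded string supersedes the previous one, so the caller should treat early results as provisional. The trade-off is repeated work: the start of the sentence is transcribed several times in exchange for having a usable result sooner.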

SYSTRAN / faster-whisper

Suggestions for improving real-time performance? #348