aviks / Whisper.jl

Implementation of OpenAI Whisper model based on whisper.cpp
MIT License

Streaming discussion #7

Open Moelf opened 1 year ago

Moelf commented 1 year ago

So there's not really a streaming API; it's more like a POC in https://github.com/ggerganov/whisper.cpp/blob/master/examples/stream/stream.cpp

the main idea is this:

  1. you start with some buffer (the audio_async is a thin wrapper around a circular buffer) https://github.com/ggerganov/whisper.cpp/blob/70567eff232773d6786c91585d040f53c36b87a4/examples/common-sdl.h#L15

  2. in the !use_vad case, you simply wait until enough audio is available, and audio.get(params.length_ms, pcmf32) copies it into the float32 vector pcmf32

  3. run whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size()) normally

  4. use whisper_full_n_segments(ctx) and whisper_full_get_segment_text(ctx, i) normally

  5. the only difference is that afterwards you want to add the tokens from the last full segment to wparams.prompt_tokens for the next segment
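Steps 1 and 2 can be sketched without any whisper.cpp dependency. The RingBuffer class below is a hypothetical stand-in for audio_async, not the class from common-sdl.h; its names and the millisecond-based get are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for audio_async: a fixed-size circular buffer of
// float samples. This is NOT the whisper.cpp implementation.
class RingBuffer {
public:
    RingBuffer(int sample_rate, int capacity_ms)
        : sample_rate_(sample_rate),
          data_(static_cast<std::size_t>(sample_rate) * capacity_ms / 1000, 0.0f) {
        assert(!data_.empty());  // capacity must hold at least one sample
    }

    // Append new samples, overwriting the oldest ones once the buffer is full.
    void push(const std::vector<float>& samples) {
        for (float s : samples) {
            data_[pos_] = s;
            pos_ = (pos_ + 1) % data_.size();
            len_ = std::min(len_ + 1, data_.size());
        }
    }

    // Copy the most recent `ms` milliseconds into `out`, oldest sample first.
    // If less audio has arrived so far, `out` holds whatever is available.
    void get(int ms, std::vector<float>& out) const {
        const std::size_t wanted = static_cast<std::size_t>(sample_rate_) * ms / 1000;
        const std::size_t n = std::min(len_, wanted);
        out.resize(n);
        const std::size_t start = (pos_ + data_.size() - n) % data_.size();
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = data_[(start + i) % data_.size()];
        }
    }

    std::size_t size() const { return len_; }

private:
    int sample_rate_;
    std::vector<float> data_;
    std::size_t pos_ = 0;  // next write position
    std::size_t len_ = 0;  // number of valid samples currently stored
};
```

In a streaming loop you would push samples from the capture callback, call get with the window length once enough audio has arrived, and hand the resulting vector to whisper_full as in step 3.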

the general idea of the audio buffer is to pad n seconds (n < 30) out to 30s, so as you speak, you're running inference on 1s + 29s of silence, then 2s + 28s of silence, etc., depending on how large step_ms is.
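The padding idea can be illustrated in a few lines. This is only a sketch of the concept; the function name pad_to_window and the constants are made up here, and it makes no claim about where or how whisper.cpp itself does the padding.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr int kSampleRate = 16000;   // Whisper expects 16 kHz audio
constexpr int kWindowSeconds = 30;   // window length described above

// Hypothetical helper: return `audio` followed by silence (zeros) so that
// the result is exactly 30 s long. If more than 30 s is passed in, only the
// most recent 30 s is kept.
std::vector<float> pad_to_window(const std::vector<float>& audio) {
    const std::size_t window =
        static_cast<std::size_t>(kSampleRate) * kWindowSeconds;
    std::vector<float> out(window, 0.0f);  // start with 30 s of silence
    const std::size_t n = std::min(audio.size(), window);
    // Copy the newest n samples to the front; the rest stays silent.
    std::copy(audio.end() - static_cast<std::ptrdiff_t>(n), audio.end(),
              out.begin());
    return out;
}
```

With 1 s of input (16000 samples) the result is 480000 samples long: the speech first, then 29 s of zeros.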


In the use_vad case, there are more pcmf32-related vectors used to swap audio data around (roughly a sliding window): https://github.com/ggerganov/whisper.cpp/blob/70567eff232773d6786c91585d040f53c36b87a4/examples/stream/stream.cpp#L162-L164

pcmf32 and friends hold the actual samples that you copy to and from for direct use

aviks commented 1 year ago

I know @jpsamaroo was experimenting along these lines.

jpsamaroo commented 1 year ago

As a quick overview of what I implemented:

I use PortAudio.jl to provide the input stream in 4-second increments, and write it into a rotating buffer with a total length of 5 seconds (these periods are configurable; they just seem to work for me). I convert all audio to 16 kHz with this code from the README:

# Whisper expects 16kHz sample rate and Float32 data
sout = SampleBuf(Float32, 16000, round(Int, length(s)*(16000/samplerate(s))), nchannels(s))  
write(SampleBufSink(sout), SampleBufSource(s))  # Resample

All this happens continuously in one task, and a copy of the 5-second buffer is put into a Channel every time we sample 4 seconds from the PortAudio stream. In another task, I then transcribe with Whisper using max-threads (will post a PR for wparam configuration momentarily), append the result to an ever-growing string (which is the full transcription from start to end), and print the latest result.

Gist here: https://gist.github.com/jpsamaroo/aff348ae04f392f1e8683b59cbe6bda7

One thing you'll notice in the gist is BAD THINGS HAPPENED, which flags a problem I've observed: sometimes Whisper gets "stuck" and just repeats the same transcription (usually after about 60 seconds of this "streaming" transcription). It's probably just something I'm doing wrong, but if anyone has any ideas on why it happens, I'd love to know!
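For anyone trying to reproduce the "stuck" behaviour, one simple way to flag it is to count consecutive identical results. The RepeatDetector below is a hypothetical helper written for this illustration; it is not taken from the gist and does not explain why the repetition happens.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the gist): tracks how many times in a row
// the transcription has been identical to the previous one.
class RepeatDetector {
public:
    // `max_repeats` should be at least 1; it is the number of consecutive
    // repeats of the previous result that counts as "stuck".
    explicit RepeatDetector(std::size_t max_repeats)
        : max_repeats_(max_repeats) {}

    // Feed the latest transcription. Returns true once the same non-empty
    // text has repeated `max_repeats` times in a row, i.e. it has appeared
    // max_repeats + 1 times consecutively.
    bool update(const std::string& text) {
        if (!text.empty() && text == last_) {
            ++repeats_;
        } else {
            repeats_ = 0;   // new or empty text resets the streak
            last_ = text;
        }
        return repeats_ >= max_repeats_;
    }

private:
    std::size_t max_repeats_;
    std::size_t repeats_ = 0;
    std::string last_;
};
```

Calling update after each transcription and, when it returns true, clearing the carried-over context or the audio buffer is one thing to experiment with; whether that actually helps here is untested.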