elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)

Stream audio chunk by chunk to Whisper #261

Closed: mat-hek closed this issue 6 months ago

mat-hek commented 12 months ago

Hey, it's already possible to make the serving provide data to the model in chunks; however, it seems that the whole audio still has to be available at once, which isn't possible with live streaming. Would it be possible to support streaming the audio to the serving chunk by chunk?

josevalim commented 12 months ago

We do support streaming in the latest Bumblebee, but only for files. We would need to improve the API so you can pass your own stream to Whisper and then we transform it. :) So it is not possible yet, but we may be 90% there.

mat-hek commented 12 months ago

Hmm, correct me if I'm wrong, but it seems you load the entire file into memory upfront here :D

josevalim commented 12 months ago

We do, but that could be worked out by doing multiple ffmpeg calls. My point is that the complexity now is in the stream composition/processing, not in Nx/Axon/etc. And the former is much easier!

mat-hek commented 12 months ago

So it's just about rewriting client_preprocessing here from Enum to Stream? How would batching be handled then?

josevalim commented 12 months ago

The serving does the batching, although ideally you want to chunk the stream to match the server batch size too.

jonatanklosko commented 12 months ago

Yeah, reading chunks separately from disk is on my list, but it's just an optimisation so we released without it.

As for accepting a stream as serving input, it's a bit different, but definitely doable. Note that for large audio when we split into multiple chunks, we make the chunks overlap. So for a 50s audio we would transcribe like 0-30 and 20-50, this way each transcription has some context and we merge the overlaps accordingly. So if we are given a stream, we need to accumulate until the right size and emit overlapping chunks.
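
To make the accumulate-and-overlap idea concrete, here is a rough editorial sketch (not Bumblebee's actual implementation) of turning a stream of raw PCM binaries into overlapping windows, e.g. 30s windows emitted every 20s so neighbouring windows share 10s, matching the 0-30 / 20-50 example. The module name, byte sizes (16kHz mono s16le), and defaults are illustrative assumptions:

```elixir
defmodule OverlapChunker do
  # 16kHz mono signed 16-bit PCM: 32,000 bytes per second (illustrative).
  @bytes_per_second 16_000 * 2

  # Turns a stream of arbitrarily sized PCM binaries into overlapping windows:
  # with the defaults, 30s windows that advance by 20s, so consecutive windows
  # share a 10s overlap.
  def stream(pcm_stream, window_seconds \\ 30, step_seconds \\ 20) do
    window = window_seconds * @bytes_per_second
    step = step_seconds * @bytes_per_second

    Stream.transform(pcm_stream, <<>>, fn data, buffer ->
      emit(buffer <> data, window, step, [])
    end)
    # Note: the final partial buffer is never emitted in this sketch; a real
    # version would flush it once the input stream ends.
  end

  defp emit(buffer, window, step, acc) when byte_size(buffer) >= window do
    <<head::binary-size(window), _::binary>> = buffer
    <<_::binary-size(step), rest::binary>> = buffer
    emit(rest, window, step, [head | acc])
  end

  defp emit(buffer, _window, _step, acc), do: {Enum.reverse(acc), buffer}
end
```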

mat-hek commented 12 months ago

this way each transcription has some context and we merge the overlaps accordingly

Yeah, that's actually the reason I'd like to stream to the serving instead of running it for each chunk (as I do now). To make it 'live', I'd need chunks that are at most a few seconds long, ~but from the docs I see that the default is 5 seconds ;)~ hmm, I don't know where I found these 5 seconds, it seems it's just what I set 🤔

jonatanklosko commented 12 months ago

hmm, I don't know where I found these 5 seconds, it seems it's just what I set

The default context is 1/6 of the chunk length; for Whisper the chunk is 30s, so the context is 5s (on both sides, so it's a 10s overlap).

I'm not sure we can reasonably handle arbitrarily small chunks (especially since we use context, which would then be very small). So I would imagine we accumulate the first 30s, then the next 20s, the next 20s, and so on.
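
As a quick worked example of that arithmetic (an editorial note, using the numbers from the comment above):

```elixir
chunk_num_seconds = 30
# Default context is 1/6 of the chunk length:
context_num_seconds = chunk_num_seconds / 6
#=> 5.0 (seconds on each side, i.e. a 10s overlap between neighbouring chunks)

# Each new chunk therefore adds chunk - 2 * context seconds of fresh audio,
# matching the "first 30s, then the next 20s, the next 20s" accumulation:
step_num_seconds = chunk_num_seconds - 2 * context_num_seconds
#=> 20.0
```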

mat-hek commented 12 months ago

Small chunks still work pretty well IMO, check Lars's talk where he has a live transcription on slides. In my experience, the accuracy drops for sentences longer than a chunk length, so I guess context could help here. We can actually provide a lot of 'previous' / 'left-side' context without sacrificing latency. The right-side context would impact latency, but maybe even 1 or 2 seconds could help, as we wouldn't break words apart.

jonatanklosko commented 12 months ago

I would imagine we accumulate first 30s, then next 20s, next 20s.

Ah, we should accumulate whatever chunk_length is, so yeah, it could just as well be smaller.

josevalim commented 11 months ago

Yeah, we can probably transform the stream to either split or accumulate up to the batch size. We can also just do nothing and tell the user that whatever audio size they pass will be sent as is, so the buffering is up to them. The latter is the most flexible and likely the simpler option too.
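
As an illustration of the "buffering is up to the user" option, the caller could group their own sample stream into fixed-size chunks before handing them to the serving. This is an editorial sketch; `sample_stream` (a stream of individual float samples) and the sizes are hypothetical:

```elixir
# Roughly 5 seconds of 16kHz audio per chunk (illustrative numbers).
samples_per_chunk = 5 * 16_000

buffered =
  sample_stream
  |> Stream.chunk_every(samples_per_chunk)
  |> Stream.map(&Nx.tensor/1)
```

The serving would then receive uniformly sized tensors and could batch them as usual.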

jonastemplestein commented 11 months ago

Amazing! I have a little toy project that could really use this (literally, a toy for my daughter that she can speak to).

For my use-case, it is important to minimise the latency after somebody is finished speaking.

Once I detect silence on my end, I'd like to say to bumblebee to "force a chunk", even if it's only been a short time since the last chunk was transcribed.

It would also be really useful to send not just transcribed words to the caller, but also whether or not those words have been "confirmed" by later context (or perhaps the "confidence" in the transcribed words). Whisper is quite good at creating a best-guess transcription from a short chunk, and often that is good enough to use speculatively. For example, in the context of my voice agent, I might detect silence, force Whisper to do what I assume to be a final chunk, and send the resulting preliminary transcription onwards to an LLM. But it may turn out the speaker was just pausing briefly and then resumes speaking. I'd then keep transcribing, and if that further context changes the words I already sent to the LLM, I'd abort the LLM call (provided it hasn't been read to the user yet) and redo it with the new, more correct transcription.

For this to work well, it's valuable to think about how transcripts from overlapping chunks are merged (and how the chunk boundaries are chosen). A good example in the python ecosystem is here: https://github.com/ufal/whisper_streaming

Lots of companies are trying to build low latency voice agents at the moment and I think Elixir would be a great choice for building them, if it had a great realtime transcription implementation. Ideally this would eventually include word-level timestamps and multi-speaker diarization. @jonatanklosko do you know of any efforts in the Elixir community to do this?

BTW regarding the multiple ffmpeg calls per chunk, I think you can probably have a single ffmpeg process that you stream in and out of using stdin and stdout. That would also slightly reduce the latency cost of "booting" an ffmpeg process for each chunk.
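
For reference, a minimal sketch of the single long-lived ffmpeg process idea via an Elixir Port, assuming ffmpeg is installed and on the PATH; `encoded_chunk` is a placeholder for whatever compressed audio arrives from the client:

```elixir
# Decode whatever container/codec arrives on stdin into 16kHz mono f32le PCM
# on stdout, keeping a single ffmpeg process alive for the whole session.
port =
  Port.open({:spawn_executable, System.find_executable("ffmpeg")}, [
    :binary,
    :exit_status,
    args: ~w(-loglevel quiet -i pipe:0 -f f32le -ac 1 -ar 16000 pipe:1)
  ])

# Feed encoded audio in as it arrives...
Port.command(port, encoded_chunk)

# ...and collect decoded PCM as ffmpeg emits it.
receive do
  {^port, {:data, pcm}} -> pcm
end
```

Depending on the container, ffmpeg may buffer a fair amount of input before it starts emitting output, so a raw or streamable input format helps keep latency down.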

josevalim commented 11 months ago

If we stream, we will likely expect pcm chunks, so the ffmpeg conversion would be up to you (which you can do with a live process or even a NIF). @mat-hek and the membrane folks will likely have better ideas here.

mat-hek commented 11 months ago

If we stream, we will likely expect pcm chunks

Seems very reasonable

conversion would be up to you

You can use Membrane for that too 😄 here's a PR with a Livebook example: https://github.com/membraneframework/membrane_demo/pull/249

lawik commented 11 months ago

Is there a difference between it accepting a real stream and repeatedly calling it with the chunk size you want processed?

I guess the current functionality for improving the edges of chunks with overlap and so on suffers when I just send it exactly-sized chunks?

@mat-hek as you would know it is not particularly hard to get an appropriate slice of PCM to send it out of Membrane :D.

jonatanklosko commented 11 months ago

@lawik the idea is that we get a stream of continuous chunks, but we would still do overlapping as part of preprocessing and then merging in postprocessing to improve the output.

lawik commented 11 months ago

Awesome!

linusdm commented 11 months ago

Is this discussion targeted at enabling Whisper specifically? Or will these improvements also allow other, more general audio processing models (e.g. audio classification models) to benefit from this streaming solution?

jonatanklosko commented 11 months ago

@linusdm Whisper is currently the only audio model we support. I'm not sure how relevant input streaming is for classification models, since they predict a single label rather than streaming transcription.

jonatanklosko commented 6 months ago

#361 enables input streaming.

Thinking more about this, I'm not entirely sure the context overlapping algorithm is going to be very effective with small chunks (as needed for live transcription). The way the algorithm works is that we transcribe two subsequent overlapping chunks of audio, and they should result in two sentences that overlap to some extent at the edges. Then we merge the overlaps to hopefully get the right transcription from the left chunk and from the right chunk. The issue with small chunks is that the sentences are short and there may be very few, if any, overlapping words. Also note that this means an additional delay, because in order to finish a chunk, we need the transcription of the subsequent overlapping chunk.

So for short chunks it may be better to not use the overlapping chunking and have some other logic, such as splitting input at low amplitude points to avoid cutting mid-word.

These are just high-level thoughts though!
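
As a toy illustration of the "split at low-amplitude points" idea (an editorial sketch, not part of Bumblebee), one could flag a short window of samples as a candidate split point when its mean absolute amplitude falls below a hand-tuned threshold:

```elixir
defmodule SplitPoint do
  # `window` is an Nx tensor of float samples (e.g. the last ~200ms of audio);
  # the threshold is arbitrary and would need tuning against real input.
  def silent?(window, threshold \\ 0.01) do
    window
    |> Nx.abs()
    |> Nx.mean()
    |> Nx.to_number()
    |> Kernel.<(threshold)
  end
end

# SplitPoint.silent?(Nx.tensor(samples))
```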

samrat commented 6 months ago

Hello,

I'm trying to use this in a Livebook using kino_live_audio: https://gist.github.com/samrat/fc5792bfc870ad887f29d4a944cafd7d . I'm passing a Stream to the serving, but I'm not seeing any output. Could you help me figure out what I'm doing wrong?

jonatanklosko commented 6 months ago

@samrat the main issue is that you are doing Enum.map instead of Stream.map, so the stream is consumed eagerly at that point and blocks further execution :) Here's a more minimal example:

.livemd

````markdown
# Streaming whisper

```elixir
Mix.install(
  [
    {:kino_live_audio, "~> 0.1"},
    {:nx, "~> 0.7.1"},
    {:bumblebee, github: "elixir-nx/bumblebee"},
    {:exla, ">= 0.0.0"},
    {:kino, github: "livebook-dev/kino", override: true}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)
```

## Section

```elixir
{:ok, model_info} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text_whisper(
    model_info,
    featurizer,
    tokenizer,
    generation_config,
    compile: [batch_size: 1],
    chunk_num_seconds: 6,
    context_num_seconds: 2,
    stream: true,
    defn_options: [compiler: EXLA]
  )

Kino.start_child({Nx.Serving, serving: serving, name: WhisperServing})
```

```elixir
liveAudio = KinoLiveAudio.new(chunk_size: 1, unit: :s, sample_rate: featurizer.sampling_rate)
```

```elixir
audio_stream =
  liveAudio
  |> Kino.Control.stream()
  |> Stream.map(fn %{chunk: data} ->
    Nx.tensor(data)
    |> Nx.stack()
    |> Nx.reshape({:auto, 1})
    |> Nx.mean(axes: [1])
  end)

frame = Kino.Frame.new() |> Kino.render()

for chunk <- Nx.Serving.batched_run(WhisperServing, audio_stream) do
  Kino.Frame.append(frame, Kino.Text.new(chunk.text, chunk: true))
end
```
````

Sidenote: if you look at the console logs and the chunks are not being produced, it may be because the page was denied microphone access.