collabora / WhisperLive

A nearly-live implementation of OpenAI's Whisper.
MIT License

"initial_prompt" appears to progressively override audio for longer streams #278

Open AdolfVonKleist opened 2 months ago

AdolfVonKleist commented 2 months ago

I've been using WhisperLive with great success recently in multiple languages. Seriously amazing. I recently noticed the initial_prompt support that was added in January and tried applying it to my use case.

I have noticed that while the initial_prompt value works amazingly well during the first 10-20s of a conversation, once we get beyond that point it starts to completely override the input audio.

For example, I'll specify a 'corrected' spelling for a company name: SupaSqrrl DIE-namics instead of Super Squirrel Dynamics. In the first 20s, any utterances of this phrase are transcribed exactly according to the initial_prompt value I've added: SupaSqrrl DIE-namics. However, as the conversation progresses, this boosted phrase starts to override all other input speech and the recognizer just ends up outputting the initial_prompt over and over again.
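To illustrate the setup (a sketch only; I actually send the equivalent option over the websocket from a web client, and I'm assuming initial_prompt is the keyword the January change added to TranscriptionClient, so check the signature in your version):

```python
from whisper_live.client import TranscriptionClient

# Sketch only: 'initial_prompt' is assumed to be the keyword added in the
# January change; the other arguments mirror the documented client usage.
client = TranscriptionClient(
    "localhost",
    9090,
    translate=False,
    model="small",
    initial_prompt="SupaSqrrl DIE-namics",  # corrected spelling I want the decoder to prefer
)

client()  # streams from the microphone and applies the prompt to each clip
```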

I thought maybe the prompt was being provided repeatedly somewhere in the code, but after a cursory review of the source I didn't see anything like that.

I'm wondering if anyone else has experienced something similar?

edit: I can also confirm that I don't see this behavior in longer files when I transcribe them in batch mode with whisperx or faster-whisper.

AdolfVonKleist commented 2 months ago

So I dug into this a bit more and was able to confirm that basically two things are happening when I use the websocket connection and the faster-whisper backend (I assume it's the same for TensorRT, but cannot verify):

  1. This loop is called repeatedly for each new set of samples sent to `.transcribe()`; the call `segments = self.generate_segments(features, tokenizer, options, encoder_output)` made via `transcribe_audio` never results in any internal iteration, no matter how long I stream audio to the transcriber.

  2. This conditional is executed once for every input_sample as well:

```python
        if options.initial_prompt is not None:
            if isinstance(options.initial_prompt, str):
                initial_prompt = " " + options.initial_prompt.strip()
                initial_prompt_tokens = tokenizer.encode(initial_prompt)
                all_tokens.extend(initial_prompt_tokens)
            else:
                all_tokens.extend(options.initial_prompt)
```

The result is that even when 'turned on', the context is never extended with earlier content; the conditional simply runs once for each new clip with the `initial_prompt` value.
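As a point of contrast, this is my mental model of how context normally accumulates inside a single batch-mode transcribe() call, which would explain why batch runs don't show the problem. It's a heavily simplified sketch: decode_fn is a stand-in for the real decoder, and none of this is the actual faster-whisper or WhisperLive code.

```python
def sketch_batch_context(windows, initial_prompt_tokens, decode_fn,
                         condition_on_previous_text=True):
    """Illustrative only: how decoded tokens from earlier windows are folded
    back into the prompt across one batch-mode transcription pass."""
    all_tokens = list(initial_prompt_tokens)
    prompt_reset_since = 0
    results = []
    for features in windows:
        prompt_tokens = all_tokens[prompt_reset_since:]   # previous output + initial prompt
        tokens = decode_fn(features, prompt_tokens)        # decode the next audio window
        results.append(tokens)
        if condition_on_previous_text:
            all_tokens.extend(tokens)                      # context keeps growing
        else:
            prompt_reset_since = len(all_tokens)           # drop earlier output from the prompt
    return results
```

In the streaming path, each small clip is a fresh transcribe() call, so that accumulation never happens and only the initial_prompt tokens are ever in the context.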

I looked at instead tracking the 'last_segment' returned to `transcribe_audio` in the client, as well as sending the global current timestamp_offset, to see how I might change the results. If I send the initial_prompt only during the first 10-20s of the stream, it works well; otherwise it starts to override the content of the audio. I also tried sharing the 'last_segment' by extending `transcribe`:
```python
        result, info = self.transcriber.transcribe(
            input_sample,
            timestamp_offset=self.timestamp_offset,  # added to track global state in transcribe
            last_segment=self.last_segment,  # added to track 'latest' text segment in transcribe
            initial_prompt=self.initial_prompt,
            language=self.language,
            task=self.task,
            vad_filter=self.use_vad,
            vad_parameters=self.vad_parameters if self.use_vad else None)
        self.last_segment = result
```

This worked a little better, but unfortunately it seemed to introduce a lot of new 'gaps' in the STT results, presumably because the last_segment I'm providing here is not necessarily aligned with the previous clip. In any case, for the moment it appears to be a bust. It's a shame, because in the non-streaming (batch) version this feature is amazingly robust; here there seems to be no 'quick fix' as I had hoped.

It may just be a matter of more carefully time-aligning the 'most recent' partial output with the current clip - as the infrastructure in transcribe implies - but as far as I can tell from my tests over the last day or so, that path is never actually activated at the moment.
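Concretely, something like the following is what I have in mind (purely illustrative; last_segment here is a hypothetical dict with 'end' and 'text' keys, and the 0.5 s tolerance is arbitrary):

```python
def build_prompt(initial_prompt, last_segment, clip_start, tolerance=0.5):
    """Fold the previous partial output into the prompt only when it actually
    abuts the current clip; otherwise fall back to the bare initial_prompt.
    Hypothetical helper: last_segment is assumed to be {'end': float, 'text': str}."""
    prompt = (initial_prompt or "").strip()
    if last_segment and abs(clip_start - last_segment["end"]) <= tolerance:
        prompt = (prompt + " " + last_segment["text"].strip()).strip()
    return prompt or None
```

That way the earlier text would only bias the decoder when it is genuinely adjacent to the clip being transcribed, instead of being repeated blindly on every call.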

Maybe there's something else I'm missing here as well.

zeliang3 commented 1 month ago

Good day, sir. Do you have any more observations on this issue? (I do not see this issue with real-time transcription from the microphone.)

By the way, just a question: where is the code below located?

```python
if options.initial_prompt is not None:
    if isinstance(options.initial_prompt, str):
        initial_prompt = " " + options.initial_prompt.strip()
        initial_prompt_tokens = tokenizer.encode(initial_prompt)
        all_tokens.extend(initial_prompt_tokens)
    else:
        all_tokens.extend(options.initial_prompt)
```

AdolfVonKleist commented 1 month ago

@zeliang3 it is here: https://github.com/collabora/WhisperLive/blob/be71657397b6c51fcfa2e760aacfc2f5f71bae9d/whisper_live/transcriber.py#L464

I haven't had a chance to look at it closely again. I see the issue consistently over the websocket; I'm using it in streaming mode over a websocket in a ReactJS web application. Can you provide a minimal usage example for your microphone-based approach? I have not tried that myself; maybe I'll have better luck comparing against a working alternative.

I'll be happy to invest another day or so in this and provide a pull request if I can suss it out; but I either need a bit more free time, or some kind of hint.

zeliang3 commented 1 month ago

Just call client() and it will use the current microphone, @AdolfVonKleist:

```python
from whisper_live.client import TranscriptionClient

client = TranscriptionClient(
    "192.168.1.100",
    9090,
    translate=False,
    model="large",
    save_output_recording=True,
    output_recording_filename="./output_recording.wav"
)

client()
```