AdolfVonKleist opened 2 months ago
So I dug into this a bit more and was able to confirm that basically two things are happening when I use the websocket connection and the faster-whisper version (I assume it's the same for TensorRT but cannot verify): `.transcribe()` is called once for every `input_sample` via `transcribe_audio`; however, the call

```python
segments = self.generate_segments(features, tokenizer, options, encoder_output)
```

never results in any internal iteration, no matter how long I stream audio to the transcriber. This conditional is also executed once for every `input_sample`:
```python
if options.initial_prompt is not None:
    if isinstance(options.initial_prompt, str):
        initial_prompt = " " + options.initial_prompt.strip()
        initial_prompt_tokens = tokenizer.encode(initial_prompt)
        all_tokens.extend(initial_prompt_tokens)
    else:
        all_tokens.extend(options.initial_prompt)
```
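To illustrate the behavior I'm describing, here is a minimal, self-contained sketch of that conditional in isolation (the `encode` stub and the `seed_prompt_tokens` name are my own stand-ins, not faster-whisper code):

```python
# Stand-in for faster-whisper's tokenizer.encode: map each
# whitespace-separated word to a fake integer token id.
def encode(text):
    return [sum(map(ord, w)) for w in text.split()]

def seed_prompt_tokens(initial_prompt, all_tokens):
    """Mirror of the quoted conditional: the prompt is tokenized and
    appended to all_tokens exactly once per .transcribe() call."""
    if initial_prompt is not None:
        if isinstance(initial_prompt, str):
            prompt = " " + initial_prompt.strip()
            all_tokens.extend(encode(prompt))
        else:
            # Already a token list: extend as-is.
            all_tokens.extend(initial_prompt)
    return all_tokens

# Each streamed clip triggers a fresh .transcribe() call, so in streaming
# mode the prompt is re-seeded from scratch every time instead of being
# extended with earlier transcription output.
tokens = seed_prompt_tokens("SupaSqrrl DIE-namics", [])
```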
The result is that even when 'turned on', the context is never extended with earlier content: the conditional runs once for each new clip with the same `initial_prompt` value. I looked at instead tracking the 'last_segment' returned to `transcribe_audio` in the client, as well as sending the global current `timestamp_offset`, in order to see how I might change/impact the results.
If I send the `initial_prompt` only during the first 10-20s of the stream it works well; otherwise it starts to override the content of the audio. I also tried sharing the 'last_segment' by extending `transcribe`:
```python
result, info = self.transcriber.transcribe(
    input_sample,
    timestamp_offset=self.timestamp_offset,  # added to track global state in transcribe
    last_segment=self.last_segment,  # added to track the 'latest' text segment in transcribe
    initial_prompt=self.initial_prompt,
    language=self.language,
    task=self.task,
    vad_filter=self.use_vad,
    vad_parameters=self.vad_parameters if self.use_vad else None,
)
self.last_segment = result
```
This worked a little better, but unfortunately seemed to result in a lot of new 'gaps' in the STT results, presumably because the `last_segment` I'm providing here is not necessarily aligned with the previous clip? In any case, for the moment it appears to be a bust. It's a shame, because in the non-streaming/batch version this feature is amazingly robust. Here it seems there is at least no 'quick fix' as I had hoped.
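One workaround consistent with the 10-20s observation above would be to gate the prompt client-side so it is only sent while the stream is young, then dropped. A minimal sketch; the `PromptGate` helper and the 20-second cutoff are my own assumptions, not WhisperLive API:

```python
class PromptGate:
    """Hand out initial_prompt only while the stream is younger than
    window_s seconds; afterwards return None so later clips are not
    re-biased toward the prompt. (Hypothetical helper, not WhisperLive.)"""

    def __init__(self, prompt, window_s=20.0):
        self.prompt = prompt
        self.window_s = window_s

    def for_offset(self, timestamp_offset):
        # timestamp_offset: seconds of audio already transcribed.
        return self.prompt if timestamp_offset < self.window_s else None

gate = PromptGate("SupaSqrrl DIE-namics", window_s=20.0)
```

The client would then pass `gate.for_offset(self.timestamp_offset)` instead of a fixed `self.initial_prompt`, so the bias only applies to the opening clips.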
It may just be a need to more carefully time-align the 'most recent' partial output with the current clip, like the infrastructure in `transcribe` implies, but as far as I can tell from my tests over the last day or so, this is currently never actually activated. Maybe there's something else I'm missing here as well.
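For reference, batch-mode Whisper conditions on previous text by keeping a rolling window of recently decoded tokens (roughly half of the 448-token text context) as the next window's prompt. Something similar would have to be maintained across clips in the streaming client. A rough, hypothetical sketch of that bookkeeping, not part of WhisperLive:

```python
from collections import deque

MAX_PROMPT_TOKENS = 224  # roughly half of Whisper's 448-token text context

class RollingPrompt:
    """Accumulate decoded tokens across clips and expose only the most
    recent maxlen tokens as the prompt for the next clip.
    (Hypothetical client-side bookkeeping.)"""

    def __init__(self, maxlen=MAX_PROMPT_TOKENS):
        # deque with maxlen silently discards the oldest tokens.
        self._tokens = deque(maxlen=maxlen)

    def extend(self, new_tokens):
        self._tokens.extend(new_tokens)

    def prompt_tokens(self):
        return list(self._tokens)

# Small maxlen just to demonstrate the trimming behavior.
rp = RollingPrompt(maxlen=4)
rp.extend([1, 2, 3])
rp.extend([4, 5, 6])
```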
Good day, sir. Do you have any more observations on this issue? (I do not see this issue in real-time transcription from the microphone.) By the way, just a question: where is the code below?

```python
if options.initial_prompt is not None:
    if isinstance(options.initial_prompt, str):
        initial_prompt = " " + options.initial_prompt.strip()
        initial_prompt_tokens = tokenizer.encode(initial_prompt)
        all_tokens.extend(initial_prompt_tokens)
    else:
        all_tokens.extend(options.initial_prompt)
```
@zeliang3 it is here: https://github.com/collabora/WhisperLive/blob/be71657397b6c51fcfa2e760aacfc2f5f71bae9d/whisper_live/transcriber.py#L464
I haven't had a chance to look at it closely again. I see it constantly over the websocket; I'm using it in streaming mode over a websocket in a ReactJS web application. Can you provide a minimal usage example for your microphone-based approach? I have not tried this myself. Maybe I'll have better luck comparing it against a working alternative.
I'll be happy to invest another day or so in this and provide a pull request if I can suss it out; but I either need a bit more free time, or some kind of hint.
Just call `client()` and it will use the current microphone, @AdolfVonKleist:
```python
from whisper_live.client import TranscriptionClient

client = TranscriptionClient(
    "192.168.1.100",
    9090,
    translate=False,
    model="large",
    save_output_recording=True,
    output_recording_filename="./output_recording.wav",
)
client()
```
I've been using WhisperLive with great success recently in multiple languages. Seriously amazing. I recently noticed the support for `initial_prompt`, which was added in January, and tried applying it to my use case.

I have noticed that while the `initial_prompt` value works amazingly well during the first 10-20s of a conversation, beyond this point it suddenly starts to completely override the input audio. For example, I'll specify a 'corrected' spelling for a company name: SupaSqrrl DIE-namics instead of Super Squirrel Dynamics. In the first 20s, any utterances of this phrase will be perfectly transcribed according to the `initial_prompt` value I've added: SupaSqrrl DIE-namics. However, as the conversation progresses, this boosted phrase starts to override all other input speech, and the recognizer just ends up outputting the `initial_prompt` over and over again.

I thought maybe the prompt was being provided repeatedly somewhere in the code, but after a cursory review of the source I didn't see anything like that.
I'm wondering if anyone else has experienced something similar?
edit: I can also confirm that I don't see this behavior in longer files when I transcribe in batch mode with whisperx or faster-whisper.