SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Medium model output is nonsense for batched pipeline (for short 15s audio clips) #977

Closed: tjongsma closed this issue 2 days ago

tjongsma commented 2 months ago

Like the title implies, when using the batched pipeline with the medium model, the model output is nonsense (empty, repeats the initial prompt, or says 'I'm sorry'). I'm using something along the following lines:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model_size = "medium"
model = WhisperModel(model_size, device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model,
                                         use_vad_model=True)
segments, info = batched_model.transcribe(audio_tensor,
                                          batch_size=24,
                                          word_timestamps=True)

text = []
# Iterate over the segments and store words with timestamps
# (WordWithTimestamp is the small helper class defined further down)
for segment in segments:
    for word in segment.words:
        text.append(WordWithTimestamp(word.word, word.start, word.end))

It works fine with both large-v2 and large-v3. Any idea as to why and/or a way to fix it? Thank you!

MahmoudAshraf97 commented 2 months ago

can you upload the audio to reproduce?

tjongsma commented 2 months ago

So I'm using it to do live streaming with Whisper, hence my wanting to use the medium model for better latency. This means I'm feeding it a 15-second rolling window of my mic input. I'm using the following video for testing: https://www.youtube.com/watch?v=kYnNSORARFk. I've fixed the completely nonsensical output by improving the input quality (I was converting my mic input suboptimally; large-v2 could apparently still decipher it, but medium couldn't). What I'm running into now is that the transcriptions cycle between a normal transcription of the 15 seconds of audio and heavily shortened versions of it. I'm using the following code in combination with the above to get the output as a string (a rough sketch of how the rolling window is maintained follows the class below):

class WordWithTimestamp:
    def __init__(self, word, start, end):
        self.word = word
        self.start = start
        self.end = end

    def __str__(self):
        return self.word

"".join(str(word) for word in text)

Then for the first ~15s of the clip I linked, I alternately get e.g. "So give the president a chance. Governor romney, i'm glad that you recognize that al qaeda is a threat. Because a few months ago, when you were asked what's the biggest geopolitical threat facing america, you said russia, not al qaeda.", then "So give the president a chance.", then "So", then "The". Any ideas on why this happens? I'm starting to think that maybe the batching output is different from the normal output, and that the reason I'm getting this problem with medium but not large-v2 is that medium lets my GPU take advantage of batching more (I'm running it on a laptop 3060 with 6 GB of VRAM).

MahmoudAshraf97 commented 2 months ago

Batching will not be useful for live transcription unless you are doing it over multiple streams/files. Also check this.

tjongsma commented 2 months ago

Alright, intuitively that makes sense, but when I used it with the large models it did perform much faster than the unbatched version and gave good results (very similar to the unbatched output). Is there any explanation for that? It feels like there is something there.

And thanks for the link, by the way. I have tried Whisperlive, but I couldn't get it to work as I'd like for my use case (transcribing meetings). My approach is very similar but incorporates some elements from whisper_streaming. I'm planning to take a look at https://github.com/backspacetg/simul_whisper too.

tjongsma commented 2 months ago

Somewhat related: sometimes, even when using the unbatched version, faster-whisper will take a very long time to transcribe an audio clip of <15s (think 8-40s where it usually takes about 1s). I'm assuming this is caused by hallucination issues or fallbacks; are there any settings I can adjust to correct for this behavior? I've noticed it occasionally when transcribing files too, but it's of course more of a problem in streaming attempts.

MahmoudAshraf97 commented 2 months ago

You can disable the fallback by setting temperature to a single value instead of the default list.
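
For reference, a minimal sketch of what that looks like with the snippet from above (passing a single float instead of the default list (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):

segments, info = model.transcribe(
    audio_tensor,
    temperature=0.0,       # single value: no fallback re-decoding at higher temperatures
    word_timestamps=True,
)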

tjongsma commented 2 months ago

Thanks that's super useful, almost completely eliminated the very long transcribes!

asr-lord commented 2 months ago

> Somewhat related: sometimes, even when using the unbatched version, faster-whisper will take a very long time to transcribe an audio clip of <15s (think 8-40s where it usually takes about 1s). I'm assuming this is caused by hallucination issues or fallbacks; are there any settings I can adjust to correct for this behavior? I've noticed it occasionally when transcribing files too, but it's of course more of a problem in streaming attempts.

@tjongsma I have the same issue and the same application: real-time transcription with ~3s chunks that sometimes takes a very long time. How did you fix it? Thank you

tjongsma commented 2 months ago

Setting beam_size=5, temperature=0 and max tokens=224 worked for me! Let me know if it does for you too.
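
In code, and assuming "max tokens" here maps to the max_new_tokens argument of transcribe (my reading, not confirmed in this thread), that would look roughly like:

segments, info = model.transcribe(
    audio_tensor,
    beam_size=5,
    temperature=0.0,       # single value disables the fallback ladder
    max_new_tokens=224,    # assumed mapping of "max tokens": cap on tokens generated per segment
    word_timestamps=True,
)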