Can you upload the audio to reproduce?
So I'm using it to do live streaming with Whisper, hence my wanting to use the medium model for better latency. This means I'm using a 15-second rolling window of my mic input. I'm using the following video for testing: https://www.youtube.com/watch?v=kYnNSORARFk. I've fixed the complete-nonsense output by actually improving data quality (I was converting my mic input suboptimally; large-v2 could apparently still decipher it, but medium couldn't). What I'm running into now is that the transcriptions seem to cycle between normal transcriptions of the 15 seconds of audio and very shortened versions of it. I'm using the following code in combination with the above to get the output as a string:
class WordWithTimestamp:
    def __init__(self, word, start, end):
        self.word = word
        self.start = start
        self.end = end

    def __str__(self):
        return self.word

# text is the list of WordWithTimestamp objects collected from the transcription
"".join(str(word) for word in text)
Then for the first ~15s of the clip I linked I get, alternating between runs, outputs like:

"So give the president a chance. Governor romney, i'm glad that you recognize that al qaeda is a threat. Because a few months ago, when you were asked what's the biggest geopolitical threat facing america, you said russia, not al qaeda."
"So give the president a chance."
"So"
"The"

Any ideas on why this happens? I'm starting to think that maybe the batching output is different from the normal output, and that the reason I'm getting this problem with medium but not large-v2 is that medium lets my GPU take better advantage of batching (I'm running it on a laptop 3060 with 6 GB of VRAM).
Batching will not be useful for live transcription unless you are doing it over multiple streams/files; also check this.
Alright, intuitively that makes sense, but when I used it with the large models it did perform much faster than the unbatched version and gave good results (very similar to the unbatched version). Is there any explanation for that? It feels like there is something there.
And thanks for the link, by the way. I have tried Whisperlive but couldn't get it to work as I'd like for my use case (transcribing meetings). My approach is very similar but incorporates some elements from whisper_streaming. I'm planning to take a look at https://github.com/backspacetg/simul_whisper too.
Somewhat related: sometimes, even when using the unbatched version, faster-whisper will take a very long time to transcribe an audio clip of <15s (think 8-40s, where it usually takes about 1s). I'm assuming this is caused by hallucination issues or fallbacks; are there any settings I can adjust to correct for this behavior? I've noticed it sometimes when transcribing files too, but it's of course more of a problem in streaming attempts.
You can disable fallback by setting temperature to be a single value instead of the default list.
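In faster-whisper that just means passing a single float to transcribe instead of the default list of temperatures, roughly (model and audio as in the snippets above):

# a single temperature disables the fallback over the default
# [0.0, 0.2, 0.4, 0.6, 0.8, 1.0] schedule
segments, info = model.transcribe(audio, temperature=0.0)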
Thanks, that's super useful; it almost completely eliminated the very long transcription times!
@tjongsma I have the same issue and the same application, real-time transcription with ~3s chunks that sometimes takes a long time. How did you fix it? Thank you
Setting beam_size=5, temperature=0, and max tokens=224 worked for me! Let me know if it does for you too.
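For completeness, in my setup that amounts to a transcribe call roughly like this (window_audio is just my rolling buffer, and I'm assuming "max tokens" maps to the max_new_tokens option; adjust for your faster-whisper version):

segments, info = model.transcribe(
    window_audio,
    beam_size=5,
    temperature=0.0,      # single value, so no temperature fallback
    max_new_tokens=224,   # assuming "max tokens" refers to max_new_tokens
)
print("".join(segment.text for segment in segments))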
Like the title implies, when using the batched commits with the medium model, the output is nonsense (empty, repeats the initial prompt, or says "I'm sorry"). I'm using something along the following lines:
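(The exact snippet isn't reproduced here; batched inference with the medium model looks roughly like this, assuming the BatchedInferencePipeline API from the batching branch, with an illustrative input file and batch size:)

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

segments, info = batched_model.transcribe("audio.wav", batch_size=16)
for segment in segments:
    print(segment.text)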
It works fine with both large-v2 and large-v3. Any idea as to why and/or a way to fix it? Thank you!