Closed BahamutRU closed 3 months ago
How fast is it for a short phrase (e.g. "How are you?")? I have to check, but I don't think it will be faster than the 0.22 s I managed to get with distilled whisper.cpp medium in English.
I have tested it. With default settings, faster-whisper is slightly slower than whisper.cpp in my project for short phrases: I am getting 0.26 s for faster-whisper and 0.23 s for whisper.cpp. whisper.cpp uses 2.0 GB of VRAM, faster-whisper 1.2 GB.
For long phrases faster-whisper is better, but the main use case in my project is transcribing short phrases.
Code for distil-medium.en (the non-distilled model is slower and uses even more VRAM):
```python
from faster_whisper import WhisperModel
import time

model_size = "distil-medium.en"
model = WhisperModel(model_size, device="cuda", compute_type="float16")

# transcribe() is lazy: decoding happens while iterating over segments,
# so the end timestamp is printed after the loop.
segments, info = model.transcribe("audio.wav", beam_size=5, language="en",
                                  condition_on_previous_text=False)
print(time.time())
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())

# Repeat twice more to measure warm runs.
segments, info = model.transcribe("audio.wav", beam_size=5, language="en",
                                  condition_on_previous_text=False)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())

segments, info = model.transcribe("audio.wav", beam_size=5, language="en",
                                  condition_on_previous_text=False)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())
```
```
>python test.py
1713012570.1575649
[0.00s -> 2.00s] fucking music or what?
1713012570.5819616
[0.00s -> 2.00s] fucking music or what?
1713012570.8380058
[0.00s -> 2.00s] fucking music or what?
1713012571.093508
```
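A note on the timing in the script above: faster-whisper's `transcribe()` returns a lazy generator, so the call itself comes back almost immediately and the actual decoding happens while iterating over `segments`. A minimal stand-in (no model or GPU needed, `fake_transcribe` is purely illustrative) showing why the final `print(time.time())` must come after the loop:

```python
import time

def fake_transcribe():
    """Stand-in for faster-whisper's lazy transcribe():
    the heavy work runs during iteration, not at call time."""
    def gen():
        time.sleep(0.2)  # placeholder for GPU decoding
        yield "segment"
    return gen()

t0 = time.perf_counter()
segments = fake_transcribe()   # returns almost instantly
t_call = time.perf_counter() - t0
text = list(segments)          # decoding actually happens here
t_total = time.perf_counter() - t0
print(f"call: {t_call:.3f}s, full decode: {t_total:.3f}s")
```

So when benchmarking, measure around the iteration over `segments`, not just around the `transcribe()` call.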
There is also WhisperX, which can do inference in batches, but it uses a lot of VRAM.
> 0.26 s for faster-whisper and 0.23 s for whisper.cpp. whisper.cpp uses 2.0 GB of VRAM and faster-whisper 1.2 GB.
Hm, I see. You split the whole speech into small pieces for faster reaction? So this 0.03 s turns into a multiplied delay? And VRAM size is not a priority?
Okay, it's your business. =)
Sorry, you're right.
GL!
And about XTTS: if you use streaming, the quality gets worse. =) But there's no other option, I understand…
Thank you both for this discussion. I've added highlights to the README: https://github.com/dmikushin/mozer-llama-fast/commit/895c324710c8bc627e0bc54c162e19adde625086
Rofl? Why not faster-whisper? Faster, smaller, better.
And streaming mode is terrible, sure. But, for speed…
The software is plain and simple, but all-in-one.
Try faster-whisper. =)