Mozer / talk-llama-fast

Port of OpenAI's Whisper model in C/C++ with xtts and wav2lip
MIT License

whisper.cpp #8

Closed BahamutRU closed 3 months ago

BahamutRU commented 3 months ago

Rofl? Why not faster-whisper? Faster, smaller, better.

And streaming mode is terrible, sure. But for speed…

The software is plain and simple, but all-in-one.

Try faster-whisper. =)

Mozer commented 3 months ago

How fast is it for a short phrase (e.g. "How are you?")? I have to check, but I don't think it will be faster than the 0.22s I managed to get with distilled whisper.cpp medium in English.

Mozer commented 3 months ago

I have tested it. With default settings faster-whisper is a little slower than whisper.cpp in my project for short phrases. I am getting 0.26s for faster-whisper and 0.23s for whisper.cpp. whisper.cpp uses 2.0 GB of VRAM and faster-whisper uses 1.2 GB.

For long phrases faster-whisper is better, but the main use case in my project is transcribing short phrases.

Code for distil-medium.en (the non-distilled model is slower and takes even more VRAM):

from faster_whisper import WhisperModel
import time

model_size = "distil-medium.en"

model = WhisperModel(model_size, device="cuda", compute_type="float16")

# transcribe() returns a lazy generator; the actual decoding runs while
# iterating over the segments, so the prints bracket each run.
segments, info = model.transcribe("audio.wav", beam_size=5, language="en", condition_on_previous_text=False)

print(time.time())
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())

# second and third runs to check the warmed-up timings
segments, info = model.transcribe("audio.wav", beam_size=5, language="en", condition_on_previous_text=False)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())

segments, info = model.transcribe("audio.wav", beam_size=5, language="en", condition_on_previous_text=False)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())
>python test.py
1713012570.1575649
[0.00s -> 2.00s]  fucking music or what?
1713012570.5819616
[0.00s -> 2.00s]  fucking music or what?
1713012570.8380058
[0.00s -> 2.00s]  fucking music or what?
1713012571.093508
Mozer commented 3 months ago

There is also WhisperX, which can do inference in batches, but it uses a lot of VRAM.
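
For reference, a minimal sketch of what batched WhisperX inference typically looks like (the model name, batch size and file name here are placeholder example values, not settings from this project):

import whisperx

device = "cuda"
# "large-v2" and batch_size=16 are example values, not tuned for this project
model = whisperx.load_model("large-v2", device, compute_type="float16")

audio = whisperx.load_audio("audio.wav")
# batching is what lets WhisperX trade extra VRAM for throughput
result = model.transcribe(audio, batch_size=16)

for segment in result["segments"]:
    print("[%.2fs -> %.2fs] %s" % (segment["start"], segment["end"], segment["text"]))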

BahamutRU commented 3 months ago

0.26s for faster whisper and 0.23s for whisper.cpp. whisper.cpp uses 2.0 GB of vram and faster whisper - 1.2 GB.

Hm, I see. You split the whole speech into small pieces for faster reaction? And that 0.03s turns into a multiplied delay? VRAM size is not a priority?

Okay, it's your business. =)

Sorry, you're right.

GL!

BahamutRU commented 3 months ago

And about XTTS: if you use streaming, it makes the quality worse. =) But there's no other option, I understand…
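
For context, a rough sketch of what streaming XTTS inference looks like with the Coqui TTS API (paths, reference audio and text below are placeholders and assumptions, not code from this repo):

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# placeholder paths to a local XTTS v2 checkpoint
config = XttsConfig()
config.load_json("xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts_v2/")
model.cuda()

# voice conditioning from a short reference clip (placeholder file)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

# inference_stream yields audio chunks as they are generated, which lowers
# latency but is the reason quality drops versus whole-utterance synthesis
wav_chunks = []
for chunk in model.inference_stream("Hello, how are you?", "en", gpt_cond_latent, speaker_embedding):
    wav_chunks.append(chunk)  # in a real app each chunk would be played back immediately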

dmikushin commented 3 months ago

Thank you both for this discussion. I've added highlights to the README: https://github.com/dmikushin/mozer-llama-fast/commit/895c324710c8bc627e0bc54c162e19adde625086