#1848 (open, opened 3 weeks ago)
A 2-second clip isn't a great test: the model takes time to load, so your measured time may be mostly model load rather than transcription. I would test with at least a 30-second clip.
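If you want to separate the two, something like this rough sketch should work (the audio path and model settings here are just examples, not your exact setup). One gotcha worth knowing: `transcribe()` returns a lazy generator, so the transcription only actually runs when you consume it.

```python
import time
from faster_whisper import WhisperModel

# Time the model load separately from the transcription.
t0 = time.perf_counter()
model = WhisperModel("small", device="cpu", compute_type="int8", cpu_threads=4)
load_time = time.perf_counter() - t0

t1 = time.perf_counter()
segments, info = model.transcribe("audio_30s.wav", beam_size=5)
# transcribe() is lazy: consuming the generator is what actually runs
# the transcription, so iterate it before stopping the clock.
text = "".join(segment.text for segment in segments)
transcribe_time = time.perf_counter() - t1

print(f"model load:    {load_time:.2f}s")
print(f"transcription: {transcribe_time:.2f}s")
```

If the load time dominates on a short clip, the benchmark is telling you about startup cost, not throughput.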
The OpenAI Whisper models are optimized for GPUs: the original implementation is much slower on CPUs, though faster-whisper narrows the gap considerably.
Some reference numbers from my testing with the "small" model on a 6m30s English audio clip:

Original OpenAI Whisper (small, beam size 5):
- Ryzen 5 5600G desktop CPU: 4m22s
- NVIDIA GTX 1050 Ti Max-Q: 1m31s

faster-whisper (int8 quantized model, beam size 5, 4 CPU threads in CPU mode):
- Ryzen 5 5600G desktop CPU: 0m54s
- Ryzen 5 5600U laptop CPU: 1m03s
- NVIDIA GTX 1050 Ti Max-Q: 0m28s
For faster-whisper I found that more than 4 CPU threads made little difference. Memory bandwidth also matters: running with a single RAM stick increased the time by roughly 50%.
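To check the thread-count behavior on your own hardware, a sweep along these lines would do (again a sketch, not my actual benchmark script; the model size and audio path are placeholders):

```python
import time
from faster_whisper import WhisperModel

# Hypothetical sweep over cpu_threads; the first iteration may also
# pay a one-time model download cost, so run it once beforehand.
for threads in (1, 2, 4, 8):
    model = WhisperModel("small", device="cpu",
                         compute_type="int8", cpu_threads=threads)
    t0 = time.perf_counter()
    segments, _ = model.transcribe("audio.wav", beam_size=5)
    for _ in segments:  # consume the lazy generator to force the work
        pass
    print(f"{threads} threads: {time.perf_counter() - t0:.1f}s")
```

In my case the 4-to-8-thread step was roughly flat, which is consistent with the workload being memory-bandwidth bound rather than compute bound.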
Hi,
I already found #279 and I think I have the same issue. I am using the code from #279 for benchmarking (only checking faster-whisper). A 2-second WAV file takes nearly 3 seconds to process with the "small" model; large-v3 takes 12 seconds. I'm running a VM with 8 cores (2.4 GHz) and 16 GB memory, no GPU, with Python 3.12 and faster-whisper 1.0.3.
I don't think this performance is expected, right?