ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++

Recommendations for performance when running whisper.cpp on VPS? #524

jalustig commented 1 year ago

I'm experimenting with running Whisper at scale on a VPS cluster, but I'm not getting good performance; it is quite slow even on dedicated CPU hardware. Here is the CPU feature report printed when I run ./main:

system_info: n_threads = 2 / 2 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

Is the lack of BLAS one potential reason it's slow? I have also specifically built it with OpenBLAS, but for some reason it still isn't actually running with BLAS.
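
For reference, a rough sketch of enabling OpenBLAS on a Debian/Ubuntu VPS is shown below; the package name and the WHISPER_OPENBLAS Makefile flag are assumptions based on the build instructions around that time, so check the README for your whisper.cpp version.

```
# Sketch: building with OpenBLAS on Debian/Ubuntu. The package name and the
# WHISPER_OPENBLAS flag are assumptions; check the whisper.cpp README for
# the exact option in your version.
sudo apt-get install -y libopenblas-dev
make clean
WHISPER_OPENBLAS=1 make -j

# After rebuilding, the system_info line printed by ./main should report
# "BLAS = 1"; if it still shows "BLAS = 0", OpenBLAS was not picked up at
# build time.
```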

nxtreaming commented 1 year ago

I use the following VPS configuration; it reaches about 90% of realtime, i.e. 36 minutes of audio takes roughly 40 minutes to process:

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
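
For context, the n_threads value in that line is controlled by the -t/--threads option of ./main; a minimal invocation using all 16 vCPUs might look like the sketch below (the model and audio file names are placeholders).

```
# Placeholder model and audio paths; -t sets the number of compute threads.
./main -m models/ggml-medium.bin -f audio.wav -t 16
```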

ggerganov commented 1 year ago

I don't think you can do anything at the moment to improve the performance. In the future, quantised models might be useful for such use-cases, so keep track of progress in #540.

sigaloid commented 1 year ago

In my experience it's very hard to improve performance without offloading to a GPU. Even throwing dramatically more cores at it does not work: an AMD EPYC 7532 at 128/128 threads runs no faster than at 12/128.

The sweet spot is probably 6-8 cores per instance, with a quantized model if accuracy allows, scaling the workload out across a cluster.
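
As a hedged sketch of that scale-out idea, assuming GNU parallel is available and each whisper.cpp instance gets 8 threads (the model path and audio file layout are placeholders):

```
# Run one whisper.cpp process per audio file, 8 threads each, so that
# roughly all cores stay busy. Model path and file glob are placeholders.
parallel -j $(( $(nproc) / 8 )) \
  ./main -t 8 -m models/ggml-base.en.bin -f {} -otxt ::: audio/*.wav
```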

I wish whisper.cpp scaled better, though there's some performance discussion in #200 and hopefully this can be improved over time.

Even when running on a machine with triple Titan RTXs, a 24-core Xeon E5-2643 v4 setup, and 512 GB of RAM, I only get 1.66x realtime for the large model (~7 min for ~13 min of audio). If nothing else, this shows you cannot simply throw more resources at it to speed it up.

Compared to whisper.cpp, openai/whisper proper handles offloading to CUDA devices much more efficiently: the same machine running openai/whisper reaches about 7x realtime.

Compared to openai/whisper on CPU, however, whisper.cpp pulls ahead by a long shot. I don't have exact numbers, but openai/whisper on CPU was roughly 0.33x realtime on the large model.