geekodour opened this issue 1 year ago (Open)
I've just submitted a pull request that aims to address a few of these issues. Currently, OpenBLAS isn't enabled on the Windows platform, even though the previously released binary is named whisper-blas-bin-x64. When OpenBLAS is enabled, it boosts CPU inference speed by a factor of 3-4. I ran some tests on my i7-12700H using the -w 2 flag (matrix multiplication benchmark) and found it reaches at least 50% of the theoretical peak with OpenBLAS enabled.
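As a hedged sketch of the kind of commands involved (not necessarily the exact ones used above): the CMake option name has changed across whisper.cpp releases, so treat `WHISPER_OPENBLAS` as illustrative, the model path is a placeholder, and `-w 2` refers to the matrix-multiplication mode of the bundled `bench` tool.

```sh
# Build with OpenBLAS enabled (option name may differ between releases)
cmake -B build -DWHISPER_OPENBLAS=ON
cmake --build build --config Release

# Benchmark only the matrix multiplications (-w 2); model path is a placeholder
./build/bin/bench -w 2 -m models/ggml-base.en.bin -t 8
```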
Does this mean we're at 30 seconds compared to faster-whisper's 14 seconds?
Disclaimer: I'm the author of faster-whisper.
For this example, there are two main reasons why faster-whisper is faster:
Just throwing in there that faster-whisper is quicker than whisper.cpp on the GPU as well.
Using an RTX 4080 on Ubuntu 22.04, a 12min audio sample takes 3.4min to transcribe using whisper.cpp with a medium model, while faster-whisper does it in 30s using the higher-quality large-v2 model. The medium model brings it down to 20s. It's seriously impressive.
@guillaumekln has batched beam search still not been implemented?
The related issue #1048 is still open so I don't think it is implemented yet.
Going to take a crack at bringing over the implementation from llama.
Fair warning, I am not very experienced with C/C++. Will link the PR here once ready for review.
Could you run another test on the latest version of whisper.cpp? I'm curious to see how much we've improved since last month. You can find the latest version in PR #1243. Thanks! @geekodour
Please use OpenBLAS (64-bit), e.g. openblas64-dev, and use the following command for testing:
./main -bs 5 -bo 5 -t 8 -f steve2.wav -m models/ggml-small.en.bin
> Going to take a crack at bringing over the implementation from llama. Fair warning, I am not very experienced with C/C++. Will link the PR here once ready for review.

Any progress?
> Could you run another test on the latest version of whisper.cpp? I'm curious to see how much we've improved since last month. You can find the latest version in PR #1243. Thanks! @geekodour
> Please use OpenBLAS (64-bit), e.g. openblas64-dev, and use the following command for testing: `./main -bs 5 -bo 5 -t 8 -f steve2.wav -m models/ggml-small.en.bin`
In terms of CPU performance, whisper.cpp isn't lagging too far behind. To give you an idea, our latest tests were run on an i7-12700H, using 4 threads and a beam size of 5.
diffusion2023-07-03.wav (27m:49s)

| Implementation | Precision | Beam size | Time |
|---|---|---|---|
| whisper.cpp (1ee2707) | FP32 | 5 | 9m:45s |
| faster-whisper (ad388cd) | FP32 | 5 | 7m:30s |
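For reference, a whisper.cpp invocation matching the settings above (4 threads, beam size 5) would look roughly like this; the model file is a placeholder, since the model used for the table is not stated here.

```sh
# 4 threads, beam size 5; the model file is a placeholder
./main -t 4 -bs 5 -m models/ggml-small.en.bin -f diffusion2023-07-03.wav
```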
@bobqianic We'd better get batched decoding implemented before running additional tests. Without it, whisper.cpp will always be significantly slower.
> @bobqianic We'd better get batched decoding implemented before running additional tests. Without it, whisper.cpp will always be significantly slower.
Agree.
No progress on this yet from me. Will update here with draft PR when I have something.
I did some measurements of my own that I'd like to share along with some observations and maybe perf recommendations for this great project.
Audio Context size 1500 (default)

| # | Method | Compute type | Relative speedup | Mean time (ms) | Std Dev |
|---|---|---|---|---|---|
| #1 | WhisperCpp | - | 1.0 | 4723.80 | 15.13 |
| #27 | WhisperCpp | BLAS | 1.72 | 2745.40 | 14.72 |
| #2 | CT2InsideCpp | BLAS f32 | 2.28 | 2068.30 | 18.09 |
| #3 | CT2InsideCpp | BLAS i8 | 1.59 | 2976.30 | 27.16 |
| #4 | CT2InsideCpp | MKL f32 | 2.76 | 1712.00 | 24.04 |
| #5 | CT2InsideCpp | MKL i8 | 2.14 | 2210.60 | 28.87 |
| #6 | CT2 | BLAS f32 | 2.68 | 1762.10 | 33.39 |
| #7 | CT2 | BLAS i8 | 1.78 | 2646.40 | 16.97 |
| #8 | CT2 | MKL f32 | 3.37 | 1403.10 | 26.42 |
| #9 | CT2 | MKL i8 | 2.52 | 1873.60 | 8.66 |
| #10 | CT2 (No MEL) | BLAS f32 | 2.73 | 1730.90 | 12.55 |
| #11 | CT2 (No MEL) | BLAS i8 | 1.81 | 2615.70 | 15.70 |
| #12 | CT2 (No MEL) | MKL f32 | 3.43 | 1379.00 | 13.19 |
| #13 | CT2 (No MEL) | MKL i8 | 2.54 | 1856.70 | 10.24 |
Audio Context size 512 (seems to strike a good accuracy vs. performance balance, see #166)

| # | Method | Compute type | Relative speedup | Mean time (ms) | Std Dev |
|---|---|---|---|---|---|
| #14 | WhisperCpp | - | 1.0 | 1349.00 | 37.21 |
| #28 | WhisperCpp | BLAS | 1.76 | 766.60 | 12.42 |
| #15 | CT2InsideCpp | BLAS f32 | 1.48 | 909.00 | 20.83 |
| #16 | CT2InsideCpp | BLAS i8 | 1.29 | 1042.60 | 12.34 |
| #17 | CT2InsideCpp | MKL f32 | 2.42 | 556.40 | 7.62 |
| #18 | CT2InsideCpp | MKL i8 | 1.99 | 677.30 | 10.56 |
| #19 | CT2 | BLAS f32 | 1.71 | 791.10 | 11.68 |
| #20 | CT2 | BLAS i8 | 1.45 | 932.00 | 6.88 |
| #21 | CT2 | MKL f32 | 3.03 | 444.80 | 12.61 |
| #22 | CT2 | MKL i8 | 2.38 | 566.40 | 7.69 |
| #23 | CT2 (No MEL) | BLAS f32 | 1.76 | 767.90 | 4.25 |
| #24 | CT2 (No MEL) | BLAS i8 | 1.48 | 910.90 | 3.51 |
| #25 | CT2 (No MEL) | MKL f32 | 3.15 | 427.80 | 7.35 |
| #26 | CT2 (No MEL) | MKL i8 | 2.44 | 552.90 | 20.40 |
Legend:
- CT2: https://github.com/OpenNMT/CTranslate2
- CT2 (No MEL): MEL was pre-computed before the benchmark; included only to compare the cost of the MEL stage vs. the Encode-Decode stages
- WhisperCpp: downloaded and compiled as is, with /arch:AVX2; not sure what the default compute type is
- CT2InsideCpp: WhisperCpp frontend with the Encode-Decode stages replaced by callbacks to CT2
- MKL: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
- BLAS: https://github.com/OpenMathLib/OpenBLAS
Observations:

1) Good job on finding the Audio Context size reduction trick. I don't care whether it's a documented feature of the model or not; it works and makes processing on embedded devices actually viable. Thank you! (See the CLI sketch right after this list.)
2) In https://github.com/ggerganov/whisper.cpp/discussions/589#discussioncomment-5265713 the question was raised how MKL compares to OpenBLAS. In https://github.com/ggerganov/whisper.cpp/discussions/589#discussioncomment-5289714 guillaumekln observed ~70% better perf with MKL. I second this, with a small caveat: in my measurements this was only the case for AC=512 (#19 vs #21 => 78%). For the full AC (#6 vs #8) the improvement was only 25%. The difference likely comes from the much smaller data and probably the smaller model too. Still, very nice gains, and I had no installation issues, so it could be very low-hanging fruit for a ggml improvement.
3) INT8 compute mode turned out to be a perf regression across the board compared to FLOAT32. This was very interesting to me; I believe it is the result of small data and a small model running on "sufficiently powerful" HW. What is normally a memory-bound program managed to fit comfortably into L3 at every computation step, turning it into a compute-bound problem, so the memory-footprint reduction had no effect and the additional compute for i8 conversions increased the times. I will definitely follow up on this when porting to an RPi: with much less cache, it could become memory-bound again on f32. Let me know if you have a different hypothesis.
4) Comparing #2 #3 #4 #5 vs #6 #7 #8 #9, and #15 #16 #17 #18 vs #19 #20 #21 #22, shows the overhead of the Whisper.cpp frontend vs. the pure CT2 model. This is particularly interesting as it suggests Whisper.cpp could be optimized by 12-25% just by reducing data shuffles, runtime allocations and runtime tensor-graph preparations.
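Since the audio-context trick keeps coming up, here is a minimal sketch of how it can be exercised from the whisper.cpp CLI via the `-ac` / `--audio-ctx` option. The model and sample paths below are placeholders, not the ones from my benchmark.

```sh
# Default encoder audio context (1500)
./main -m models/ggml-base.bin -f samples/jfk.wav

# Reduced audio context (512), trading some accuracy for speed
./main -ac 512 -m models/ggml-base.bin -f samples/jfk.wav
```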
For reproduction, I attach the sources of my benchmark as well as the patch to Whisper.cpp that was used to expose the MEL stage and replace the Encode-Decode steps.
Disclaimers:

1) Runtimes measure only the actual evaluation. Model loading, teardown and the first warmup run are not counted. The reasoning is simple: for short transcriptions we typically care about response time, so the model will be preloaded, and for long transcriptions the loading time amortizes to nothing.
2) My implementation of the Audio Context size reduction is not correct for multi-pass decoding (clips longer than 30s). Whisper.cpp nailed this down; I did not bother, as it is outside my target use case for now.
3) All measurements were done on Win10, i7-9700K, MSVC, the Whisper Base model (multi-lang) and the following ~7s audio voice command: https://github.com/Picovoice/rhino/blob/4a69bd13dceed859911f4e360dfde0cdb9d9fbf0/resources/audio_samples/test_within_context.wav
> WhisperCpp: downloaded and compiled as is, with /arch:AVX2; not sure what the default compute type is
That's FP32. I believe if you give the OpenBLAS version a try, you'll find its performance quite similar to CT2, with hardly any noticeable difference. (v1.5.4)
Great, I completely missed that in the docs, thanks. It should be in BIG RED: if you want perf, enable this :)
So for the BLAS backend I added measurements #27 and #28 to the tables above. A solid ~1.7x perf improvement; at AC 512 it actually beats CT2. At AC 1500 there is something more going on though, it's not even close.
In the ggml code I also noticed there is some support for MKL. I tried to measure it, but it keeps throwing on me and I'm not sure why. Maybe you'll see what's going on? Maybe the support for it just isn't finished yet?
Intel oneMKL ERROR: Parameter 9 was incorrect on entry to cblas_sgemm.
ne1 = 1500
ne01 = 512
ne10 = 512
ne00 = 512
// cblas_sgemm(layout, transA, transB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc)
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
            ne1, ne01, ne10,   // M, N, K
            1.0f, y, ne10,     // alpha, A, lda (lda is the 9th argument)
            x, ne00,           // B, ldb
            0.0f, d, ne01);    // beta, C, ldc
.exe!ggml_compute_forward_mul_mat(const ggml_compute_params * params, ggml_tensor * dst) Line 10629 C
.exe!ggml_compute_forward(ggml_compute_params * params, ggml_tensor * tensor) Line 16061 C
.exe!ggml_graph_compute_thread(void * data) Line 18157 C
.exe!ggml_graph_compute(ggml_cgraph * cgraph, ggml_cplan * cplan) Line 18490 C
.exe!ggml_backend_cpu_graph_compute(ggml_backend * backend, ggml_cgraph * cgraph) Line 809 C
.exe!ggml_backend_graph_compute_async(ggml_backend * backend, ggml_cgraph * cgraph) Line 282 C
.exe!ggml_backend_graph_compute(ggml_backend * backend, ggml_cgraph * cgraph) Line 276 C
.exe!ggml_graph_compute_helper(ggml_backend * backend, ggml_cgraph * graph, int n_threads) Line 190 C++
.exe!whisper_encode_internal(whisper_context & wctx, whisper_state & wstate, const int mel_offset, const int n_threads, bool(*)(void *) abort_callback, void * abort_callback_data) Line 2088 C++
.exe!whisper_full_with_state(whisper_context * ctx, whisper_state * state, whisper_full_params params, const float * samples, int n_samples) Line 5247 C++
.exe!whisper_full(whisper_context * ctx, whisper_full_params params, const float * samples, int n_samples) Line 5943 C++
It looks like the build compiled with OpenBLAS is actually worse on a Raspberry Pi 5 (1743.21 ms vs. 6232.27 ms). I ran the jfk sample a few times; while the numbers differed slightly, the overall result was the same.
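One thing worth double-checking in comparisons like this: whisper.cpp prints a system_info line at startup, and the BLAS entry there shows whether the BLAS path was actually compiled into the binary being tested. A rough sketch (paths are placeholders and the exact output formatting varies between versions):

```sh
# Look for "BLAS = 1" in the system_info line to confirm the OpenBLAS build
./main -m models/ggml-base.en.bin -f samples/jfk.wav 2>&1 | grep -i system_info
```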
I did a very rough comparison of https://github.com/guillaumekln/faster-whisper and whisper.cpp, and it turns out faster-whisper is faster than whisper.cpp on the CPU.
For example, it takes faster-whisper 14 seconds with the small.en model, whereas with whisper.cpp it takes 46 seconds. What causes this slowness? Or am I not setting the parameters correctly? I tried keeping the beam size and thread count similar. I suspect I am not doing the comparison correctly, so it would be awesome if someone more knowledgeable could explain why faster-whisper is faster on the CPU.
I think I am comparing int8 (faster-whisper) to int4 (https://huggingface.co/ggerganov/whisper.cpp) quantization here, but I'm not sure how much of a difference that should make.
See comparison here: https://gist.github.com/geekodour/8734b3bf22b8ede61fb5bfc92ce68fe3