ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

CPU Performance Regression? (Older version much faster) #2099

Open nanocosmos-ol opened 6 months ago

nanocosmos-ol commented 6 months ago

I compared an older build from Nov '23 with the current one from Apr '24, and the older version is much faster.

total time = 6225.76 ms (new) vs total time = 3817.54 ms (old)

Same CPU, same compiler and settings, same test:

CPU: AMD Ryzen 9 7950X3D 16-Core

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load: CPU total size = 147.37 MB
whisper_model_load: model size    = 147.37 MB
whisper_init_state: kv self size  = 16.52 MB
whisper_init_state: kv cross size = 18.43 MB
whisper_init_state: compute buffer (conv)   = 16.39 MB
whisper_init_state: compute buffer (encode) = 132.07 MB
whisper_init_state: compute buffer (cross)  = 4.78 MB
whisper_init_state: compute buffer (decode) = 96.48 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

whisper_print_timings:     load time =    64.61 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =   878.59 ms /     1 runs (  878.59 ms per run)
whisper_print_timings:   decode time =   935.20 ms /   256 runs (    3.65 ms per run)
whisper_print_timings:   batchd time =   544.69 ms /   320 runs (    1.70 ms per run)
whisper_print_timings:   prompt time =  3865.51 ms /  4096 runs (    0.94 ms per run)
whisper_print_timings:    total time =  6225.76 ms

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load: model ctx     = 140.66 MB
whisper_model_load: model size    = 140.54 MB
whisper_init_state: kv self size  = 5.25 MB
whisper_init_state: kv cross size = 17.58 MB
whisper_init_state: compute buffer (conv)   = 18.50 MB
whisper_init_state: compute buffer (encode) = 81.95 MB
whisper_init_state: compute buffer (cross)  = 4.49 MB
whisper_init_state: compute buffer (decode) = 24.70 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | COREML = 0 | OPENVINO = 0 |

whisper_print_timings:     load time =    83.24 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =   693.48 ms /     1 runs (  693.48 ms per run)
whisper_print_timings:   decode time =   874.80 ms /   256 runs (    3.42 ms per run)
whisper_print_timings:   prompt time =  2249.08 ms /    16 runs (  140.57 ms per run)
whisper_print_timings:    total time =  3817.54 ms

See https://github.com/ggerganov/whisper.cpp/issues/89#issuecomment-2081571638
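A quick breakdown of the timings above (plain shell arithmetic on the totals as printed) shows that most of the gap between the two runs sits in the prompt phase:

```shell
# Of the total-time gap between the new run (6225.76 ms) and the old run
# (3817.54 ms), how much is explained by the prompt phase alone?
gap=$(awk 'BEGIN { printf "%.2f", 6225.76 - 3817.54 }')
share=$(awk 'BEGIN { printf "%.0f", 100 * (3865.51 - 2249.08) / (6225.76 - 3817.54) }')
echo "total gap: $gap ms, prompt-phase share: $share%"
```

That points at the decoding/prompt path rather than the encoder, which the later beam-size discussion in this thread also suggests.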

przemoc commented 6 months ago

Thank you for the report.

Can you tell us your current OS and compiler? Were they the same for the older commit? EDIT: Sorry, I missed that you confirmed it's the same compiler.

Could you try running make with AVX512F_M= AVX512VNNI_M= AVX512VBMI_M= so that AVX-512 is not used?

That could make your new run a bit more comparable to the old one. (I don't know whether slow AVX-512 is the issue here, but it may be worth trying.)

nanocosmos-ol commented 6 months ago

It is Ubuntu 22.04.4, all running on the same machine in different folders, freshly compiled.

Without AVX512 it is indeed a bit better, but still not the same; it lands somewhere in between.

total time = 5086.27 ms

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load: CPU total size = 147.37 MB
whisper_model_load: model size    = 147.37 MB
whisper_init_state: kv self size  = 16.52 MB
whisper_init_state: kv cross size = 18.43 MB
whisper_init_state: compute buffer (conv)   = 16.39 MB
whisper_init_state: compute buffer (encode) = 132.07 MB
whisper_init_state: compute buffer (cross)  = 4.78 MB
whisper_init_state: compute buffer (decode) = 96.48 MB

system_info: n_threads = 4 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

whisper_print_timings:     load time =    56.47 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =   852.86 ms /     1 runs (  852.86 ms per run)
whisper_print_timings:   decode time =   622.78 ms /   256 runs (    2.43 ms per run)
whisper_print_timings:   batchd time =   323.46 ms /   320 runs (    1.01 ms per run)
whisper_print_timings:   prompt time =  3286.05 ms /  4096 runs (    0.80 ms per run)
whisper_print_timings:    total time =  5086.27 ms

przemoc commented 6 months ago

You may also want to try --beam-size 2, as that seems to have been the default in the older commit. It was changed in b6c5f49b78b214b7b4aa7392a8ba489c78b7382a. As Georgi commented in another issue:

The quality with more beams in general should be better, but it's possible that you don't observe much of a difference

nanocosmos-ol commented 6 months ago

It looks like the former default was beam-size = -1? That value switches the strategy between WHISPER_SAMPLING_BEAM_SEARCH and WHISPER_SAMPLING_GREEDY.

bench doesn't support beam-size, so I am trying a real wav file instead, and it does improve the speed (still not as fast as the old version, but closer).

default branch (new, with AVX512):

AVX512=1  beam-size 5 (default)   total time = 20678.01 ms
AVX512=1  beam-size 2             total time = 17052.18 ms
AVX512=1  beam-size -1            total time = 15465.98 ms
AVX512=0  beam-size 5 (default)   total time = 19365.01 ms
AVX512=0  beam-size 2             total time = 15219.21 ms
AVX512=0  beam-size -1            total time = 13869.20 ms

Old version:

AVX512=0  beam-size 5             total time = 21862.52 ms
AVX512=0  beam-size 2             total time = 14704.33 ms
AVX512=0  beam-size -1 (default)  total time = 12398.81 ms

Interesting results, especially the AVX issue. We'll play around with it a bit.

Thanks for your help!

(note: the beam-search default seems to have changed from -1 to 2 to 5 over time: https://github.com/ggerganov/whisper.cpp/blob/master/whisper.cpp#L4625 )
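A quick arithmetic check (plain shell, using the AVX512=0 totals above) separates how much of the slowdown the beam-size default accounts for from the regression that remains at identical settings:

```shell
# New vs old at identical settings (AVX512=0, beam-size -1): residual regression.
residual=$(awk 'BEGIN { printf "%.1f", 100 * (13869.20 - 12398.81) / 12398.81 }')
# New default (beam-size 5) vs old default (beam-size -1): the full observed gap.
full=$(awk 'BEGIN { printf "%.1f", 100 * (19365.01 - 12398.81) / 12398.81 }')
echo "residual regression: $residual%  full default-vs-default gap: $full%"
```

So the changed default explains most of the gap, but a real code-level regression of roughly 12% remains even with matched settings.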

przemoc commented 6 months ago

It looks like the former default was beam-size=-1 ?

I was referring to changes in whisper_full_default_params, where beam_search.beam_size changed from 2 to 5, but you're right that whisper_params.beam_size previously did not use whisper_full_default_params(), and it was set to -1.

So you may want to try --beam-size 1 too, I guess.

For AVX-512 on Ryzen, let me mention: Zen4's AVX512 Teardown

Ubuntu 22.04 ships a relatively old compiler; results with a more recent one might be different.


I'm wondering whether WHISPER_NO_AVX512 should be introduced in the Makefile, to make it easier to disable AVX-512 (setting 3 variables is relatively cumbersome). Maybe we should even set WHISPER_NO_AVX512 to 1 by default, but we would need a bigger sample to decide whether more folks are hurt performance-wise by having AVX-512 enabled than by having it disabled. The autodetection done in the Makefile assumes that adding more ISA extensions allows the compiler to do a better job (produce more efficient code), but that may not always be the case, as we can see in this issue.
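Such a switch could look roughly like this (a hypothetical sketch following the existing WHISPER_NO_* convention; only the three AVX-512 variable names come from the make invocation mentioned earlier, not from an actual patch):

```make
# Hypothetical WHISPER_NO_AVX512 switch: when set, clear the three AVX-512
# feature variables in one go so no AVX-512 code is emitted.
ifdef WHISPER_NO_AVX512
    AVX512F_M :=
    AVX512VNNI_M :=
    AVX512VBMI_M :=
endif
```

Usage would then be a single `make WHISPER_NO_AVX512=1` instead of clearing the three variables by hand.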

Linux13524 commented 6 months ago

We are seeing similar behavior when comparing version 1.4.3 with the latest 1.5.5. But since we are using CMake for the build, I guess it cannot be related to AVX512, because WHISPER_NO_AVX512 is set by default there, right?

Also, we are not using beam search (we call whisper_full_default_params(WHISPER_SAMPLING_GREEDY)), so this should not affect the performance either, right?

It seems greedy.best_of also changed from 2 to 5, but when I change it back the performance does not change much, so I guess this is also unrelated.

przemoc commented 6 months ago

Could you do a git bisect between good (v1.4.3) and bad (v1.5.5) to try to locate the main commit responsible for the performance drop in your environment? Fewer than 10 steps (experiments) should suffice.
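For reference, the bisect workflow looks like this. The snippet below is a self-contained toy illustration (a synthetic repo with a scripted "benchmark", not whisper.cpp itself) in which `git bisect run` automatically finds the commit that made the fake benchmark slow; in the real case the test command would be a build plus a timed benchmark run:

```shell
# Build a throwaway repo of 15 commits where commit 7 introduces a "regression":
# the fake benchmark starts reporting 6225 ms instead of 3817 ms.
dir=$(mktemp -d) && cd "$dir"
git init -q
git config user.email "you@example.com" && git config user.name "you"
for i in $(seq 1 15); do
    if [ "$i" -ge 7 ]; then echo 'echo 6225' > bench.sh; else echo 'echo 3817' > bench.sh; fi
    git add bench.sh && git commit -q -m "commit $i"
    git tag "c$i"
done
# Bisect between bad (HEAD) and good (c1); 'git bisect run' marks each step
# good or bad from the exit status of the test command (0 = good, i.e. fast).
git bisect start HEAD c1 > /dev/null
git bisect run sh -c '[ "$(sh bench.sh)" -lt 5000 ]' > /dev/null
first_bad=$(git rev-parse refs/bisect/bad)
echo "first bad commit: $first_bad"
git bisect reset > /dev/null
```

With real builds, the test command would compile the checked-out tree and exit non-zero when the measured total time exceeds a chosen threshold.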

Linux13524 commented 6 months ago

Ok, so my git bisect gives me the following result:

3e5c7feeffb86555d63ef592f79ce8365a069174 is the first bad commit
commit 3e5c7feeffb86555d63ef592f79ce8365a069174
Author: Evan Jones <evan.q.jones@gmail.com>
Date:   Mon Nov 13 03:51:34 2023 -0500

    whisper : add grammar-based sampling (#1229)

    * whisper : add grammar-based sampling

    * build : fix after master merge

    * command : fix exception when recognizing the command

    * whisper : fine-tuning grammar functionality

    * command : grammar-related improvements

    - option to read grammar from file
    - add sample grammars for colors and chess moves
    - fine-tune the performance further

    * grammars : add assistant + update comments

    * command : enable beam-search, add "no_timestamps", add "context", add p

    * whisper : remove comment

    ---------

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Any idea how this commit could influence performance so badly?

BTW: the performance drop on our side is roughly 40%.

Linux13524 commented 6 months ago

I dropped (reverted) this commit to test whether it would fix the performance problem, but it didn't. So I did another git bisect, and the next bad commit is "whisper : add batched decoding (#1486)" (b6c5f49b78b214b7b4aa7392a8ba489c78b7382a), but I don't think I can easily drop that one.

Is there anything else I can try based on this information?

przemoc commented 5 months ago

@ggerganov, do you have any ideas what else @Linux13524 could try or tweak in pursuit of restoring whisper.cpp performance from 1.4.x in 1.5.x?

ggerganov commented 5 months ago

Hm not sure. @Linux13524 Is this CPU-only or using CUDA / Metal backend?

Linux13524 commented 5 months ago

We first noticed it while testing the new CUDA performance, but the git bisects above were done CPU-only. The CUDA performance drop could also be due to something else, though; I cannot easily test that, as my notebook has no NVIDIA GPU.

przemoc commented 5 months ago

Just a side comment and follow-up to my earlier comment:

I'm wondering if WHISPER_NO_AVX512 shouldn't be introduced in Makefile, to make it easier to disable AVX-512 (setting 3 variables is relatively cumbersome).

I made:

Linux13524 commented 4 months ago

Any update on this? It seems we still have these performance issues on the latest version.