ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Incredibly slow on Windows with CPU having AVX support #630

Open polkovnikov opened 1 year ago

polkovnikov commented 1 year ago

I've compiled your command and main example projects with the latest Clang 16 on Windows.

I have an Intel i7-2630QM CPU @ 2.00 GHz, which has 4 cores (8 hardware threads), and the CPU has AVX support.

I used the -O3 -march=native options, which means full optimization and all CPU features available on the build machine.

When I use the command program and say a short phrase, after it prints Speech detected! Processing ... it takes 50-60 seconds to output the resulting transcription.

The same happens with the main program: if I provide a WAV file containing a short phrase, it also takes 50-60 seconds to process it and output the recognized text.

Note: I compiled your program myself from the command line; I didn't use your Make or CMake files. That could be the reason it slowed down, but if I pass -O3 -march=native to Clang, I see no reason why there should be any problem. I need to build from the command line because I'm integrating whisper.cpp into my own project, which has its own C++ build system.
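For reference, my manual build is along these lines (a minimal sketch; the exact list of source files varies between whisper.cpp versions, so adjust it to match your checkout). The important part is that ggml.c, which contains the SIMD kernels, gets the same optimization flags as everything else:

clang -O3 -march=native -I. -c ggml.c -o ggml.o
clang++ -O3 -march=native -I. -c whisper.cpp -o whisper.o
clang++ -O3 -march=native -I. examples/main/main.cpp ggml.o whisper.o -o main.exe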

prusnak commented 1 year ago

What does the system_info line of the output say?

Find a line that looks like this:

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
polkovnikov commented 1 year ago

@prusnak Just AVX + SSE3. Also, I don't even have F16C; I checked that with cpuinfo.

system_info: n_threads = 8 / 8 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

misutoneko commented 1 year ago

Have you tried with, say, 4 threads to see if it still behaves the same? I think I saw a graph somewhere showing that you don't get much benefit from extra threads anyway. EDIT: Yup, take a look at issue #200.

Also, is it possible that you've run out of memory? Idk, just something to check.

polkovnikov commented 1 year ago

@misutoneko I've tried on both laptops that I have; both were bought around 2009-2012.

The other one, with 2 cores (2 hardware threads), also gives a very slow result, above 50 seconds. It has no AVX, just SSE3.

When I reduce threads from 8 to 4 or even to 1, things get even slower, especially with one thread.
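For reference, I'm setting the thread count with the -t flag of the main example, like this (the model path is just an example; adjust it to your setup):

main -m models/ggml-base.en.bin -f test.wav -t 4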

Also, I have a question: should I look at the encode or the decode time? See the console output below:

whisper_print_timings:   encode time = 184388.62 ms /     2 runs (92194.31 ms per run)
whisper_print_timings:   decode time =  3506.90 ms /    17 runs (  206.29 ms per run)

What do encode and decode mean here? It seems to me that encode is a few hundred times slower than decode per run. Is that alright?


I've also tried the Python package of Whisper, installed with

python -m pip install git+https://github.com/openai/whisper.git

It allows you to run a command like this:

whisper --model base.en --language en test.wav

This command takes only 5-10 seconds to transcribe, unlike whisper.cpp, which took 50 seconds or more.

But as I saw in the code, the Python version uses the PyTorch package and model, so it is much more optimized than whisper.cpp, which could be the reason for the great speedup.

misutoneko commented 1 year ago

OK, I guess it's unlikely to be a memory issue with base.en. Did you set the build type to "Release"? I think the default is Debug, and that one is slow. EDIT: The build type thing is CMake-related (issue #33), and I see you're not using CMake. So scratch that.
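For anyone who does use CMake, a Release build is the standard two-step invocation:

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release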

ulatekh commented 3 months ago

I find Whisper is incredibly slow unless CUDA support is enabled.
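With a CMake build, CUDA is a build-time option; the flag name has changed across versions (older releases used WHISPER_CUBLAS, newer ones use GGML_CUDA), so check the README for your version. For example:

cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release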