ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Fallback from Vulkan to CPU #2411

Open thewh1teagle opened 3 weeks ago

thewh1teagle commented 3 weeks ago

Vulkan has a lot of bugs on Windows / Linux, but when it works, it is much faster than CPU (10-20x). I'm forced to use Vulkan in the vibe project, but many users report that it crashes on Windows / Linux.

Some of the errors:

PopOS https://github.com/thewh1teagle/vibe/issues/269

Ubuntu

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) HD Graphics 620 (KBL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32
2024-09-09T10:58:08.692125Z ERROR whisper_rs::whisper_sys_tracing: whisper_model_load: ERROR not all tensors loaded from model file - expected 947, got 3
2024-09-09T10:58:08.711251Z ERROR whisper_rs::whisper_sys_tracing: whisper_init_with_params_no_state: failed to load model

Arch https://github.com/thewh1teagle/vibe/issues/267

Windows https://github.com/thewh1teagle/vibe/issues/266

https://github.com/thewh1teagle/vibe/issues/263

Windows

ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GT 730 buffer from size 0.00 MiB to 565.06 MiB
ggml_vulkan: Device memory allocation of size 592512000 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate NVIDIA GeForce GT 730 buffer of size 592512000
thewh1teagle commented 1 week ago

@ggerganov

Do you have any suggestions on how we can improve the stability of ggml and whisper.cpp to reduce crashes (aborts) and ensure they consistently return errors instead?

ggerganov commented 1 week ago

Hm, I haven't tested the Vulkan backend with whisper.cpp at all, so I cannot recommend any way to improve the stability. But looking at the error - this seems like it's trying to load an invalid model, no?

The other error seems like the GPU device runs out of memory. I think your application can check if there is enough available memory before trying to load the Whisper model.

thewh1teagle commented 1 week ago

@ggerganov

There are a lot of different issues with Vulkan. For instance, a new issue reports that Vulkan failed because the device doesn't support fp16 storage: https://github.com/ggerganov/llama.cpp/issues/7620

How can we fall back to CPU when it fails? Vulkan is really important on Windows; it's the only broadly compatible GPU acceleration we currently have there.

I've considered using OpenVINO on Windows instead, but last time I checked it requires special files to be installed and a special model file, so it wouldn't work better than Vulkan in a desktop app.

thewh1teagle commented 2 days ago

@ggerganov

I've noticed that CoreML/Metal includes a fallback mechanism to CPU. Since Vulkan has compatibility issues on many modern PCs, it would be great if Vulkan could have a similar fallback.

Would you be able to outline the steps needed to implement a CPU fallback for Vulkan? I'm willing to work on it and collaborate with others to push this forward. Should I focus on this in the ggml repository or in whisper.cpp?

Thanks!

ggerganov commented 1 day ago

I think the fallback mechanism only applies to operators that are not yet implemented on the backend. Are there such operators in the Vulkan backend?

With the change that I just pushed, the memory usage should be reduced significantly. I will make a new whisper.cpp release in the following days, and after that, if the issues still persist, we can discuss how to improve the Vulkan state.

thewh1teagle commented 1 day ago

@ggerganov

The tiny model still fails to load with Vulkan on the latest commit, even though 1 GB of GPU memory is available:

C:\ReallyTempEmptyEveryDay\vibe.test>.\vibe.exe

C:\ReallyTempEmptyEveryDay\vibe.test>ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA GeForce GTX 1660 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 11.08 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 60.29 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 2.20 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.00 MiB
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_reserve_n: reallocating NVIDIA GeForce GTX 1660 Ti buffer from size 0.00 MiB to 89.95 MiB
ggml_vulkan: Device memory allocation of size 94318336 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate NVIDIA GeForce GTX 1660 Ti buffer of size 94318336

I think the fallback mechanism only applies to operators that are not yet implemented on the backend. Are there such operators in the Vulkan backend?

Not that I'm aware of. I thought it falls back completely to CPU. That would be useful.

ggerganov commented 22 hours ago

@thewh1teagle Can you confirm that the memory allocation issue is now fixed with the latest commit on master?