LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0
4.35k stars 312 forks

vulkan: garbage output followed by GPU crash #897

Open llfw opened 4 weeks ago

llfw commented 4 weeks ago

hello,

i'm using:

built with:

LLAMA_OPENBLAS = 1
LLAMA_CLBLAST  = 1
LLAMA_VULKAN   = 1

LDFLAGS = -L/usr/local/lib
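For reference, those settings correspond to a build invocation along these lines (a sketch; assumes GNU make and the repo's Makefile, and that `gmake` is used on FreeBSD — the exact variable names should match the flags above):

```shell
# Build koboldcpp with the OpenBLAS, CLBlast and Vulkan backends enabled,
# pointing the linker at /usr/local/lib (where FreeBSD ports install libraries).
gmake LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 LLAMA_VULKAN=1 LDFLAGS="-L/usr/local/lib"
```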

the web interface starts and runs fine, but the model immediately produces garbage output (random binary strings), and after a couple of iterations the GPU eventually crashes. i'm assuming the GPU crash is only a symptom of another problem.

this looks a bit like https://github.com/ggerganov/llama.cpp/issues/5179, but from what i can see the fix for that is already in koboldcpp.

using a CPU backend (e.g., OpenBLAS) works fine, aside from being very slow.

LostRuins commented 3 weeks ago

What about the vulkan backend with 0 layers offloaded?
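Testing that would look something like the following (a sketch; the model path is a placeholder, and `--usevulkan`/`--gpulayers` are assumed to match the current launcher flag names):

```shell
# Keep the Vulkan backend selected but offload zero layers to the GPU;
# the weights stay in system RAM, isolating whether offloading itself
# is what triggers the garbage output.
python koboldcpp.py --usevulkan --gpulayers 0 model.gguf
```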

llfw commented 3 weeks ago

so i did a bit more testing: 0 layers works fine, and a small number (around 5-10) also seems to work. increasing it much past 10 eventually triggers the problem. could this be caused by running out of VRAM? the memory-use figures that koboldcpp reported didn't seem very high (at least for a 16GB card), but i'm not sure how to find out how much VRAM is actually in use.
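On the VRAM question: on Linux with the amdgpu driver, the kernel exposes a live usage counter in sysfs (an assumption here — FreeBSD's drm stack may not provide this path, in which case the helper reports it as unavailable). A minimal check:

```shell
# Print VRAM currently in use, in bytes, for a given DRM card (default card0).
# Falls back to "unavailable" when the amdgpu sysfs counter is absent
# (e.g. other drivers, or operating systems without this sysfs node).
vram_used() {
  f="/sys/class/drm/${1:-card0}/device/mem_info_vram_used"
  if [ -r "$f" ]; then
    cat "$f"
  else
    echo "unavailable"
  fi
}

vram_used card0
```

Interactive tools such as `radeontop` (AMD) or `nvtop` also show VRAM usage over time, which is handy for watching whether offloading more layers pushes the card toward its 16GB limit.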

i also tested with the same hardware on Linux (Ubuntu 24.04 using the pre-compiled koboldcpp-nocuda) and i couldn't seem to trigger the problem there, even with 40 layers offloaded - but, interestingly, i could trigger the problem with llama.cpp, even on Linux, when i compiled it myself. i wonder if this has something to do with the compiler optimisations in use? the CPU is a Ryzen 5800X3D (Zen 3 core).

as it's working on FreeBSD with fewer layers, i'm happy with that for now - but if it is a VRAM issue, perhaps there's a way to fail gracefully rather than crashing the GPU.

LostRuins commented 3 weeks ago

Perhaps @0cc4m can take a look, especially since you mention it happens upstream too.

Working with a few layers but failing with more sounds odd, but I doubt it's a compiler issue. Are you running OOM or near OOM?