LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Lowvram for vulkan #830

Closed · daniandtheweb closed this issue 1 month ago

daniandtheweb commented 2 months ago

Right now the Vulkan backend is quite fast, almost reaching ROCm speeds. Would it be possible to add lowvram as an option for Vulkan, in order to manually lower VRAM usage and allow higher context lengths?
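
For reference, a minimal sketch of the mechanism such an option could map to, using llama.cpp's C API (which koboldcpp vendors): the `offload_kqv` context parameter keeps the KV cache in system RAM instead of VRAM. This is only an illustrative assumption about how a Vulkan lowvram mode might be wired, not koboldcpp's actual code, and the model path is a placeholder.

```cpp
// Minimal sketch (not koboldcpp's actual lowvram code path): llama.cpp's C API
// already exposes a per-context switch that keeps the KV cache in host RAM
// instead of VRAM. A "lowvram" option for Vulkan could plausibly map to this.
#include "llama.h"

int main() {
    llama_backend_init();  // older revisions take a bool `numa` argument

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // offload all layers to the Vulkan device

    // placeholder path, not a real file
    llama_model * model = llama_load_model_from_file("model-q5_k_m.gguf", mparams);
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx       = 16384; // a larger context that would otherwise eat VRAM
    cparams.offload_kqv = false; // "lowvram": keep the KV cache in system RAM

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (!ctx) { llama_free_model(model); return 1; }

    // ... run generation as usual ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

Keeping the KV cache host-side trades some speed for VRAM headroom, which is the same trade-off lowvram makes on the other backends.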

daniandtheweb commented 2 months ago

Testing with a 16k context length on the current build, q5_k_m breaks and outputs gibberish. 32k seems to work fine and produces good results, and 8k works well too. Since it's tied to the context increase it may be related to this issue, and it could perhaps be fixed by denying the offload with a lowvram-like option. (Apparently this gibberish at 16k isn't present on main llama.cpp.) EDIT: this seems to be an issue in upstream llama.cpp.
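
A rough sketch of the kind of A/B test this suggests: toggle KV-cache offload at each context size and compare outputs, to see whether the 16k gibberish tracks the offload path. This assumes llama.cpp's C API; the model path is a placeholder, and a real check would compare the generated text rather than just context creation.

```cpp
// Rough sketch of an A/B test: try each context size with the KV cache in VRAM
// and in host RAM, to check whether the 16k gibberish tracks the offload path.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;
    llama_model * model = llama_load_model_from_file("model-q5_k_m.gguf", mparams);
    if (!model) return 1;

    const int  sizes[]   = { 8192, 16384, 32768 };
    const bool offloads[] = { true, false };

    for (int n_ctx : sizes) {
        for (bool offload : offloads) {
            llama_context_params cparams = llama_context_default_params();
            cparams.n_ctx       = n_ctx;
            cparams.offload_kqv = offload;
            llama_context * ctx = llama_new_context_with_model(model, cparams);
            std::printf("n_ctx=%d offload_kqv=%d -> ctx %s\n",
                        n_ctx, (int) offload, ctx ? "ok" : "failed");
            // ... generate a short sample here and eyeball it for gibberish ...
            if (ctx) llama_free(ctx);
        }
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```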