Closed — daniandtheweb closed this issue 1 month ago
Testing with a 16k context length on the current build, q5_k_m breaks and outputs gibberish. 32k seems to work fine and produces good results, and 8k works well too. Since the problem appears when the context is increased, it may be related to this issue, and it could perhaps be fixed by denying the offload with a lowvram-like option. (Apparently this gibberish issue at 16k isn't present on main llama.cpp.) EDIT: this seems to be an issue in upstream llama.cpp.
Right now the Vulkan backend is quite fast and almost reaches ROCm speeds. Would it be possible to add lowvram as an option for Vulkan, in order to manually lower VRAM usage for higher context lengths?