akumaburn closed this issue 1 year ago
Larger Llama 2 models use GQA, which uses an operation that is currently not supported by the CLBlast back-end. I'm working on implementing it. #3002
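For readers hitting this, here is a minimal standalone sketch (not llama.cpp code; the head counts are illustrative, not taken from a specific model) of the shape relation GQA introduces: the K/V tensors carry fewer heads than Q, so a batched mat-mul can no longer assume ne02 == ne12 and has to broadcast each KV head across a group of query heads.

// Illustrative only: why a mat-mul path that assumes ne02 == ne12 breaks under GQA.
#include <cstdio>

int main() {
    const int n_head    = 64; // query heads     -> ne12 on the Q side of the mat-mul
    const int n_head_kv = 8;  // key/value heads -> ne02 on the K/V side

    if (n_head == n_head_kv) {
        std::printf("MHA: ne02 == ne12, a plain batched mat-mul is enough\n");
    } else {
        const int group = n_head / n_head_kv; // query heads sharing one KV head
        std::printf("GQA: ne02=%d != ne12=%d, each KV head must be broadcast across %d query heads\n",
                    n_head_kv, n_head, group);
    }
    return 0;
}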
I've copied the changes to my project, and it seems to fix the issue. Tested on phind-codellama-34b-v1.Q4_K_S.gguf, both with latest main commits and with custom attention mask.
Great news!
My current work is here. I recommend it over my previous attempts, if you want to use it. Although the results should be equivalent, I've done much more testing on the latest code. Detailed discussion in #3002.
Seems like the new version works too. No problems with llama models, although I've only tested smaller ones (7b and 13b). Previously tested phind-codellama-34b-v1.Q4_K_S.gguf still works.
On a side note, it also works with stable-diffusion.cpp (although that doesn't have layer offloading at the moment, there is a small speedup in generation with CLBlast). So I assume it doesn't break anything in general?
@shibe2 Thanks! Tested your branch: https://github.com/shibe2/llama.cpp/commit/f5ed18bfa71878fffa4733d75095952e407a62f7 on an AMD GPU with LLAMA_CLBLAST and mistral-7b-v0.1.Q4_0.gguf
- works great!
Yep, just encountered this exact ne02 == ne12 problem with mistral-7b-instruct-v0.1.Q4_K_S.gguf while testing the server on the latest commit. Seems like this problem is no longer exclusive to 34b models. The fix still works. @shibe2 Thank you for this patch!
Same here. Thanks @shibe2 !
Yeah, Mistral 7B uses GQA too.
Thanks @shibe2 !
Hey, I'm having the same error when using the llama-cpp-python library (within langchain). I saw that there is a solution in llama.cpp itself, but I don't understand how to apply it when installing the llama-cpp-python package, or whether that's even possible. I'm not an expert at setting up a working environment, so any help would be much appreciated.
@mzeq1717 The Python package must be up to date with the latest changes in llama.cpp. If you build it from the main branch now, it should include support for GQA and the other fixes in the OpenCL back-end.
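For reference, llama-cpp-python forwards CMake flags through the CMAKE_ARGS environment variable, so rebuilding it against current llama.cpp with CLBlast enabled looks roughly like the following (the exact flag spelling may change between releases, so check the llama-cpp-python README):
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python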
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
llama.cpp compiled with:
make LLAMA_CLBLAST=ON
should support prompts longer than 116 characters (114 not counting the quotes).
Works with the following:
/home/user/Desktop/Projects/llama.cpp/main --interactive --mlock --ctx_size 4096 --temp 0.239 --top_k 200 --top_p 0.945 --repeat_last_n 512 --batch_size 4096 --repeat_penalty 1.0 --keep -1 --model /home/user/Desktop/Projects/LLaMA/wizardlm-1.0-uncensored-codellama-34b.Q5_K_M.gguf --threads 16 --n_predict 4096 --reverse-prompt User: --n-gpu-layers 16 --prompt "A transcript of a dialog, where the User interacts with his servant named Mia. Mia is an expert in all subjects.12"
Does not work with the following:
/home/user/Desktop/Projects/llama.cpp/main --interactive --mlock --ctx_size 4096 --temp 0.239 --top_k 200 --top_p 0.945 --repeat_last_n 512 --batch_size 4096 --repeat_penalty 1.0 --keep -1 --model /home/user/Desktop/Projects/LLaMA/wizardlm-1.0-uncensored-codellama-34b.Q5_K_M.gguf --threads 16 --n_predict 4096 --reverse-prompt User: --n-gpu-layers 16 --prompt "A transcript of a dialog, where the User interacts with his servant named Mia. Mia is an expert in all subjects.123"
To be clear, this prompt may work when llama.cpp is compiled without LLAMA_CLBLAST=ON.
Current Behavior
With LLAMA_CLBLAST=ON, the same command fails on prompts longer than ~116 characters instead of generating a response (see Failure Information below).
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
Linux phoenix-pc 6.4.12-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Thu, 24 Aug 2023 00:37:46 +0000 x86_64 GNU/Linux
(Running Arch Linux with Zen Kernel)
Failure Information (for bugs)
When llama.cpp is compiled with the LLAMA_CLBLAST=ON option, it doesn't handle prompts longer than 114-116 characters.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
make LLAMA_CLBLAST=ON
Then run the "Does not work with the following" command from Expected Behavior above (any prompt longer than ~116 characters triggers the failure).
Failure Logs
Environment info: