I suspect you're hitting some internal memory buffer limit in `ggml.c` or maybe CLBlast. Can you watch the memory utilisation on your GPU when it is running? For CUDA I would use `nvidia-smi`.
Given that you only have 4GB of VRAM, are you setting `n_gpu_layers`? If so, try reducing it to a smaller number and see whether that changes when the problem occurs. For reference, Vicuna 13B with 40 CuBLAS layers on my NVidia GPU uses 11GB of VRAM.
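For what it's worth, a minimal sketch of setting the offload count through llama-cpp-python's high-level API (the model path and layer count here are placeholders, not values taken from this issue):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # placeholder path
    n_gpu_layers=20,  # start low on a 4GB card and raise it while watching VRAM
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```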
If you have a single stand-alone Python script that generates the error, I can try to reproduce it with my NVidia GPU. If I can't repro, that points to CLBlast as part of the issue.
Finally, stupid question, but did you use the exact same params and prompt length with `./main`?
> I suspect you're hitting some internal memory buffer limit in `ggml.c` or maybe CLBlast. Can you watch the memory utilisation on your GPU when it is running? For CUDA I would use `nvidia-smi`.
Using `radeontop` I registered nothing out of the ordinary. Through the entire run time of the script, the VRAM utilization stayed in a comfortable range of ~1830M.
> Given that you only have 4GB of VRAM, are you setting `n_gpu_layers`? If so, try reducing it to a smaller number and see whether that changes when the problem occurs. For reference, Vicuna 13B with 40 CuBLAS layers on my NVidia GPU uses 11GB of VRAM.
I specified 32 `n_gpu_layers` in my `./main` run, and in my Python script I just use the defaults. AFAIK the 7B model has 31 layers, which easily fit into my VRAM, as while chatting for a while using the `./main` example I sit at around 2100M with more than 500 tokens generated already.
> If you have a single stand-alone Python script that generates the error, I can try to reproduce it with my NVidia GPU. If I can't repro, that points to CLBlast as part of the issue.
https://gist.github.com/Firstbober/a08de9cf01ea90b6be8389be9a249857 I changed the prompt a few times, and in some cases the error doesn't appear. Maybe there is something in it that makes the library uncomfortable? The prompt I attached in the script is the one that seg faults, the DAN one from the llama.cpp repo seems to be working fine.
> Finally, stupid question, but did you use the exact same params and prompt length with `./main`?
Yes
I modified your script to take the model from `sys.argv[1]`, and I also notice it isn't offloading any layers to the GPU or loading the GPU.
I suspect the CLBlast support is very new and somewhat unstable, given that most devs (including @ggerganov) are using NVidia GPUs, sorry.
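For context, the change described above amounts to something like this (a sketch only; the gist's actual class and defaults may differ):

```python
import sys
from llama_cpp import Llama

# Take the model path from the command line instead of hard-coding it.
llm = Llama(model_path=sys.argv[1])  # defaults, so no layers are offloaded to the GPU
```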
Well, I compiled libllama.so without the support for CLBlast and the segmentation fault still persists.
I pinned the problem down to the `n_past` argument in `llama_eval`, so the next few hours will be spent figuring out how to stop llama from repeating itself after reaching max context… Yay. I saw something about context switching in the `./main` example, so it will probably be useful.
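A rough sketch of the kind of guard this needs, written against the low-level `llama_eval` binding from llama-cpp-python 0.1.x (the recycling policy below is illustrative only; the `./main` example's context swap also re-evaluates the tokens it keeps):

```python
import llama_cpp

def eval_with_context_guard(ctx, tokens, n_past, n_ctx, n_threads=4):
    """Evaluate `tokens`, resetting n_past before it can run past the context window."""
    if n_past + len(tokens) >= n_ctx:
        # Drop the oldest half of the window; a real implementation (like ./main)
        # must also re-feed the tokens it wants to keep before continuing.
        n_past = n_ctx // 2

    arr = (llama_cpp.llama_token * len(tokens))(*tokens)
    llama_cpp.llama_eval(ctx, arr, len(tokens), n_past, n_threads)
    return n_past + len(tokens)
```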
Just spent a few hours debugging a related issue.
The `n_parts` argument got removed from a recent version of llama.cpp, so if you are compiling from source on a newer commit, you will hit this issue.
This llama.cpp commit removes the `n_parts` parameter: https://github.com/ggerganov/llama.cpp/commit/dc271c52ed65e7c8dfcbaaf84dabb1f788e4f3d0
So this code in llama-cpp-python is now invalid when paired with llama.cpp mainline: https://github.com/abetlen/llama-cpp-python/blob/01a010be521c076f851789ad56bec82284fdf96e/llama_cpp/llama_cpp.py#L116
Deleting this line fixes the issue.
For me, it manifested as `GGML_ASSERT: ggml.c:5702: ggml_is_contiguous(a)`, but I think it could manifest in a lot of ways, since it is basically just memory corruption due to the literal byte-cast interpretation of this params struct.
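To make the failure mode concrete: `llama_context_params` is a raw ctypes mirror of the C struct, so one stale field shifts every field after it when llama.cpp reads the bytes. A rough sketch of the offending definition (field order abbreviated from memory; only the marked line matters):

```python
from ctypes import Structure, c_bool, c_int

class llama_context_params(Structure):
    _fields_ = [
        ("n_ctx", c_int),
        ("n_parts", c_int),  # removed from llama.cpp upstream; deleting this line fixes the mismatch
        ("seed", c_int),
        ("f16_kv", c_bool),
        # ... remaining fields unchanged ...
    ]
```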
As a side note, I don't think token generation is actually accelerated by CLBlast yet? It's behind a PR in the llama.cpp repo, and my observation is that the GPU has no load no matter what I set `n_gpu_layers` to... but maybe something was wrong with my quick CLBlast test.
Point being that maybe this has nothing to do with the GPU.
https://github.com/ggerganov/llama.cpp/pull/1459 adds fairly good OpenCL support and was merged 5 days ago. The readme now also says: "The CLBlast build supports --gpu-layers|-ngl like the CUDA version does."
I've tested the Windows CLBlast builds, and they work pretty well on my 3080, with 250ms per token with some offloading and 450ms without offloading. That said, I can't get it to work with llama-cpp-python; it seems to ignore GPU layers with CLBlast.
I confirmed that the latest `llama-cpp-python` should have picked up the CLBlast support:
/vendor/llama.cpp$ git log | head -3
commit 66874d4fbcc7866377246efbcee938e8cc9c7d76
Author: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Date: Thu May 25 20:18:01 2023 -0600
Closing. Please update to the latest `llama-cpp-python`, which should include better CLBlast support.
Expected Behavior
Continue the generation and gracefully exit.
Current Behavior
Segmentation fault while generating tokens. It usually happens after generating ~121 tokens (I did 4 different prompts, which crashed at token 122, 121, 118 and 124), and it doesn't seem to happen in the llama.cpp `./main` example.
Environment and Context
I am using a context size of 512, prediction length of 256 and batch size of 1024. The rest of the settings are defaults. I am also using CLBlast, which on llama.cpp gives me a 2.5x boost in performance, and libllama.so built from the latest llama.cpp source, so I can debug it with gdb.
Linux bober-desktop 6.3.1-x64v1-xanmod1-2 #1 SMP PREEMPT_DYNAMIC Sun, 07 May 2023 10:32:57 +0000 x86_64 GNU/Linux
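For reference, the settings described above map onto the high-level llama-cpp-python API roughly like this (placeholder model path; the gist may wire these values up differently):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # placeholder path
    n_ctx=512,     # context size
    n_batch=1024,  # batch size
)
# "prediction 256" corresponds to max_tokens=256 on the completion call.
```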
Failure Information (for bugs)
Steps to Reproduce
1. Call `llamaChat.load_context` with some lengthy prompt (mine has 1300 characters).
2. Call `llamaChat.generate` and try to generate something; I used this piece of code: `print(tokens)`.
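A stand-alone approximation of those steps using only the public high-level API (the `llamaChat` helper from the gist is not reproduced here; prompt and model path are placeholders):

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=512, n_batch=1024)

lengthy_prompt = "A long role-play style system prompt ... " * 32  # stand-in for a ~1300-character prompt

# Stream tokens one at a time so the crash point (around token ~120 in the report) is visible.
for chunk in llm(lengthy_prompt, max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
```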