Vulkan attempts to pin the allocated CPU memory (basically, allocate it in a way that lets it be transferred into VRAM without a memcpy to a staging buffer); this can fail if the driver doesn't provide enough host memory.
You can check how much host memory your driver provides with vulkaninfo; look at the memory heaps.
When there is not enough memory available, the allocation fails, prints this error message, and falls back to a regular CPU allocation. It makes a small difference for performance, but you can safely ignore the error.
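For reference, one way to inspect those heaps on a Unix-like system (assuming vulkaninfo from vulkan-tools / the Vulkan SDK is installed; the exact output layout varies by driver):

vulkaninfo | grep -A 4 memoryHeaps

Heaps flagged DEVICE_LOCAL are VRAM; the remaining heap(s) are the host memory the driver exposes, which is where the pinned allocation has to come from.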
I had a similar issue, but it was with a model (llama 3.1, to be exact) that accepts a large context, and I assume that large context size was also the default value.
All I had to do was use --ctx_size=16384
to fix the issue (the model's default context size is much bigger, and yes, 16384 is still larger than what the models above need).
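For example, something like this (the model path is just a placeholder; -c is the short form of the context-size flag used in the commands further down):

llama-server -m ./models/your-model.gguf -c 16384

This keeps the KV-cache allocation bounded by 16384 tokens instead of whatever the model's (much larger) default context would require.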
Hopefully you can find other tips on how to reduce memory consumption.
I tried building llama.cpp today and ended up with the same issue yuri mentioned (on Windows, 64 GB RAM and a GPU with 16 GB VRAM).
After trying to track down the issue myself, it seems related to Vulkan (sometimes?) having a 4 GB per-allocation limit, which is what vulkaninfo reports on my machine.
It is only when the combined model+context (seemingly) comes in under 4 GB that I do not receive said error.
I tried models of size 2.5 GB, 5.5 GB, 12 GB, and 17 GB.
Only the 2.5 GB model plus a small 1024 context length resulted in no memory-related errors.
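If you want to check whether your driver reports the same cap, I believe the relevant field is maxMemoryAllocationSize, which vulkaninfo prints under the maintenance3 properties. On Windows you can filter for it with something like:

vulkaninfo | findstr /i maxMemoryAllocationSize

(or pipe through grep -i on Linux/FreeBSD).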
Edit: now I am less sure. I can finagle the context up as high as 20000 without it giving the memory error, as long as I specify the number of layers to be offloaded.
These will work:
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080 --gpu-layers 33
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080 -c 20000 --gpu-layers 33
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080 -c 1024
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080 -c 3072
Meanwhile, these won't:
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080
llama-server.exe -m .\models\llama-2-7b.Q2_K.gguf --port 8080 -c 4096
Edit 2: It seems that explicitly telling it to offload X layers (even when that is all of the layers), with the context length also specified, works as long as the total comes in under VRAM. I assume it's making one allocation per layer when told to offload explicitly, but I'm not sure how to check (rough per-layer numbers after the examples below).
For instance, with a 17.5 GB model (43 offload-able layers) on a 16 GB card:
works:
llama-server.exe -m .\models\codellama-34b.Q4_0.gguf --port 8080 --gpu-layers 30 -c 4096
llama-server.exe -m .\models\codellama-34b.Q4_0.gguf --port 8080 --gpu-layers 38 -c 512
fails:
llama-server.exe -m .\models\codellama-34b.Q4_0.gguf --port 8080 --gpu-layers 43 -c 512
llama-server.exe -m .\models\codellama-34b.Q4_0.gguf --port 8080 --gpu-layers 38 -c 1024
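(Rough numbers, assuming the layers are roughly equal in size: 17.5 GB / 43 layers ≈ 0.41 GB per layer, so 30 layers ≈ 12 GB and 38 layers ≈ 15.5 GB. 38 layers plus a 1024-token KV cache leaves essentially no headroom on a 16 GB card, and all 43 layers (≈ 17.5 GB) cannot fit at all, which lines up with the failing cases above.)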
What happened?
llama.cpp prints this error when larger models are loaded:
The complete log is:
Name and Version
FreeBSD 14.1
What operating system are you seeing the problem on?
No response
Relevant log output
No response