Closed: DarkReaperBoy closed this issue 1 year ago.
I know it might get ignored as an issue, but I wanted to make a feature request since I hate doing the balancing manually.
If you have problems with a particular value of ngl, try smaller values.
Unless the model easily fits in VRAM, finding the optimal value for ngl is a non-trivial task, and it is usually done by trial and error. A formula that works well in most cases is probably not known, and I don't expect one to be found any time soon.
One strategy that may work is to fill only 1/2 or 2/3 of VRAM with model parameters, leaving the rest for intermediate data and overhead.
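As a rough worked example (illustrative numbers only: a 13 GB model file with 40 layers on a 6 GB card): each layer takes about 13 GB / 40 ≈ 0.33 GB, and 2/3 of 6 GB is about 4 GB, so a starting point would be around 4 / 0.33 ≈ 12 layers, adjusted up or down from there by trial and error.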
I personally use the CLBlast back-end because its VRAM requirements are more predictable than those of hipBLAS (another back-end that works with my GPU).
I see. The sweet spot with an RTX 2060 mobile (6 GB of VRAM) and a 13 GB model is 18 layers out of 40. Still, after a long conversation (at 4k tokens) it gets a CUDA error and I have to reduce it to 16 layers for it to even generate, and I find that frustrating since I have to load the model all over again because I offload the model into RAM. But since you said "A formula that works well in most cases is probably not known, and I don't expect it to be found any time soon," I think I'll close the issue. Thanks for your time 🙏
I noticed that too with ROCm/hipBLAS back-end, which shares code with CUDA/cuBLAS back-end. It leaks or wastes VRAM. That's why I went back to OpenCL/CLBlast back-end for the time being.
I wrote a script to give me a table of the max model size my system can load for each model I have.
$ ./ltest.sh
(standard_in) 1: syntax error
7 models found!
Testing model: /media/sdb3/Models/33-Zephyrus_Prude/Zephyrus-L1-33B.q4_K_M.gguf
Layers: 99 50 25 13 7 10 9 8 MAX=7
Testing model: /media/sdb3/Models/33-Zephyrus_Prude/Zephyrus-L1-33B.q5_K_M.gguf
Layers: 99 50 25 13 7 4 6 7 MAX=6
Testing model: /media/sdb3/Models/multimodal/ggml-model-q5_k.gguf
Layers: 99 50 25 13 19 16 15 16 MAX=15
Testing model: /media/sdb3/Models/nous-capybara-7b-v1.9.Q5_K_M.gguf
Layers: 99 50 25 13 19 16 15 16 MAX=15
Testing model: /media/sdb3/Models/openchat_3.5.Q4_0.gguf
Layers: 99 50 25 13 19 22 21 20 MAX=19

You could use this as a base for your own version of the script.
#!/bin/bash
myprompt="What has more legs? A dog or a chicken?"
#Get a list of models and stuff them into an array
declare -a models
#rawmodels=($(find -L /media/sdb3/Models /media/sd/Projects/Neural/LLM/ModelsBigGGUF /dir/Neural/LLM/ModelsBigGGUF2 -type f -size +1000M -name \*.gguf))
rawmodels=($(find -L /media/sdb3/Models -type f -size +1000M -name \*.gguf))
for file in "${rawmodels[@]}"
do
unsoptions+=("$file")
done
models=($(printf "%s\n" "${unsoptions[@]}" |sort ))
nummodels=${#models[@]}
if (( $(echo "$nummodels = 0" | bc -l) )); then
echo "No Models Found!"
exit
fi
i=0
for model in "${models[@]}"; do
echo "${model}"
i=$((i+1))
done
echo "${nummodels} models found!"
max_layers=()
for model in "${models[@]}"; do
echo "Testing model: $model"
echo -n "Layers: "
# Initialize variables
layers=99; maxlayer=0; oldlayer=-1; leastbad=99
count=0; testcount=10
success=false
while ! "$success"; do
echo -n "${layers} "
# Adjust the path to your llama.cpp main binary and the generation options to match your setup
myout=$(/pr/Neural/LLM/llama.cpp/build/bin/main -m "${model}" -t 8 --temp 0.45 --mirostat 2 --mirostat-ent 6 --mirostat-lr 0.2 -n 2 -c 2048 -b 512 --repeat-last-n 1600 --repeat-penalty 1.2 --log-disable -ngl ${layers} -p "${myprompt}" 2>&1 > /dev/null)
exit_code=$?
if [ $exit_code -eq 0 ]; then
# If successful, note maximum successful layer size
if [ $layers -gt $maxlayer ]; then
maxlayer=$layers
fi
# If successful, try to increase the number of layers
halfdiff=$(((leastbad - layers + 1) / 2))
layers=$((layers + halfdiff))
if (( $(echo "$layers > 99" | bc -l) )); then
success=true
fi
else
# If failed see if this is the smallest failed layer size
if [ $layers -lt $leastbad ]; then
leastbad=$layers
fi
# If failed, halve the number of layers between current layer and largest known working layer (initially 0)
halfdiff=$(((layers - maxlayer + 1) / 2))
layers=$((maxlayer + halfdiff))
fi
# Stop if we're getting identical layers
if [ $oldlayer -eq $layers ]; then
success=true
if [ $exit_code -eq 0 ]; then
maxlayer=0
fi
fi
# Stop after a certain number of tests
if [ $count -gt $testcount ]; then
success=true
fi
count=$((count+1))
oldlayer=$layers
done
max_layers+=("$maxlayer")
echo " MAX=${maxlayer}"
done
# Print results
echo ""
i=0
for model in "${models[@]}"; do
echo "${model} ${max_layers[$i]}"
i=$((i+1))
done
It's not error-free, and it will require changes to fit your circumstances.
> I wrote a script to give me a table of the max model size my system can load for each model I have.
> $ ./ltest.sh
> 7 models found!
> ...
Any reason why a Python equivalent of the above shell script wouldn't work just as well?
I'm interested in using a specific model (the 13B Q4_K_M Llama 2 chat) with GPU. This is not a complete solution, just a record of some experiments I did; I don't know llama.cpp and C++ very well. The max memory requirement for the model is taken from https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF and used to calculate the memory per layer. There are probably better ways to get it with the llama.cpp classes instead, but I'm not interested in looking into that right now. Running with llama-cli displays some info, including that the model has 41 layers, so that's roughly 250 MB per layer after the max RAM required (10370 MB) is divided by 41. After some tests I felt that this could be lower than the actual value, so I increased it to 270 MB to be on the safe side. I'm playing with the Vulkan backend, modifying the main.cpp/llama-cli example, using commit 7ea8d80d on a Windows 10 system with 8 cores and a dedicated graphics card with 8 GB VRAM.
First we include the ggml-vulkan header in main.cpp because it has the method ggml_backend_vk_get_device_memory that we need to get the total amount of VRAM:
#include "ggml-vulkan.h"
Then, at the start of the main function, we can optionally set some parameters. Setting use_mmap to false seems to help Task Manager display the actual RAM usage, and LLAMA_SPLIT_MODE_NONE limits usage to a single GPU if there are multiple, I think. My system has a single GPU, and it's outside the scope of this experiment to try to guess the behavior on configurations I don't own:
params.use_mmap = false;
params.split_mode = LLAMA_SPLIT_MODE_NONE;
Now let's call ggml_backend_vk_get_device_memory. We place the calls before llama_init_from_gpt_params is called.
ggml_backend_vk_init(params.main_gpu);
size_t gpu_total_mem;
size_t gpu_free_mem;
ggml_backend_vk_get_device_memory(params.main_gpu, &gpu_free_mem, &gpu_total_mem);
It also returns a free-memory value, but the two are the same at the time of writing. Now we guess the number of layers to offload to the GPU:
gpu_total_mem = gpu_total_mem / (1024 * 1024) - 1200;
if(gpu_total_mem > 2700) params.n_gpu_layers = gpu_total_mem / 270;
First, gpu_total_mem is converted to MB and reduced by 1.2 GB because there may be other running processes that require VRAM. Then, if the GPU has at least 4 GB of VRAM, we divide the amount of memory we plan to use by the memory-per-layer value to determine how many layers we can offload. If not, it gets more tricky, both in terms of calculating the number of layers and determining whether offloading would improve performance at all. I think a more precise solution may be needed in that case, and maybe CUDA should be tried instead, since on my system I'm not sure I see any benefit from using Vulkan for that specific model with 4 GB of VRAM except to reduce RAM usage by offloading layers to VRAM.
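To make the heuristic easier to reuse, here is a small standalone sketch of the same calculation. The function name is my own, and the clamp to the model's layer count is an addition beyond the inline code above; the 1200 MiB headroom, 270 MiB per layer, and 41 layers are the values from my experiment and would need adjusting for other models and systems.

#include <cstddef>

// Sketch only: same guess as the inline code above, wrapped in a helper so the constants are explicit.
static int guess_n_gpu_layers(std::size_t vram_bytes,      // total VRAM from ggml_backend_vk_get_device_memory
                              std::size_t headroom_mib,    // e.g. 1200 MiB reserved for other processes
                              std::size_t mib_per_layer,   // e.g. 270 MiB for the 13B Q4_K_M model
                              int         n_layers_total)  // e.g. 41 for that model
{
    const std::size_t vram_mib = vram_bytes / (1024 * 1024);
    if (vram_mib <= headroom_mib + 2700) {
        return 0; // roughly "less than ~4 GB of VRAM": leave n_gpu_layers at the llama.cpp default
    }
    const std::size_t usable_mib = vram_mib - headroom_mib;
    const int n = (int) (usable_mib / mib_per_layer);
    return n < n_layers_total ? n : n_layers_total; // don't offload more layers than the model has
}

With about 8 GB of VRAM this gives (8192 - 1200) / 270 ≈ 25 layers, the same result as the inline version above.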
Bonus round for anybody who might be interested in getting the type of the GPU. We could insert a function that returns an int with the GPU type, taking the values listed in the VkPhysicalDeviceType enum; online reference at https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkPhysicalDeviceType.html
Add this to ggml-vulkan.h:
GGML_API GGML_CALL int ggml_backend_vk_get_device_type(int device);
And then ggml-vulkan.cpp:
GGML_CALL int ggml_backend_vk_get_device_type(int device) {
GGML_ASSERT(device < (int)vk_instance.device_indices.size());
vk::PhysicalDevice vkdev = vk_instance.instance.enumeratePhysicalDevices()[vk_instance.device_indices[device]];
vk::PhysicalDeviceProperties2 new_props;
vkdev.getProperties2(&new_props);
return (int)new_props.properties.deviceType;
}
Then we could call it in main.cpp like this:
int deviceType = ggml_backend_vk_get_device_type(params.main_gpu);
This could then be used to create logic that limits the implementation to configurations we're able to test and leaves the rest to llama.cpp defaults.
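For example, a sketch of that gating (the policy of only enabling the guess on discrete GPUs is an assumption on my part, and the constant 2 corresponds to VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU in the enum linked above):

int deviceType = ggml_backend_vk_get_device_type(params.main_gpu);
// 2 == VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU; only auto-guess on the configuration tested here
if (deviceType == 2) {
    // same heuristic as above; like the original, it assumes the card has well over 1200 MB of VRAM
    gpu_total_mem = gpu_total_mem / (1024 * 1024) - 1200;
    if (gpu_total_mem > 2700) {
        params.n_gpu_layers = gpu_total_mem / 270;
    }
}
// anything else (integrated, virtual, CPU, other) is left at llama.cpp defaults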
@svetlyo81 Does your method use additional data, like the data from the page that you linked?
@shibe2 Yes, please find the data below in case the link can't be opened.
The physical device types which may be returned in VkPhysicalDeviceProperties::deviceType are:
typedef enum VkPhysicalDeviceType {
VK_PHYSICAL_DEVICE_TYPE_OTHER = 0,
VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU = 1,
VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU = 2,
VK_PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU = 3,
VK_PHYSICAL_DEVICE_TYPE_CPU = 4,
} VkPhysicalDeviceType;
VK_PHYSICAL_DEVICE_TYPE_OTHER - the device does not match any other available types.
VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU - the device is typically one embedded in or tightly coupled with the host.
VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU - the device is typically a separate processor connected to the host via an interlink.
VK_PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU - the device is typically a virtual node in a virtualization environment.
VK_PHYSICAL_DEVICE_TYPE_CPU - the device is typically running on the same processors as the host.
The physical device type is advertised for informational purposes only, and does not directly affect the operation of the system. However, the device type may correlate with other advertised properties or capabilities of the system, such as how many memory heaps there are.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do.

Adding --auto-devices causes llama.cpp to auto-detect n-gpu-layers and balance it, or at least adding --gpu does it.

Current Behavior

Please provide a detailed written description of what llama.cpp did, instead.

It just doesn't work that way by default: you have to specify --n-gpu-layers x, and after a while of chatting it gives a CUDA error with SillyTavern + webui, or you add too many layers and it becomes resource-hungry and freezes the system 🤷

Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
$ lscpu