ggerganov / llama.cpp

LLM inference in C/C++
MIT License

[Request] adding auto mode to n-gpu-layers #3719

Closed DarkReaperBoy closed 1 year ago

DarkReaperBoy commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do.

Adding --auto-devices should make llama.cpp auto-detect n-gpu-layers and balance the offload automatically, or at least adding --gpu should do it.

Current Behavior

Please provide a detailed written description of what llama.cpp did, instead.

Instead, it doesn't offload anything by default and you have to specify --n-gpu-layers x yourself. After a while of chatting it gives a CUDA error with SillyTavern + webui, or if you add too many layers it becomes resource hungry and freezes the system 🤷

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

$ lscpu


  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
    CPU family:          6
    Model:               165
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            2
    CPU(s) scaling MHz:  83%
    CPU max MHz:         5000.0000
    CPU min MHz:         800.0000
    BogoMIPS:            5202.65
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi pku ospke sgx_lc md_clear flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    1.5 MiB (6 instances)
  L3:                    12 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:         
  Gather data sampling:  Mitigation; Microcode
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; Enhanced IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Not affected

* Operating System, e.g. for Linux:

`$ uname -a`

Linux localhost.localdomain 6.5.6-1-default #1 SMP PREEMPT_DYNAMIC Fri Oct  6 11:20:48 UTC 2023 (c97c2df) x86_64 x86_64 x86_64 GNU/Linux (openSUSE Tumbleweed)

DarkReaperBoy commented 1 year ago

I know it might get ignored as an issue, but I wanted to make a feature request since I hate doing the balancing by hand.

shibe2 commented 1 year ago

If you have problems with a particular value of ngl, try smaller values.

Unless the model easily fits in VRAM, finding the optimal value for ngl is a non-trivial task and is usually done by trial and error. A formula that works well in most cases is probably not known, and I don't expect it to be found any time soon.

One strategy that may work is to fill only 1/2 or 2/3 of VRAM with model parameters, leaving the rest for intermediate data and overhead.
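
A minimal sketch of that strategy in C++ (the helper name estimate_ngl and all numbers are made up for illustration; this is not a llama.cpp API, just the arithmetic behind picking a starting -ngl):

#include <cstdint>
#include <cstdio>

// Minimal sketch of the "fill only ~2/3 of VRAM with weights" strategy.
// estimate_ngl and all constants here are made up for illustration.
static int estimate_ngl(uint64_t vram_bytes, uint64_t model_bytes, int n_layers) {
    const uint64_t budget    = vram_bytes * 2 / 3;          // leave ~1/3 for KV cache and overhead
    const uint64_t per_layer = model_bytes / (uint64_t)n_layers;
    const uint64_t fits      = budget / per_layer;          // layers that fit the budget
    return fits > (uint64_t)n_layers ? n_layers : (int)fits;
}

int main() {
    // Example: 8 GiB of VRAM, a ~7 GB model file with 32 layers.
    printf("-ngl %d\n", estimate_ngl(8ull << 30, 7ull * 1000 * 1000 * 1000, 32));
    return 0;
}

Treat the result as a conservative starting point and only raise ngl if generation stays stable over long contexts.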

I personally use CLBlast back-end, because its VRAM requirements are more predictable than those of hipBLAS (another back-end that works with my GPU).

DarkReaperBoy commented 1 year ago

I see. The sweet spot with an RTX 2060 mobile (6 GB of VRAM) and a 13 GB model is 18 layers out of 40. Still, after a long conversation (at 4k tokens) it gets a CUDA error and I have to reduce to 16 layers for it to even generate, which I find frustrating since I have to load the model all over again (I offload the rest of the model into RAM). But since you said "A formula that works well in most cases is probably not known, and I don't expect it to be found any time soon." I think I'll close the issue. Thanks for your time 🙏

shibe2 commented 1 year ago

I noticed that too with ROCm/hipBLAS back-end, which shares code with CUDA/cuBLAS back-end. It leaks or wastes VRAM. That's why I went back to OpenCL/CLBlast back-end for the time being.

clort81 commented 1 year ago

I wrote a script to give me a table of the max model size my system can load for each model I have.

$ ./ltest.sh
7 models found!
Testing model: /media/sdb3/Models/33-Zephyrus_Prude/Zephyrus-L1-33B.q4_K_M.gguf
Layers: 99 50 25 13 7 10 9 8 MAX=7
Testing model: /media/sdb3/Models/33-Zephyrus_Prude/Zephyrus-L1-33B.q5_K_M.gguf
Layers: 99 50 25 13 7 4 6 7 MAX=6
Testing model: /media/sdb3/Models/multimodal/ggml-model-q5_k.gguf
Layers: 99 50 25 13 19 16 15 16 MAX=15
Testing model: /media/sdb3/Models/nous-capybara-7b-v1.9.Q5_K_M.gguf
Layers: 99 50 25 13 19 16 15 16 MAX=15
Testing model: /media/sdb3/Models/openchat_3.5.Q4_0.gguf
Layers: 99 50 25 13 19 22 21 20 MAX=19

You could use this as a base for your own version of the script.

#!/bin/bash

myprompt="What has more legs? A dog or a chicken?"

#Get a list of models and stuff them into an array
declare -a models
#rawmodels=($(find -L /media/sdb3/Models /media/sd/Projects/Neural/LLM/ModelsBigGGUF /dir/Neural/LLM/ModelsBigGGUF2 -type f -size +1000M -name \*.gguf))
rawmodels=($(find -L /media/sdb3/Models -type f -size +1000M -name \*.gguf))
for file in "${rawmodels[@]}"
do
 unsoptions+=("$file")
done
models=($(printf "%s\n" "${unsoptions[@]}" |sort ))
nummodels=${#models[@]}
if (( nummodels == 0 )); then
        echo "No Models Found!"
        exit
fi
i=0
for model in "${models[@]}"; do
    echo "${model}"
    i=$((i+1))
done
echo "${nummodels} models found!"

max_layers=()

for model in "${models[@]}"; do
    echo "Testing model: $model"
    echo -n "Layers: "

    # Initialize variables
    layers=99; maxlayer=0; oldlayer=-1; leastbad=99
    count=0; testcount=10
    success=false

    while ! "$success"; do

            echo -n "${layers} "
            myout=$(/pr/Neural/LLM/llama.cpp/build/bin/main  -m ${model} -t 8 --temp 0.45 --mirostat 2 --mirostat-ent 6 --mirostat-lr 0.2 -n 2 -c 2048 -b 512 --repeat-last-n 1600 --repeat-penalty 1.2 --log-disable -ngl ${layers} -p "${myprompt}" 2>&1 > /dev/null)

        exit_code=$?

        if [ $exit_code -eq 0 ]; then

            # If successful, note maximum successful layer size 
            if [ $layers -gt $maxlayer ]; then
                    maxlayer=$layers
            fi

            # If successful, try to increase the number of layers
            halfdiff=$(((leastbad - layers + 1) / 2))
            layers=$((layers + halfdiff))

            if (( layers > 99 )); then
                success=true
            fi
        else
            # If failed see if this is the smallest failed layer size
            if [ $layers -lt $leastbad ]; then
                    leastbad=$layers
            fi

            # If failed, halve the number of layers between current layer and largest known working layer (initially 0)
            halfdiff=$(((layers - maxlayer + 1) / 2))
            layers=$((maxlayer + halfdiff))
        fi
        # Stop if we're getting identical layers 
        if [ $oldlayer -eq $layers ]; then
            success=true
            if [ $exit_code -eq 0 ]; then
                maxlayer=0
            fi
        fi

        # Stop after a certain number of tests
        if [ $count -gt $testcount ]; then
             success=true
        fi
        count=$((count+1))
        oldlayer=$layers
    done
    max_layers+=("$maxlayer")
    echo " MAX=${maxlayer}"
done

# Print results
echo ""
i=0
for model in "${models[@]}"; do
    echo "${model} ${max_layers[$i]}"
    i=$((i+1))
done

It's not error-free and will need adjusting to your circumstances.

pcompieta commented 11 months ago

I wrote a script to give me a table of the max model size my system can load for each model I have.

$ ./ltest.sh
7 models found!
Testing model: /media/sdb3/Models/33-Zephyrus_Prude/Zephyrus-L1-33B.q4_K_M.gguf
Layers: 99 50 25 13 7 10 9 8 MAX=7
Testing model: /media/sdb3/Models/33-Zephyrus_Prude/Zephyrus-L1-33B.q5_K_M.gguf
Layers: 99 50 25 13 7 4 6 7 MAX=6
...

Is there any reason why a Python equivalent of the above shell script hasn't been written?

svetlyo81 commented 2 months ago

I'm interested in using a specific model (the 13B Q4_K_M Llama 2 chat) with the GPU. This is not a complete solution, just a record of some experiments I did; I don't know llama.cpp and C++ very well.

The max memory requirement for the model is taken from https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF and used to calculate the memory per layer. There are probably better ways to get it from the llama.cpp classes instead, but I'm not interested in looking into that right now. Running with llama-cli displays some info, including that the model has 41 layers, so that's roughly 250 MB per layer after the max RAM required (10370 MB) is divided by 41. After some tests I felt that this could be lower than the actual value, so I increased it to 270 to be on the safe side.

I'm playing with the Vulkan backend, modifying the main.cpp/llama-cli example, using commit 7ea8d80d on a Windows 10 system with 8 cores and a dedicated graphics card with 8 GB VRAM.
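
As a quick sanity check on that per-layer figure (just the arithmetic from the paragraph above; the 10370 MB and 41 layers are the values quoted there, 270 MB is the padded value used later):

#include <cstdio>

// Back-of-the-envelope check of the per-layer estimate quoted above.
int main() {
    const double max_ram_mb = 10370.0;  // "max RAM required" quoted for the 13B Q4_K_M model
    const int    n_layers   = 41;       // layer count reported by llama-cli
    printf("per layer: %.1f MB, padded estimate: %d MB\n", max_ram_mb / n_layers, 270);
    return 0;
}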

First we include the ggml-vulkan header in main.cpp because it has the method ggml_backend_vk_get_device_memory that we need to get the total amount of VRAM:

#include "ggml-vulkan.h"

Then, at the start of the main function, we can optionally set some parameters: setting use_mmap to false seems to help display the actual RAM usage in Task Manager, and LLAMA_SPLIT_MODE_NONE limits usage to a single GPU if there are multiple, I think. My system has a single GPU and it's outside the scope of this experiment to guess the behavior on configurations I don't own:

params.use_mmap = false;
params.split_mode = LLAMA_SPLIT_MODE_NONE;

Now let's call ggml_backend_vk_get_device_memory. We place the calls before llama_init_from_gpt_params is called.

ggml_backend_vk_init(params.main_gpu);

size_t gpu_total_mem;
size_t gpu_free_mem;
ggml_backend_vk_get_device_memory(params.main_gpu, &gpu_free_mem, &gpu_total_mem);

It also returns a free value, but both values are the same at the time of writing. Now we guess the number of layers to offload to the GPU:

gpu_total_mem = gpu_total_mem / (1024 * 1024) - 1200;
if(gpu_total_mem > 2700) params.n_gpu_layers = gpu_total_mem / 270;

First, gpu_total_mem is converted to MB and reduced by 1.2 GB, because other running processes may also need VRAM. Then, if the GPU has at least 4 GB of VRAM, we divide the amount of memory we plan to use by the memory-per-layer value to determine how many layers we can offload. If not, it gets trickier, both in terms of calculating the number of layers and determining whether offloading would improve performance at all. I think a more precise solution may be needed in that case, and maybe also try CUDA instead, since on my system I'm not sure I see any benefit from using Vulkan for this specific model with 4 GB VRAM, except to reduce RAM usage by offloading layers to VRAM.
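
Putting these pieces together, the whole guess could be collected into a small helper like this (a sketch of the same heuristic, not a drop-in patch; guess_n_gpu_layers is a made-up name, and the 1200 MB reserve and 270 MB per-layer figures are the assumptions described above):

#include <cstddef>

// Sketch of the heuristic described above; all constants are assumptions:
//  - keep ~1.2 GB of VRAM in reserve for other processes,
//  - only offload if at least ~2.7 GB remains (roughly a >= 4 GB card),
//  - divide the remaining budget by the ~270 MB per-layer estimate for this model.
static int guess_n_gpu_layers(size_t gpu_total_bytes) {
    const size_t total_mb   = gpu_total_bytes / (1024 * 1024);
    const size_t reserve_mb = 1200;
    if (total_mb <= reserve_mb) return 0;
    const size_t budget_mb = total_mb - reserve_mb;
    if (budget_mb <= 2700) return 0;        // too little VRAM: leave the default
    return (int)(budget_mb / 270);
}

// Usage, mirroring the calls shown earlier in this comment:
// size_t vram_free = 0, vram_total = 0;
// ggml_backend_vk_get_device_memory(params.main_gpu, &vram_free, &vram_total);
// int ngl = guess_n_gpu_layers(vram_total);
// if (ngl > 0) params.n_gpu_layers = ngl;  // otherwise keep llama.cpp's default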

Bonus round for anybody who might be interested in getting the type of the GPU. We could add a method that returns an int with the GPU type, taking values from the VkPhysicalDeviceType enum (online reference at https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkPhysicalDeviceType.html).

Add this to ggml-vulkan.h:

GGML_API GGML_CALL int ggml_backend_vk_get_device_type(int device);

And then ggml-vulkan.cpp:

GGML_CALL int ggml_backend_vk_get_device_type(int device) {
    GGML_ASSERT(device < (int)vk_instance.device_indices.size());

    vk::PhysicalDevice vkdev = vk_instance.instance.enumeratePhysicalDevices()[vk_instance.device_indices[device]];

    vk::PhysicalDeviceProperties2 new_props;
    vkdev.getProperties2(&new_props);
    return (int)new_props.properties.deviceType;
}

Then we could call it in main.cpp like:

int deviceType = ggml_backend_vk_get_device_type(params.main_gpu);

This could then be used to add logic that applies the heuristic only on configurations we're able to test, as sketched below, and leaves everything else at the llama.cpp defaults.
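
For example (a hypothetical gate in main.cpp; guess_n_gpu_layers is the made-up helper sketched above, and the value 2 is VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU from the enum quoted in the next comment):

// Only apply the automatic -ngl guess on a discrete GPU; anything else keeps the defaults.
int deviceType = ggml_backend_vk_get_device_type(params.main_gpu);
if (deviceType == 2 /* VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU */) {
    size_t vram_free = 0, vram_total = 0;
    ggml_backend_vk_get_device_memory(params.main_gpu, &vram_free, &vram_total);
    int ngl = guess_n_gpu_layers(vram_total);  // sketch from earlier in this comment
    if (ngl > 0) params.n_gpu_layers = ngl;
}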

shibe2 commented 2 months ago

@svetlyo81 Does your method use additional data like from the page that you linked?

svetlyo81 commented 2 months ago

@shibe2 Yes, please find the data below in case the link can't be opened.

The physical device types which may be returned in VkPhysicalDeviceProperties::deviceType are:

typedef enum VkPhysicalDeviceType {
    VK_PHYSICAL_DEVICE_TYPE_OTHER = 0,
    VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU = 1,
    VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU = 2,
    VK_PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU = 3,
    VK_PHYSICAL_DEVICE_TYPE_CPU = 4,
} VkPhysicalDeviceType;

VK_PHYSICAL_DEVICE_TYPE_OTHER - the device does not match any other available types.

VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU - the device is typically one embedded in or tightly coupled with the host.

VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU - the device is typically a separate processor connected to the host via an interlink.

VK_PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU - the device is typically a virtual node in a virtualization environment.

VK_PHYSICAL_DEVICE_TYPE_CPU - the device is typically running on the same processors as the host.

The physical device type is advertised for informational purposes only, and does not directly affect the operation of the system. However, the device type may correlate with other advertised properties or capabilities of the system, such as how many memory heaps there are.
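
If it helps for logging, a tiny helper (hypothetical; not part of llama.cpp or the Vulkan headers) can map that integer back to a readable label:

// Maps the int returned by ggml_backend_vk_get_device_type to a readable name,
// following the VkPhysicalDeviceType values quoted above.
static const char * vk_device_type_name(int type) {
    switch (type) {
        case 0:  return "other";
        case 1:  return "integrated GPU";
        case 2:  return "discrete GPU";
        case 3:  return "virtual GPU";
        case 4:  return "CPU";
        default: return "unknown";
    }
}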