LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

[Enhancement] Auto GPU layers option #390

Closed - kalomaze closed this 1 month ago

kalomaze commented 1 year ago

A setting that automatically gauges available VRAM, compares it with the size of the model being loaded into memory, and selects the 'safe max' number of layers would be a nice QoL feature for first-time users. Currently available VRAM would be a good way to gauge it.
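
Something like this rough sketch is the kind of heuristic I mean (just an illustration, not koboldcpp code - it assumes an NVIDIA card with nvidia-smi on PATH, that you already know the model's layer count, and the free_vram_mb / safe_max_layers names are made up):

import os
import subprocess

def free_vram_mb(gpu_index=0):
    # Read free VRAM (in MiB) from nvidia-smi instead of eyeballing Task Manager.
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        text=True)
    return int(out.strip().splitlines()[0])

def safe_max_layers(model_path, total_layers, reserve_mb=1024):
    # Naive assumption: each layer costs roughly file_size / total_layers,
    # which ignores the KV cache and scratch buffers that grow with context.
    file_mb = os.path.getsize(model_path) / (1024 * 1024)
    per_layer_mb = file_mb / total_layers
    usable_mb = max(free_vram_mb() - reserve_mb, 0)
    return min(total_layers, int(usable_mb // per_layer_mb))

# e.g. safe_max_layers("some-13b.Q5_K_M.gguf", total_layers=43)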

SabinStargem commented 1 year ago

I definitely would like to reserve some VRAM for non-AI tasks. 4gb out of my 3060's 12gb would be enough to feed video and most of the games that I play.

LostRuins commented 1 year ago

It's hard to estimate exactly how much VRAM will be used, since it's also affected by the current context - the best way is really trial and error (you can see VRAM usage from Task Manager), tweaking the layers offloaded. Then you can save that as known-good settings.

kalomaze commented 1 year ago

Is there no way to gauge how much of an impact the context size has and take that into account when deciding the number? Is it non-linear or unpredictable in some fashion? I'm aware that it's best to find out what you specifically can do, because hardware varies too much for the prediction to always be 'perfect' - but as a default, it would be nice if it didn't require you to retry a handful of times, and instead gave you a number that's close enough to 'optimal'.

VL4DST3R commented 1 year ago

At the very least an automated script that tests this and reports back to the user the first valid amount it finds would be nice.

Something along the lines of loading the desired model and testing a prompt (to take into consideration the extra usage when using cublas and whatnot) and seeing if it generates properly; if not, reduce the assigned layers by one and retry until it works.

This may actually be doable externally via a python script, but at that point it kinda defeats the purpose of having it as part of the package.
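
A rough sketch of what such an external script might look like (purely illustrative: it assumes koboldcpp.py is launched from the current directory, CUDA is available, and the API comes up on the default port 5001; a fuller version would also send a test prompt instead of just probing the server):

import subprocess
import time
import urllib.request

def server_up(port=5001):
    # If the local web server answers, the model finished loading.
    try:
        urllib.request.urlopen(f"http://localhost:{port}/", timeout=2)
        return True
    except Exception:
        return False

def find_working_layers(model, start_layers, contextsize=4096):
    # Step the offloaded layer count down until a launch survives.
    for layers in range(start_layers, -1, -1):
        proc = subprocess.Popen(
            ["python", "koboldcpp.py", "--model", model,
             "--usecublas", "normal", "--gpulayers", str(layers),
             "--contextsize", str(contextsize)])
        time.sleep(60)  # crude: give it time to load (or to crash)
        ok = proc.poll() is None and server_up()
        proc.terminate()
        proc.wait()
        if ok:
            return layers
    return None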

marco-trovato commented 9 months ago

I can confirm ollama does that automatically. I think a simple plaintext or .ini file with the MB/layer for each model would be an acceptable solution; comparing that to the available GPU memory would make it possible to automate picking the correct amount.
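
For illustration, a minimal sketch of that idea (the layers.ini format, section names, and numbers are all made up; free VRAM would still have to be measured separately, e.g. with nvidia-smi):

import configparser

# layers.ini (hypothetical):
# [some-13b.Q5_K_M.gguf]
# mb_per_layer = 110
# total_layers = 43

def layers_from_ini(ini_path, model_name, free_vram_mb, reserve_mb=1024):
    cfg = configparser.ConfigParser()
    cfg.read(ini_path)
    per_layer = cfg.getfloat(model_name, "mb_per_layer")
    total = cfg.getint(model_name, "total_layers")
    usable = max(free_vram_mb - reserve_mb, 0)
    return min(total, int(usable // per_layer))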

LostRuins commented 9 months ago

There's a rudimentary layer estimator in recent versions. It's not very accurate though.

marco-trovato commented 9 months ago

There's a rudimentary layer estimator in recent versions

How can I activate it?

LostRuins commented 9 months ago

Open the model in the GUI, and it will show it when you select the file.

VL4DST3R commented 9 months ago

I think a simple plaintext or .ini file with the MB/layer for each model would be an acceptable solution; comparing that to the available GPU memory would make it possible to automate picking the correct amount.

I believe that's a bit inconsistent as not all model layers are equal in size, not to mention blas and context size also affect the total amount.

If you are like me and rotate around a number of models with different sizes, you can just make a bat file with a menu selection and pre-populate it with all the desired settings for each model. It takes a bit of setting up but once you have it you don't need to think about it again.

marco-trovato commented 9 months ago

If you are like me and rotate around a number of models with different sizes, you can just make a bat file with a menu selection and pre-populate it with all the desired settings for each model.

That sounds like an awesome solution! Would you be so kind as to share the script here? I will try to adapt it to my situation, THANK YOU!!!

VL4DST3R commented 9 months ago

My current script is a lot more convoluted, with other QoL bells and whistles, but the gist of it is:

@echo off

echo  1.   model_1_user_friendly_name
echo  2.   model_2_user_friendly_name
echo  3.   model_3_user_friendly_name

choice /c 123 /n /m "Press the number corresponding to the preset to load:"

cls
echo Starting with preset [%errorlevel%]
goto Option%errorlevel%

:Option1
koboldcpp --model model_name_1 --usecublas normal --blasbatchsize 1024 --gpulayers 52 --contextsize 4096
rem stop here so the script doesn't fall through into the next preset
goto :eof

:Option2
koboldcpp --model model_name_2 --usecublas normal --gpulayers 35 --threads 12 --smartcontext --contextsize 16384
goto :eof

:Option3
koboldcpp --model model_name_3 --usecublas normal --threads 8 --contextsize 8192 --multiuser

You can go bananas with reusing chunks of options - for example, if you want to use the same context size and cublas settings for every preset, store them in a variable and use it throughout the individual "Options".

Do note that I prefer using the choice command so I don't have to press any other key to pick an entry, but this limits you to 10 number entries max (1..9 and then 0, oddly enough, but 0 returns 10 as ErrorLevel). You can opt to use letters instead (see the choice documentation for more).

Alternatively, you can just use set /p choice=Type the entry to select then press enter. - that way you can input any number of digits and do the option switching on the %choice% value instead of ErrorLevel.

marco-trovato commented 9 months ago

Does anyone have any idea how to select auto layers from the command line? (The Linux version has no GUI.)

LostRuins commented 9 months ago

You can't set the layers automatically from the command line, nor are you advised to. The GUI one is only a guideline intended for noobs; you still have to trial and error to find the real value.

LostRuins commented 1 month ago

Update on this: The newest version 1.73 (at this time) now has reasonably good auto GPU layers estimation. Give it a try!