Closed kalomaze closed 1 month ago
I definitely would like to reserve some VRAM for non-AI tasks. 4 GB out of my 3060's 12 GB would be enough to feed video and most of the games that I play.
It's hard to estimate exactly how much VRAM will be used, since it's also affected by the current context. The best way really is trial and error (you can see VRAM usage in Task Manager) and tweaking the number of layers offloaded. Then you can save that as known-good settings.
Is there no way to gauge how much impact the context size has and factor that in when deciding the number? Is it non-linear or unpredictable in some fashion? I'm aware it's best to find out what works for you specifically, because hardware varies too much for the prediction to always be perfect -- but as a default, it would be nice if it didn't require a handful of retries and instead gave you a number that's close enough to optimal.
At the very least, an automated script that tests this and reports the first valid amount it finds back to the user would be nice.
Something along the lines of loading the desired model and testing a prompt (to account for the extra usage with cuBLAS and the like), seeing if it generates properly, and if not, reducing the assigned layers by one and retrying until it works.
This may actually be doable externally via a Python script, but at that point it kinda defeats the purpose of having it as part of the package.
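The retry-until-it-fits idea above can be sketched as an external Python script. Everything here is hypothetical glue, not part of koboldcpp: it assumes a `koboldcpp` binary on PATH whose `--benchmark` run exits non-zero when the load fails (e.g. a CUDA out-of-memory error).

```python
import subprocess

def find_max_gpulayers(can_load, start_layers):
    """Walk down from start_layers until can_load(n) reports success.

    can_load: callable taking a layer count, returning True if the
    model loaded and answered a test prompt at that offload level.
    Returns the first working count, or 0 if nothing fits.
    """
    for n in range(start_layers, 0, -1):
        if can_load(n):
            return n
    return 0

def try_koboldcpp(n, model="model.gguf"):
    # Hypothetical probe: launch koboldcpp with n offloaded layers and
    # a short benchmark run; treat a non-zero exit code as "too many
    # layers". Flag names mirror the ones used elsewhere in this thread.
    result = subprocess.run(
        ["koboldcpp", "--model", model, "--usecublas", "normal",
         "--gpulayers", str(n), "--benchmark"],
        capture_output=True,
    )
    return result.returncode == 0
```

Calling `find_max_gpulayers(try_koboldcpp, 52)` would relaunch the model with 52, 51, ... layers until one run survives; a binary search would cut the number of relaunches at the cost of slightly more code.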
I confirm ollama does that automatically. I think a simple plaintext or .ini file with the MB/layer for each model would be an acceptable solution, comparing that to the GPU available memory would make it possible to somehow automate the correct amount.
There's a rudimentary layer estimator in recent versions. It's not very accurate though.
> There's a rudimentary layer estimator in recent versions
How can I activate it?
Open the model in the GUI, and it will show it when you select the file.
> I think a simple plaintext or .ini file with the MB/layer for each model would be an acceptable solution, comparing that to the GPU available memory would make it possible to somehow automate the correct amount.
I believe that would be a bit inconsistent, as not all model layers are equal in size, not to mention BLAS and context size also affect the total amount.
If you are like me and rotate around a number of models with different sizes, you can just make a bat file with a menu selection and pre-populate it with all the desired settings for each model. It takes a bit of setting up but once you have it you don't need to think about it again.
> If you are like me and rotate around a number of models with different sizes, you can just make a bat file with a menu selection and pre-populate it with all the desired settings for each model.
That sounds like an awesome solution! Would you be so kind as to share the script here? I will try to adapt it to my situation, THANK YOU!!!
My current script is a lot more convoluted, with other QoL bells and whistles, but the gist of it is:
@echo off
echo 1. model_1_user_friendly_name
echo 2. model_2_user_friendly_name
echo 3. model_3_user_friendly_name
choice /c 123 /n /m "Press the number corresponding to the preset to load:"
cls
echo Starting with preset [%errorlevel%]
goto Option%errorlevel%
:Option1
koboldcpp --model model_name_1 --usecublas normal --blasbatchsize 1024 --gpulayers 52 --contextsize 4096
goto End
:Option2
koboldcpp --model model_name_2 --usecublas normal --gpulayers 35 --threads 12 --smartcontext --contextsize 16384
goto End
:Option3
koboldcpp --model model_name_3 --usecublas normal --threads 8 --contextsize 8192 --multiuser
:End
You can go bananas with reusing chunks of options; for example, if you want the same context size and cuBLAS settings for all presets, store them in a variable and use it throughout the individual "Options".
Do note that I prefer using the `choice` command so I don't have to press any other key to pick an entry, but this limits you to 10 number entries max (1..9 and then 0, oddly enough, but 0 returns 10 as `ErrorLevel`). You can opt to use letters instead (read more here).
Alternatively, you can just use a `set /p choice=Type the entry to select then press enter.` so you can input any number of digits and do the option switching on the `choice` value instead of `ErrorLevel`.
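For anyone not on Windows, the same preset-menu idea is easy to replicate in Python. The model names and flags below are the placeholders from the batch example above, not real files:

```python
import subprocess

# One entry per preset; the keys are what the user types at the menu.
PRESETS = {
    "1": ["koboldcpp", "--model", "model_name_1", "--usecublas", "normal",
          "--blasbatchsize", "1024", "--gpulayers", "52",
          "--contextsize", "4096"],
    "2": ["koboldcpp", "--model", "model_name_2", "--usecublas", "normal",
          "--gpulayers", "35", "--threads", "12", "--smartcontext",
          "--contextsize", "16384"],
    "3": ["koboldcpp", "--model", "model_name_3", "--usecublas", "normal",
          "--threads", "8", "--contextsize", "8192", "--multiuser"],
}

def menu_text():
    """Numbered list of presets, labelled by model name."""
    return "\n".join(f"{key}. {cmd[2]}" for key, cmd in PRESETS.items())

def pick_preset(choice):
    """Return the command list for a preset key, or None if unknown."""
    return PRESETS.get(choice.strip())
```

Tie it together with something like `cmd = pick_preset(input(menu_text() + "\nPreset: "))` followed by `subprocess.run(cmd)`.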
Anyone have any idea how to select auto layers by command line? (because Linux version has no GUI)
You can't set the layers automatically from the command line, nor is it advisable. The GUI estimate is only a guideline intended for noobs; you still have to trial-and-error to find the real value.
Update on this: The newest version 1.73 (at this time) now has reasonably good auto GPU layers estimation. Give it a try!
A setting that automatically gauges currently available VRAM, compares it with the size of the model being loaded into memory, and selects the 'safe max' would be a nice QoL feature for first-time users.