Closed — cgessai closed this issue 1 year ago
Why do you want to reduce your context size to only 512? Context size affects the amount of memory allocated. Lowering it from the default of 2048 is not recommended.
It was another attempt to deal with unexpectedly running out of VRAM when --useclblast-assisted prompt ingestion 'bumps up' its VRAM usage on a larger prompt, but without turning off accelerated prompt ingestion entirely via --blasbatchsize -1.
I noticed a pattern: load the program, find the maximum number of --gpulayers that can be offloaded for a new model, use the program for a while, then unexpectedly hit an out-of-VRAM error because a prompt went over X tokens. Since there's no way (AFAIK) to preemptively tell how many tokens a prompt is before hitting Submit, this was an attempt to 'lock' the context and attack the problem without having to run every model at reduced --gpulayers just in case I accidentally go over the number of 'safe' prompt tokens. --help does list 512 as one of the acceptable values for --contextsize, so I figured I'd try it.
AFAIK, there's no way I can assign the prompt-ingestion portion of --useclblast to one GPU while sending the --gpulayers to another; that would fix this issue. Failing that, a command-line arg that refuses to process a prompt over 512 tokens (i.e. "Your prompt is over 512 tokens, please shorten it, then hit Submit again") would beat giving up what amounts to a few hundred megs of offloaded GPU layers.
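Until something like that exists as a flag, the same refusal can be imitated on the client side: estimate the token count and refuse to forward the request if it's over budget. A sketch — the ~4 chars/token estimate is an assumption, and `send_to_backend` is a placeholder for whatever call actually hits the API:

```python
TOKEN_LIMIT = 512

def guarded_submit(prompt: str, send_to_backend):
    """Refuse to forward prompts that likely exceed TOKEN_LIMIT.

    Uses a crude ~4 chars/token estimate; send_to_backend is a
    stand-in for the real API call.
    """
    est = len(prompt) // 4
    if est > TOKEN_LIMIT:
        raise ValueError(
            f"Your prompt is ~{est} tokens (limit {TOKEN_LIMIT}); "
            "please shorten it, then hit Submit again."
        )
    return send_to_backend(prompt)
```

That way the generation request never reaches the backend when it would push prompt ingestion past the VRAM headroom.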
That is because prompt processing also requires some VRAM on the GPU, apart from the offloaded layers. To get a safe estimate, run a test prompt at max context and see whether the number of layers you picked leaves enough memory for it.
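One way to run that test deterministically is to build a filler prompt guaranteed to reach the context limit. A minimal sketch — it assumes roughly one token per short repeated word with LLaMA-style tokenizers, which slightly overshoots, and overshooting is what you want for a worst-case probe:

```python
def filler_prompt(target_tokens: int) -> str:
    """Build a prompt of at least target_tokens tokens.

    Repeating a short common word yields roughly one token per
    word with LLaMA-style tokenizers, so this overshoots a bit --
    exactly right for probing worst-case VRAM usage.
    """
    return " ".join(["the"] * target_tokens)

probe = filler_prompt(2048)  # full context at the default --contextsize
```

Paste the probe in once after picking a layer count; if it survives ingestion without an OOM, normal use at that context should too.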
Can't you just reduce the max tokens within the settings panel of the UI itself (or within the UI of SillyTavern or another frontend accessing it via the API) rather than limiting it from the command line?
IIUC the --contextsize parameter is meant for increasing the max context beyond the default 2048 for models that support longer context.
Also, I'm of the opinion that you're better off just offloading fewer GPU layers so that you don't OOM during prompt processing. I've found through trial and error that I can get 31 layers of a 13B q4_0 onto my 8 GB of VRAM and not OOM during prompt ingestion.
Thanks for the responses and ideas. I don't have a link, but last night I saw that a "setting to limit prompt tokens to X" feature request was already logged against llama.cpp, so maybe it'll eventually come from upstream.
Latest version, 1.29. Including "--contextsize 512" on the command line frequently leads to crashes during longer outputs (Instruct, at least).
To reproduce:
RESULT: While it doesn't happen 100% of the time, KoboldCPP will frequently error out after successful prompt ingestion and after some substantial portion of generated tokens. In the UI, a 'failed to fetch' error dialog box pops up. In the CMD window, the same exception message is produced, except for the items in the red boxes, which sometimes show different values. The only way to continue is to restart the whole program.
I've reproduced this in both Wizard-Vicuna-7B-Uncensored Q4_0 and Wizard-Vicuna-13B-Uncensored Q5_1.
Testing URL was always: http://10.0.0.155:5001/?streaming=1#
CMD argument for launch (7B and 13B were similar) was:
koboldcpp1.29.exe --stream --smartcontext --host 10.0.0.155 --threads 6 --useclblast 0 0 --launch --model "G:\GPT2-MODELS\TheBloke.Wizard-Vicuna-7B-Uncensored-GGML\Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin" --unbantokens --blasbatchsize -1 --gpulayers 24 --contextsize 512
The only change to the cmd-line args during testing was --contextsize 512; when it was not present, the error was not observed.