LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

--contextsize 512 causing crashes? #229

Closed. cgessai closed this issue 1 year ago.

cgessai commented 1 year ago

Latest version, 1.29. Adding "--contextsize 512" to the command line frequently leads to crashes during longer outputs (Instruct mode, at least).

To reproduce:

  1. In app, click New Game button and Accept button
  2. Click Scenarios, New Instruct, and Confirm button
  3. Click Settings button, change "Amount to Gen" to 512 and click OK button
  4. Enter a prompt such as "Write 40 long paragraphs about the history of England." and hit Submit

RESULT: While it doesn't happen 100% of the time, KoboldCPP will frequently error out after successful prompt ingestion and after some substantial portion of tokens has been generated. In the UI, a 'failed to fetch' error dialog pops up. In the CMD window, the same exception message is produced each time, apart from the values in the red boxes, which sometimes differ. The only way to continue is to restart the whole program.

I've reproduced this with both Wizard-Vicuna-7B-Uncensored Q4_0 and Wizard-Vicuna-13B-Uncensored Q5_1.

Testing URL was always: http://10.0.0.155:5001/?streaming=1#

CMD argument for launch (7B and 13B were similar) was:

koboldcpp1.29.exe --stream --smartcontext --host 10.0.0.155 --threads 6 --useclblast 0 0 --launch --model "G:\GPT2-MODELS\TheBloke.Wizard-Vicuna-7B-Uncensored-GGML\Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin" --unbantokens --blasbatchsize -1 --gpulayers 24 --contextsize 512

The only change to the command line args during testing was --contextsize 512. When it was not present, the error was not observed.
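For reference, the same request can be fired outside the Lite UI to rule the browser out. Below is a rough sketch, assuming the KoboldAI-compatible /api/v1/generate endpoint that KoboldCpp serves; the host, prompt, and 512-token settings mirror the report above.

```python
# repro_contextsize.py - try to reproduce the crash without the Lite UI.
# Assumes the KoboldAI-compatible /api/v1/generate endpoint; adjust the URL
# or field names if your build differs.
import requests

API_URL = "http://10.0.0.155:5001/api/v1/generate"

payload = {
    "prompt": "Write 40 long paragraphs about the history of England.",
    "max_length": 512,          # "Amount to Gen" in the UI
    "max_context_length": 512,  # mirrors --contextsize 512
}

for attempt in range(5):
    try:
        r = requests.post(API_URL, json=payload, timeout=600)
        r.raise_for_status()
        text = r.json()["results"][0]["text"]
        print(f"attempt {attempt}: received {len(text)} characters")
    except Exception as exc:
        # A dropped connection here corresponds to the 'failed to fetch'
        # dialog seen in the browser UI.
        print(f"attempt {attempt}: request failed: {exc}")
        break
```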

LostRuins commented 1 year ago

Why do you want to reduce your context size to only 512? Contextsize affects the amount of memory allocated. Lowering it from the default of 2048 is not recommended.
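For a sense of scale, contextsize feeds directly into the KV cache allocation, which grows linearly with context length. A back-of-the-envelope sketch (the layer counts and embedding widths are the standard LLaMA 7B/13B figures; the fp16 cache element size is an assumption):

```python
# Rough KV-cache estimate: 2 tensors (K and V) * layers * context length
# * embedding width * bytes per element. Figures are approximate.
def kv_cache_mib(n_layers, n_ctx, n_embd, bytes_per_elem=2):
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem / (1024 ** 2)

for name, layers, embd in [("7B", 32, 4096), ("13B", 40, 5120)]:
    for ctx in (512, 2048):
        print(f"{name} at ctx {ctx}: ~{kv_cache_mib(layers, ctx, embd):.0f} MiB")
```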

cgessai commented 1 year ago

It was another attempt to deal with unexpectedly running out of VRAM when --useclblast-assisted prompt ingestion 'bumps' up its VRAM usage on a larger prompt, but without turning off accelerated prompt ingestion entirely via --blasbatchsize -1.

I noticed a pattern: load the program, find the maximum number of gpulayers that can be offloaded for a new model, use the program for a while, and then unexpectedly run into an out-of-VRAM error because a prompt went over X tokens. Since there's no way (AFAIK) to tell in advance how many tokens a prompt is before hitting Submit, this was an attempt to 'lock' the context and attack the problem without having to run every model at reduced --gpulayers just in case I accidentally went over the number of 'safe' prompt tokens. --help does list 512 as one of the acceptable values for --contextsize, so I figured I'd try it.

AFAIK, there's no way to assign the prompt ingestion portion of --useclblast to one GPU while sending the --gpulayers to another; that would fix this issue. Alternatively, a command line arg that refuses to process a prompt over 512 tokens (i.e. "Your prompt is over 512 tokens, please shorten it and hit Submit again") would also help, rather than my giving up what amounts to a few hundred megs of offloaded GPU layers.
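In the meantime, a crude client-side guard can approximate that behaviour. The sketch below is only illustrative: the 4-characters-per-token ratio is a rough assumption rather than a real tokenizer count, and the endpoint is the same KoboldAI-compatible one assumed earlier.

```python
# prompt_guard.py - refuse to submit prompts that look too long.
# The chars-per-token ratio is a crude heuristic, NOT a real tokenizer count;
# an exact check would need to tokenize with the model's own vocabulary.
import requests

API_URL = "http://10.0.0.155:5001/api/v1/generate"
MAX_PROMPT_TOKENS = 512
CHARS_PER_TOKEN = 4  # rough English-text average, assumption only

def estimated_tokens(prompt):
    return len(prompt) // CHARS_PER_TOKEN + 1

def send_prompt(prompt, max_length=512):
    est = estimated_tokens(prompt)
    if est > MAX_PROMPT_TOKENS:
        raise ValueError(
            f"Prompt is roughly {est} tokens (limit {MAX_PROMPT_TOKENS}); "
            "shorten it and hit Submit again."
        )
    payload = {"prompt": prompt, "max_length": max_length}
    return requests.post(API_URL, json=payload, timeout=600).json()
```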

LostRuins commented 1 year ago

That is because prompt processing also requires some VRAM of its own, apart from the offloaded layers. To get a safe estimate, you should run a test prompt at max context and see whether the number of layers you picked still leaves enough memory for that.
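One way to run that test without typing out a huge prompt by hand is to pad one out programmatically; a sketch below, again assuming the /api/v1/generate endpoint and a rough 4-characters-per-token estimate for the filler.

```python
# stress_test.py - send one near-max-context prompt up front, so any
# prompt-processing VRAM spike shows up immediately rather than mid-session.
import requests

API_URL = "http://10.0.0.155:5001/api/v1/generate"
MAX_CONTEXT = 2048
RESERVED_FOR_GEN = 256

# Pad the prompt with filler; ~4 chars per token is a rough assumption.
sentence = "The quick brown fox jumps over the lazy dog. "
filler = sentence * ((MAX_CONTEXT - RESERVED_FOR_GEN) * 4 // len(sentence))

payload = {
    "prompt": filler + "\nSummarise the text above.",
    "max_length": RESERVED_FOR_GEN,
    "max_context_length": MAX_CONTEXT,
}

r = requests.post(API_URL, json=payload, timeout=600)
print("status:", r.status_code)
# If this request runs out of VRAM, lower --gpulayers and try again until
# it succeeds; that layer count should then be safe for normal use.
```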

DocShotgun commented 1 year ago

Can't you just reduce the max tokens within the settings panel of the UI itself (or within the UI of SillyTavern or whichever frontend you use to access it via the API), rather than limiting it from the command line?

IIUC the --contextsize parameter is meant for increasing the max context beyond the default 2048 for models that support longer context.

Also, I'm of the opinion that you're better off just offloading fewer GPU layers so that you don't OOM during prompt processing. I've found through trial and error that I can fit 31 layers of a 13B q4_0 into my 8 GB of VRAM and not OOM during prompt ingestion.
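That trial and error can also be scripted. The sketch below is a rough outline only: it launches the executable with flags taken from the report above, polls until the server answers, fires one near-max-context prompt, and steps the layer count down on failure. The timeouts, starting layer count, and readiness check are all assumptions to adjust for your setup.

```python
# find_layers.py - rough outline of automating the gpulayers trial and error.
import subprocess
import time
import requests

BASE_URL = "http://10.0.0.155:5001"
MODEL = r"G:\GPT2-MODELS\TheBloke.Wizard-Vicuna-7B-Uncensored-GGML\Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin"

def wait_until_ready(timeout=180):
    # Simplistic readiness check: poll the web UI until it answers.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            requests.get(BASE_URL, timeout=5)
            return True
        except requests.RequestException:
            time.sleep(2)
    return False

def survives_max_context():
    prompt = "word " * 1800  # roughly fills a 2048-token context
    try:
        r = requests.post(f"{BASE_URL}/api/v1/generate",
                          json={"prompt": prompt, "max_length": 200},
                          timeout=600)
        return r.ok
    except requests.RequestException:
        return False

for layers in range(30, 0, -2):
    proc = subprocess.Popen([
        "koboldcpp1.29.exe", "--useclblast", "0", "0",
        "--host", "10.0.0.155", "--model", MODEL,
        "--gpulayers", str(layers),
    ])
    ok = wait_until_ready() and survives_max_context()
    proc.terminate()
    if ok:
        print(f"{layers} layers survived a max-context prompt")
        break
```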

cgessai commented 1 year ago

Thanks for the responses and ideas. I don't have a link, but last night I saw that a "setting to limit prompt tokens to X" feature request had already been logged against llamacpp, so maybe it'll eventually come from upstream.