LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Support `--ctx_size` to set context size. #150

Closed · JHawkley closed this issue 1 year ago

JHawkley commented 1 year ago

This is a feature from llama.cpp that kobold.cpp currently does not support.

The recently released Bluemoon RP model was trained for 4K context sizes; however, to make use of this in llama.cpp, you must pass the `--ctx_size 4096` flag to enable the larger context without seg-faulting.

kobold.cpp has no support for this flag, so this model cannot be used to its fullest potential. See the discussion on HuggingFace for more context.

Note: it does appear that a LLaMA model must be finetuned for 4K; otherwise, you just get gibberish. Unfortunately, this will not automagically expand the context of all LLaMA models, but it's pretty cool that a finetune can add this capability. We may see more models tuned like this in the future.

LostRuins commented 1 year ago

FYI, KoboldCpp already supports sending any context size you want by manually editing the value inside the Lite UI (it's a text input). [screenshot of the Lite context-size field]

However, looking at the llama.cpp code, they are still using fixed-size scratch buffers for the context, and they also fix the memory allocation at load time. [screenshot of the relevant llama.cpp code]

Can you confirm that this model works correctly in llama.cpp at 4096 context? If it does, then it should work in KoboldCpp too. I will see if I can get the buffers resized correctly.

Edit: I can confirm that they've added more hackery that only resizes the KV caches at load time (previously they were dynamically resized during eval). I will add a new parameter to set the max context length.
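To see why that allocation has to happen at load time: the KV cache scales linearly with context length, so doubling the context doubles the buffer. A rough back-of-the-envelope sketch (the 7B dimensions here are assumptions based on the published LLaMA architecture, and `kv_cache_bytes` is just an illustrative helper, not actual llama.cpp code):

```python
# Rough KV cache size for a LLaMA-style model: keys and values are
# each (n_ctx x n_embd) per layer, stored as f16 (2 bytes per element).
def kv_cache_bytes(n_ctx: int, n_layer: int = 32, n_embd: int = 4096,
                   bytes_per_elem: int = 2) -> int:
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem

for ctx in (2048, 4096):
    print(f"n_ctx={ctx}: {kv_cache_bytes(ctx) / 1024**2:.0f} MiB")
# n_ctx=2048: 1024 MiB
# n_ctx=4096: 2048 MiB
```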

LostRuins commented 1 year ago

Hi, this has been added as of version 1.20. You can toggle a larger context with the `--contextsize` launch parameter.

ghost commented 1 year ago

> Hi, this has been added as of version 1.20. You can toggle a larger context with the `--contextsize` launch parameter.

Hi, I am using koboldcpp on my Android device. My launch command is: `python ~/koboldcpp/koboldcpp.py ~/llama.cpp/models/RWKV-4-Raven-3B-v11-Eng99-Other1-20230425-ctx4096-ggml-q5_1.bin --threads 6 --blasbatchsize 32 --contextsize 4096`

But when I send a message through Kobold Lite, I can see `"max_context_length": 2048` being passed.

Am I missing something? This should reflect my specified context, right?

Edit: I manually edited the context through Kobold Lite to 4096 and it appears to be working.

JHawkley commented 1 year ago

Aye, you do need to both pass `--contextsize` and change the setting in Kobold Lite to get full use of the context. For others dropping in here for help: even though the slider only goes up to 2048, you can click the number and input any value you like.
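To clarify why both are needed: `max_context_length` is a per-request field in the generate payload, so the client (Kobold Lite) decides what gets sent, while the launch flag only raises the server-side ceiling. A minimal sketch of such a request against koboldcpp's KoboldAI-compatible API (assuming the default port 5001; the prompt and lengths are just placeholders):

```python
import json
import urllib.request

# What the client sends per request; Kobold Lite fills these fields
# from its own settings, which is why the UI value matters too.
payload = {
    "prompt": "Once upon a time",
    "max_context_length": 4096,  # the field that was stuck at 2048 above
    "max_length": 80,
}

req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["results"][0]["text"])
```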

A roommate and I have tested it, and it works great! I guess I'll close this issue. Thanks for putting it in so quickly.