LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

[Enhancement] (v1.6.3) - 128k context, please. #804

Open SabinStargem opened 2 months ago

SabinStargem commented 2 months ago

I have been roleplaying with CommandR+, a 104B model. So far, I am using 40,000 out of 65,000 context with KoboldCPP. Considering how lucid this model has been, I expect to hit vanilla Kobold's context limit soon. If models are becoming this reliable with long context, then it might be time to add support for a bigger size.

LostRuins commented 2 months ago

Max Context Size has actually been increased for some time already. 128k should work, assuming the rope settings are correctly configured.
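For example, assuming the current CLI flags, a manual rope override looks roughly like this (yourmodel.gguf and the two rope values, frequency scale and frequency base, are placeholders rather than recommended settings):

```
python koboldcpp.py --model yourmodel.gguf --contextsize 131072 --ropeconfig 0.25 10000
```

If --ropeconfig is omitted, koboldcpp should attempt to derive rope scaling automatically from the requested context size.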

Vladonai commented 2 months ago

Without optimization, it's completely meaningless. Even an 8k context for a 70B model cuts the (already low) generation speed by more than half. Maybe the Command-R family of models does better, but there's another problem there: huge memory requirements to handle the context...

SabinStargem commented 2 months ago

With KoboldCPP set to 65k context, the Windows hardware monitor shows about 80 out of my PC's 128 gigs of DDR4 3600 RAM in use. This implies that it is becoming practical for a serious gaming rig to run a 104B at big context. It is slow, but doable.

Processing Prompt [BLAS] (109 / 109 tokens) Generating (52 / 2048 tokens) (EOS token triggered!) CtxLimit: 41420/65536, Process:5.91s (54.2ms/T = 18.43T/s), Generate:199.18s (3830.3ms/T = 0.26T/s), Total:205.09s (0.25T/s)

In any case, the slider for setting context within the Kobold launcher only goes up to 65k. That is why I asked for this enhancement: trying to figure out the esoteric art of RoPE is something I would rather not deal with. While I probably could eventually figure it out, I would rather just be a casual end user.

LostRuins commented 2 months ago

Hmm I understand, though at the moment I think most casual users don't have a use case for going past 65k ctx. If you need to, it's easy to just use the CLI and run with --contextsize 128000 for example and that should work.

If ultra long context does become a norm in the future then I will add additional options to the gui.
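i.e. something like this, where yourmodel.gguf stands in for whatever model file you are loading:

```
python koboldcpp.py --model yourmodel.gguf --contextsize 128000
```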

v3ss0n commented 2 months ago

It's become a norm, with Phi-3 even supporting 128k. Gonna experiment with the CLI option, but built-in support would be nicer. Thanks.

LostRuins commented 2 months ago

Very well, I can add an option...

v3ss0n commented 2 months ago

That's cool, thanks a lot.

LostRuins commented 2 months ago

Hi, should be added in the latest version.

v3ss0n commented 2 months ago

You Are Awesome!

mercurial-moon commented 1 month ago

> With KoboldCPP set to 65k context, the Windows hardware monitor shows about 80 out of my PC's 128 gigs of DDR4 3600 RAM in use. This implies that it is becoming practical for a serious gaming rig to run a 104B at big context. It is slow, but doable.
>
> Processing Prompt [BLAS] (109 / 109 tokens) Generating (52 / 2048 tokens) (EOS token triggered!) CtxLimit: 41420/65536, Process:5.91s (54.2ms/T = 18.43T/s), Generate:199.18s (3830.3ms/T = 0.26T/s), Total:205.09s (0.25T/s)
>
> In any case, the slider for setting context within the Kobold launcher only goes up to 65k. That is why I asked for this enhancement: trying to figure out the esoteric art of RoPE is something I would rather not deal with. While I probably could eventually figure it out, I would rather just be a casual end user.

@SabinStargem it would be interesting to know what level of quantization your model is at; 104B is pretty intense for a consumer-grade PC to run. Your memory usage suggests you are using something close to a 4 or 5 bit quantization. Are you offloading some layers to the GPU? What GPU are you running on, an RTX 4090? Asking because I tried running a 70B model and it ground my PC to a halt.
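(By offloading I mean the --gpulayers flag, i.e. roughly a launch like the sketch below; the model file and layer count are placeholders to adjust for your own VRAM, not tested values.)

```
python koboldcpp.py --model yourmodel-Q4_K_M.gguf --contextsize 65536 --gpulayers 20
```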

win10ogod commented 1 month ago

> Hmm I understand, though at the moment I think most casual users don't have a use case for going past 65k ctx. If you need to, it's easy to just use the CLI and run with --contextsize 128000 for example and that should work.
>
> If ultra long context does become a norm in the future then I will add additional options to the gui.

It would be great if the KV cache could be kept in RAM, with a method similar to PagedAttention used to move it between VRAM and RAM while keeping generation speed reasonable.