SabinStargem opened 2 months ago
Max Context Size has actually been increased for some time already. 128k should work, assuming the rope settings are correctly configured.
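For anyone curious what "correctly configured" means here, the most common approach is linear RoPE scaling, where positions are compressed so a model trained at a shorter context can attend over a longer one. A minimal sketch of the idea (illustrative only; this is not necessarily KoboldCPP's exact auto-rope heuristic):

```python
# Sketch of linear RoPE scaling: positions are compressed by
# train_ctx / target_ctx so a model trained at train_ctx can
# cover target_ctx positions. (Illustrative; not KoboldCPP's
# actual auto-rope heuristic.)
def rope_freq_scale(train_ctx: int, target_ctx: int) -> float:
    """Frequency scale factor for linear RoPE context extension."""
    return min(1.0, train_ctx / target_ctx)

# Extending a 4k-trained model to a 16k window:
print(rope_freq_scale(4096, 16384))  # 0.25
```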
Without optimization, it's completely meaningless. Even an 8k context for a 70B model cuts the (already low) generation speed by more than half. Maybe the Command-R family of models does better, but there's another problem there: huge memory requirements to handle the context...
The Windows hardware monitor shows about 80 of my PC's 128 GB of DDR4-3600 RAM in use while KoboldCPP is set to 65k context. This implies it is becoming practical for a serious gaming rig to run a 104B model at big context. It is slow, but doable.
Processing Prompt [BLAS] (109 / 109 tokens) Generating (52 / 2048 tokens) (EOS token triggered!) CtxLimit: 41420/65536, Process:5.91s (54.2ms/T = 18.43T/s), Generate:199.18s (3830.3ms/T = 0.26T/s), Total:205.09s (0.25T/s)
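The rates in that log check out; note that the "Total" figure counts only generated tokens over the whole run, which is why it is lower than the generation rate alone:

```python
# Rechecking the arithmetic in the log above (small rounding
# differences vs. the log come from its rounded ms/T values).
prompt_tokens, prompt_s = 109, 5.91
gen_tokens, gen_s = 52, 199.18

print(round(prompt_tokens / prompt_s, 2))         # prompt processing T/s
print(round(gen_tokens / gen_s, 2))               # generation T/s
print(round(gen_tokens / (prompt_s + gen_s), 2))  # total: generated tokens / total time
```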
In any case, the slider for setting context within the Kobold launcher only goes up to 65k. That is why I asked for the enhancement: trying to figure out the esoteric art of RoPE is something I want to forget. While I could probably figure it out eventually, I would rather just be a casual end user.
Hmm, I understand, though at the moment I think most casual users don't have a use case for going past 65k ctx. If you need to, it's easy to just use the CLI and run with --contextsize 128000, for example, and that should work.
If ultra long context does become a norm in the future then I will add additional options to the gui.
It's become a norm, with Phi-3 even supporting 128k. Gonna experiment with the CLI option, but a built-in option would be nicer. Thanks.
Very well, I can add an option...
That's cool, thanks a lot.
Hi, this should be added in the latest version.
You Are Awesome!
@SabinStargem it would be interesting to know what level of quantization your model uses; 104B is pretty intense for a consumer-grade PC to run. Your memory usage suggests you are using somewhere close to a 4- to 5-bit quantization. Are you offloading some layers to a GPU? What GPU are you running, an RTX 4090? I ask because I tried running a 70B model and it ground my PC to a halt.
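As a rough sanity check on the memory question, a back-of-the-envelope estimate of weight storage alone is just parameter count times bits per weight; the numbers below are illustrative assumptions, and KV cache plus runtime overhead come on top:

```python
# Back-of-the-envelope weight memory for a quantized model.
# Ignores KV cache, activations, and runtime overhead.
def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

# A 104B model at an assumed ~4.5 bits/weight:
print(round(weight_gib(104, 4.5), 1))  # ~54.5 GiB for weights alone
```

That leaves the remaining observed RAM usage for the KV cache at 65k context plus overhead, which is consistent with a 4-5 bit quant.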
It would be great if the KV cache could be kept in RAM, with a method similar to PagedAttention used to maintain sufficient generation speed between VRAM and RAM.
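The PagedAttention idea, roughly: the KV cache is split into fixed-size blocks addressed through a per-sequence block table, so blocks need not be contiguous and could in principle live in either VRAM or RAM. A minimal sketch of just the bookkeeping (names and block size are my own, not from any real implementation):

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative choice)

class PagedKV:
    """Toy PagedAttention-style bookkeeping: a free pool of physical
    blocks and a per-sequence block table mapping logical block
    indices to physical block ids."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.table: list[int] = []           # logical block -> physical block
        self.tokens = 0                      # tokens stored so far

    def append_token(self) -> int:
        """Account for one new token, allocating a fresh block when the
        current one is full; returns the physical block holding it."""
        if self.tokens % BLOCK_SIZE == 0:
            self.table.append(self.free.pop(0))
        block = self.table[self.tokens // BLOCK_SIZE]
        self.tokens += 1
        return block

kv = PagedKV(num_blocks=8)
blocks = [kv.append_token() for _ in range(40)]
print(len(set(blocks)))  # 40 tokens at 16/block -> 3 blocks used
```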
I have been roleplaying with Command R+, a 104B model. So far, I am using 40,000 out of 65,000 context with KoboldCPP. Considering that this model has stayed lucid so far, I expect to hit the context limit of vanilla Kobold soon. If models are becoming that reliable with long context, it might be time to add support for a bigger size.