LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Feature request: minimum context tokens #930

Open Azirine opened 2 weeks ago

Azirine commented 2 weeks ago

When the context is full, removing the last query and trying something else requires reprocessing the entire context, wasting a lot of time and compute. This happens because the remaining context is less than "maximum context tokens".

Therefore, I propose adding a setting for "minimum context tokens". For example, if "maximum context tokens" is 8k and "minimum context tokens" is 6k, the context will begin shifting once 8k is hit, but we could also delete up to 2k tokens from the end without having to reprocess the entire context. This would be a massive speedup when the context is full, with the flexibility of being user-adjustable.
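As a rough illustration (Python-style pseudocode; `build_prompt`, `cache_start`, `min_ctx` and so on are made-up names, not existing koboldcpp settings or code), the decision I have in mind looks something like this:

```python
def build_prompt(story_tokens, cache_start, max_ctx, min_ctx):
    # The KV cache is assumed to hold a shifted window of the story that
    # starts at story_tokens[cache_start].
    remaining = len(story_tokens) - cache_start
    if remaining >= min_ctx:
        # Enough of the cached window survives: reuse the cache and only
        # trim or extend its tail, without reprocessing earlier tokens.
        return story_tokens[cache_start:]
    # Too little context remains: refill the window up to max_ctx and
    # accept a one-time full reprocess.
    return story_tokens[max(0, len(story_tokens) - max_ctx):]
```

With a maximum of 8192 and a minimum of 6144, up to 2048 tokens could be deleted from the end before a full reprocess is ever triggered.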

LostRuins commented 2 weeks ago

This can't really be done, because once a token is shifted out of the context it is lost forever. You can actually delete as many tokens as you want from the end of the context without reprocessing - the problem happens when tokens from the start of the context change.
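To illustrate the rule (just a sketch, not the actual koboldcpp code): only the longest shared prefix between the cached tokens and the newly submitted prompt can be reused, so trimming the end is free, but changing the start invalidates everything.

```python
def reusable_prefix(cached, new_prompt):
    # Only the longest shared prefix of the cache and the new prompt is reusable.
    n = 0
    for a, b in zip(cached, new_prompt):
        if a != b:
            break
        n += 1
    return n

cached = ["my", "name", "is", "bob", "and", "so", "on"]
print(reusable_prefix(cached, ["my", "name", "is", "bob"]))    # 4 -> trimmed from the end, nothing to reprocess
print(reusable_prefix(cached, ["hello", "my", "name", "is"]))  # 0 -> the start changed, full reprocess
```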

Perhaps you can also try using --smartcontext, which is an older approach but may be closer to what you want.

Azirine commented 2 weeks ago

I think there's a misunderstanding. I know that once a token is shifted out of the context it is lost forever, but that does not mean everything has to be reprocessed when the number of tokens in the context drops below the limit. It could simply accept that the context is no longer at the limit but is still more than enough to continue generation.

You cannot actually delete as many tokens as you want from the end of the context without reprocessing, even if the start of the context doesn't change, whether --smartcontext is used or not. You can try it for yourself and observe this behaviour when the context is full: it always reprocesses to recover the few extra tokens that were lost to context shifting. That is not necessary, because a few more tokens in the context have little effect on the quality of generation, yet recovering them takes a lot of time, especially when the context is large.

LostRuins commented 1 week ago

Let me give you a very simple example. Let's assume the context limit is 6 words.

Now imagine this is your story:

Hello, my name is bob. The quick brown fox jumps over the lazy

So the actual context currently contains:

brown fox jumps over the lazy

If you generate 1 more word, the context becomes:

fox jumps over the lazy dog

Now imagine you then delete the last 3 words. This makes the context that is submitted:

The quick brown fox jumps over

Since there is no match for the start part, a forced reprocess is required.
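In code terms (purely illustrative, with words standing in for tokens), the prefix check fails at the very first position:

```python
cached      = "fox jumps over the lazy dog".split()
resubmitted = "The quick brown fox jumps over".split()

# Count the leading words that still line up with the cache.
matched = 0
for a, b in zip(cached, resubmitted):
    if a != b:
        break
    matched += 1
print(matched)  # 0 -> none of the cache can be reused, so everything is reprocessed
```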

Azirine commented 1 week ago

I am suggesting a "minimum context" setting. If it were set to 3 or below, then after deleting 3 words the context submitted would be fox jumps over, eliminating the need to reprocess. If it were set to 4 or above, the remaining context of 3 words would be below the minimum, so it would go back and reprocess all 6 words: The quick brown fox jumps over.
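To make that concrete, here is a small sketch of the decision applied to your example (Python, with words standing in for tokens and punctuation dropped; `cache_start` and `min_ctx` are made-up names for illustration):

```python
story = "Hello my name is bob The quick brown fox jumps over the lazy dog".split()
cache_start = len(story) - 6           # the cache holds: fox jumps over the lazy dog

story = story[:-3]                     # delete the last 3 words; the story now ends at "over"
remaining = len(story) - cache_start   # 3 words still prefix-match the cache

for min_ctx in (3, 4):
    if remaining >= min_ctx:
        prompt = story[cache_start:]   # ['fox', 'jumps', 'over'] -> cache reused, no reprocess
    else:
        prompt = story[-6:]            # refill to the 6-word limit -> full reprocess needed
    print(min_ctx, prompt)
```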