aleksusklim opened 9 months ago
The issue is not conceptual; rather, it lies in the way the ring buffer is implemented. Llamacpp does not have this issue because it never allows users to manually remove tokens from the middle of the context. Although I suppose if you use it with n_keep, it may have the same issue too.
Why? Isn't it better to always fall back to the server value? What is the point of having a smaller context size in Lite?
There are situations where you don't need the full context; you can reduce it to allow unwanted early parts of the story to be truncated away.
Can it do that twice? Or as many times as needed, fitting each minibatch into the next contiguous stride?
Sure, reduce your BLAS batch and predict fewer tokens at once.
In reality, I still don't understand what the model "feels" when its memory is shifted. Does it "see" the gap? Imagine a context of 16k with 1k of memory and 3k of active history at the end: will the model "understand" that there was 12k of "something" it can no longer comprehend, or would it see just 4k as a direct concatenation?
Whatever is shifted out is gone. There is no memory of it. You can see what the context contains by running it with --debugmode
If you are matching your cached context against the new user prompt, then it should not matter whether the prompt was truncated (from the beginning) or not: because even if it is truncated, but a true match is found, then your code should behave just as if it had truncated that on its own. Why is it different?
When you shift tokens out of the context, you create "holes" inside the KV cache which later get filled with new data. These holes have to be contiguous when processing a large batch. I suppose it is possible to trigger this issue automatically too, but by default lite is designed to shift out the same number of new tokens that it attempts to generate, so the batch size should be the same size as the gap from any removed tokens, letting it fit fine.
A better explanation would probably require looking through the shifting code here https://github.com/LostRuins/koboldcpp/blob/concedo/gpttype_adapter.cpp#L593
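To make the gap-versus-batch relationship concrete, here is a toy sketch (a hypothetical simplified model of a KV cache as a flat list, not koboldcpp's actual data structures): shifting out N tokens leaves an N-slot hole, and a new batch slots in cleanly only if it is no larger than that hole.

```python
# Simplified model of a KV cache with a mid-context "hole" left by a shift.
# This is an illustrative assumption, not koboldcpp's real implementation.

def shift_context(cache, keep, drop):
    """Remove `drop` tokens after the first `keep` tokens, leaving a gap."""
    remaining = cache[:keep] + cache[keep + drop:]
    return remaining, drop  # surviving tokens, size of the contiguous gap

def can_batch(gap_size, batch_size):
    # A new batch fits into the cache without fragmenting it only if it
    # is no larger than the contiguous gap left by the shifted-out tokens.
    return batch_size <= gap_size

cache = list(range(16))                      # token slots 0..15
cache, gap = shift_context(cache, keep=4, drop=8)
assert gap == 8
assert can_batch(gap, 8)                     # batch equal to the gap fits
assert not can_batch(gap, 512)               # a large BLAS batch does not
```

This is why generating the same number of tokens that were shifted out works fine, while a large BLAS batch overflowing the hole triggers the error.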
by default lite is designed to shift out the same number of new tokens that it attempts to generate, so the batch size should be the same size as the gap from any removed tokens
Oh, interesting. Can this also be a reason why it is dangerous to edit something above the last turn?
For example, with a low "amount to predict" (e.g. 128), when I edit "one turn above", the total number of discarded tokens would be larger (maybe 200), but the prediction batch is again 128. Wait, no, it should just shift 128 and be fine again…
Whatever is shifted out is gone. There is no memory of it.
But the rotary position embeddings would be different for "text that had a gap" compared to its clean version (made by direct concatenation of the "memory" and "the visible end of the history"), yeah? Or does that not matter much for the model's reasoning? (That is, the model "doesn't care" how the older visible text was produced, with or without access to the deleted parts, since now it sees only this visible text.)
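My understanding (an assumption based on how llama.cpp's KV-cache sequence shift works, not confirmed for koboldcpp specifically) is that the surviving tokens after the gap have their positions shifted down, so the model sees one contiguous sequence, identical in positions to a direct concatenation. A toy sketch:

```python
# Toy sketch of position renumbering after a context shift. The mechanism
# (applying a position delta to tokens past the gap) is an assumption
# modeled on llama.cpp's KV-cache sequence shifting.

def positions_after_shift(n_tokens, keep, drop):
    """RoPE positions the model 'sees' after dropping `drop` tokens past `keep`."""
    surviving = list(range(keep)) + list(range(keep + drop, n_tokens))
    # Tokens after the gap get their positions decreased by `drop`,
    # closing the hole as far as the attention math is concerned.
    return [p if p < keep else p - drop for p in surviving]

# 16 tokens, keep the first 4, drop the next 8: the 8 survivors end up
# at positions 0..7, exactly as if the kept parts were concatenated.
assert positions_after_shift(16, keep=4, drop=8) == list(range(8))
```

So as far as the positions go, the shifted context and the clean concatenation should look the same to the model.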
Sure, reduce your BLAS batch and predict fewer tokens at once.
If I set BLAS Batch Size to "Don't Batch BLAS", should this work around the problem?
I've tried adding --blasbatchsize -1 to my last example above, and it looks like it is not erroring anymore (the token counter updates every 8 tokens).
Also, there are no more regenerations from scratch after I raised my threshold for cutting text to be higher than the real context size (5000 versus 4096).
Hm-m, would it be possible to make the batch size adapt automatically to the largest hole in the cache?
Alternatively, maybe you should just disable BLAS batching completely after the first context shift (enabling it again whenever a full re-evaluation happens for any reason)? Batching is very important during initial prompt ingestion, but not so much when the user is playing turn-by-turn.
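Either idea could be expressed as a tiny heuristic. The sketch below is purely hypothetical (neither option exists in koboldcpp as far as this thread establishes); it just shows the two proposed policies side by side:

```python
# Hypothetical batch-size policy after a context shift. Both branches are
# proposals from the discussion, not existing koboldcpp behavior.

def effective_batch(requested, largest_hole, shifted_recently,
                    disable_after_shift=False):
    """Pick the processing batch size given the cache state."""
    if shifted_recently:
        if disable_after_shift:
            return 1                       # option 1: no batching after a shift
        return min(requested, max(largest_hole, 1))  # option 2: fit the hole
    return requested                       # full BLAS batch during ingestion

assert effective_batch(512, largest_hole=128, shifted_recently=True) == 128
assert effective_batch(512, largest_hole=128, shifted_recently=False) == 512
assert effective_batch(512, 128, True, disable_after_shift=True) == 1
```

Option 2 keeps some batching benefit while guaranteeing the batch never overflows the gap; option 1 is simpler but pays per-token processing cost on every turn after the first shift.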
lite is designed to shift out the same number of new tokens that it attempts to generate
STILL! (I wanted to say something here, but I made further experiments, and now my results do not seem right again!)
Remember the command that I showed? I put -1 for the batch size, done.
Now remember my initial prompt with numbers, right from my previous logs?
I send it and see 4080 tokens.
Then I send it again, and I see 1 tokens.
Then I remove the last two turns from the end:
…USER: 61313977567118212\nMODEL: 455090158550265\nUSER: 286579237057171\nMODEL:
↓
…USER: 61313977567118212\nMODEL:
And I send it: 4042 tokens.
It regenerated! Why!? ContextShifting hadn't even kicked in!
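For reference, here is the prefix-reuse behavior I would expect, as a toy sketch of the general technique (not koboldcpp's actual matching code): if the new prompt is a pure prefix of the cached context, as in my truncation experiment, the whole thing should be reusable, and only an edit inside the history should force re-evaluation from the point of divergence.

```python
# Toy longest-common-prefix cache reuse. This sketches the general
# technique, not koboldcpp's actual context-matching implementation.

def reuse_point(cached_tokens, new_tokens):
    """Return how many leading tokens of the new prompt match the cache."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5, 6]

# Removing turns from the end: the new prompt is a pure prefix,
# so everything should be reused and nothing re-evaluated.
assert reuse_point(cached, cached[:4]) == 4

# Editing a few turns above the end: only tokens past the divergence
# point need re-evaluation.
assert reuse_point(cached, [1, 2, 3, 9, 9]) == 3
```

By this logic, my truncated prompt should have been a full cache hit, which makes the 4042-token regeneration all the more puzzling.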
I mean the context of the "current" generation, the one that makes regenerating the latest action, and minor edits to the history, super-fast.
I propose adding an option/param that points to a local binary file. If it exists, koboldcpp should read the context from it. While this option is active, koboldcpp should update this file with each new context after generation (read once at start, rewrite/append during runtime).
I can name two main reasons why this would be extremely useful:
Yes, I understand that with any accidental move I can easily destroy the cache (loading the wrong history, adding a space at the start of the text, etc.), which would ultimately lead to a full regeneration. But! If I play locally and alone, that would be only my own fault. With that avoided, I could restart my system anytime and continue playing instantly later.
Also, for the use-case of big story templates, you might add an additional option for this context to be read-only, as in https://github.com/ggerganov/llama.cpp/pull/1640
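The loop I have in mind is simple. In this hypothetical sketch, pickle stands in for whatever state-serialization API the backend actually exposes (llama.cpp already has session files along these lines via its prompt-cache feature); the file name and structure are my own illustrative choices:

```python
import os
import pickle
import tempfile

# Hypothetical load-once / save-after-each-turn persistence loop.
# The pickle calls are a stand-in for a real KV-cache serialization API.
STATE_FILE = os.path.join(tempfile.gettempdir(), "kobold_context.bin")

def load_state():
    """Read the saved context once at startup, if the file exists."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, "rb") as f:
            return pickle.load(f)
    return {"tokens": []}          # fresh session: empty context

def save_state(state):
    """Rewrite the file after every generation."""
    with open(STATE_FILE, "wb") as f:
        pickle.dump(state, f)

state = load_state()
state["tokens"] += [101, 102]      # tokens produced this turn
save_state(state)
assert load_state()["tokens"][-2:] == [101, 102]
```

A read-only mode would simply skip save_state, which covers the story-template use-case: many sessions could start from the same pre-evaluated context file without ever mutating it.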