LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Removing a single token from an overflow context will cause the entire context to be reprocessed #725

Open · jojorne opened this issue 7 months ago

jojorne commented 7 months ago

I'm having a lot of fun with KoboldCpp. I can generate and edit text. It's very fast until the context overflows with tokens.

Steps to reproduce:

After some analysis, I came to the conclusion that KoboldCpp sees the freed space and tries to fit as many tokens as possible into it. This causes the entire context to shift and be reprocessed. Here, take a look: this fits as many tokens as possible from the beginning of the story now that we have free space:

// Rebuild the full story text, then keep only the most recent
// max_allowed_characters. The cut point moves backwards whenever
// characters are freed, pulling older text back into the window.
// (Note: substring() does not mutate, so its result must be assigned.)
let truncated_context = concat_gametext(true, "");
truncated_context = truncated_context.substring(truncated_context.length - max_allowed_characters);

Since nothing prevents the context from shifting backwards, the entire context becomes invalid and has to be reprocessed. There are many places where things like this happen. Note that this doesn't occur when the context is not overflowing, because then there is no truncated text to shift the window backwards and invalidate the entire context.
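
To make the failure mode concrete, here is a minimal sketch of the longest-common-prefix reuse that backends typically perform (the function and token values are hypothetical illustrations, not KoboldCpp's actual code). The moment older text is prepended, the prompts diverge at position 0 and nothing cached can be reused:

// Hypothetical illustration: count how many cached tokens carry over
// from the previous prompt to the new one. Reuse stops at the first
// mismatch, so prepending older text invalidates the whole cache.
function reusableTokens(prevTokens, newTokens) {
    let n = 0;
    const limit = Math.min(prevTokens.length, newTokens.length);
    while (n < limit && prevTokens[n] === newTokens[n]) {
        n++;
    }
    return n; // tokens 0..n-1 come from cache; the rest are reprocessed
}

// Deleting a token near the end keeps a long shared prefix:
reusableTokens([1, 2, 3, 4, 5], [1, 2, 3, 5]);       // -> 3
// Pulling older text back in front shifts everything:
reusableTokens([1, 2, 3, 4, 5], [9, 1, 2, 3, 4, 5]); // -> 0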

LostRuins commented 7 months ago

That's natural, because the UI has no idea about your intentions; it just truncates the text to fit the maximum allowance.

Yes, for a single token, backtracking the entire context is wasteful. However, what if you chose to remove 20 tokens instead? Or 2000? At a certain point, you'd want the old context to be pulled back into the text.
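
That trade-off could be expressed as a threshold: keep the cut point pinned for small edits, and only move it backwards when the freed space is large enough to justify a full reprocess. A rough sketch of the idea, with hypothetical names (not a patch against Lite):

// Hypothetical sketch: reuse the previous cut offset unless the freed
// space exceeds some fraction of the budget, trading a little unused
// context for cache stability.
const REFILL_THRESHOLD = 0.25; // refill only when >25% of the budget is free

function truncationOffset(fullLength, maxAllowed, prevOffset) {
    const minOffset = Math.max(0, fullLength - maxAllowed); // tightest possible fit
    if (prevOffset - minOffset > maxAllowed * REFILL_THRESHOLD) {
        return minOffset; // big deletion: pull old text back in, accept one reprocess
    }
    return Math.max(minOffset, prevOffset); // small edit: keep the window where it was
}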

jojorne commented 7 months ago

I was tinkering with it, and first I created a page concept, but when I turned the page, the AI suddenly lost its memory because all the context was gone. Then I compared it to a video game camera: you generally don't want to copy the player's exact position and rotation. Imagine the player climbing a staircase quickly, and think about what would happen to the camera. So it's like you said: there would have to be a certain percentage of free buffer left after adding the memory and author's note, maybe one third? The problem is that only Kobold Lite would have this support; other clients like SillyTavern would be without it. That's why I decided to open a ticket. Maybe someone will come up with a better idea?
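
For what it's worth, the camera analogy maps onto a damped window with slack: when a retruncation is unavoidable, cut deeper than strictly necessary so the window doesn't have to move again on every new token. A sketch of that one-third-free idea, again with hypothetical names:

// Hypothetical sketch of the "free buffer" idea: when forced to
// retruncate, leave roughly a third of the budget free so new text
// accumulates for a while before the window has to move again.
const FREE_FRACTION = 1 / 3;

function retruncate(fullLength, maxAllowed, prevOffset) {
    const minOffset = Math.max(0, fullLength - maxAllowed);
    if (prevOffset >= minOffset) {
        return prevOffset; // everything still fits: don't move the window
    }
    // Overflow: jump ahead so FREE_FRACTION of the budget is slack,
    // paying one reprocess now instead of one on every generation.
    return fullLength - Math.floor(maxAllowed * (1 - FREE_FRACTION));
}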

jojorne commented 7 months ago

I saw two interesting pieces of news today: LLM support on Android through MediaPipe, and this in a llama.cpp PR: "there might be more prompt re-processing than necessary in the server example, especially if your client trims the end of the output" and "State backtracking - Would be very useful to reduce prompt reprocessing". I'll keep my fingers crossed. 🤞