LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

[Feature request] Ability to cache context between runs for faster initial generation of the same history (after app restart) #445

Open aleksusklim opened 9 months ago

aleksusklim commented 9 months ago

I mean the context of the "current" generation, the one that makes regeneration of the latest action and minor edits to the history super-fast.

I propose adding an option/parameter that points to a local binary file. If it exists, koboldcpp should read the context from there. While this option is active, koboldcpp should update this file with each new context after generation (read once at start, rewrite/append during runtime).

I can name two main reasons why this would be extremely useful:

Yes, I understand that with any accidental move I can easily destroy the cache (loading the wrong history, adding a space at the start of the text, etc.), which would ultimately lead to full regeneration. But if I play locally and alone, that would be only my own fault. If I avoid that, I can restart my system anytime and continue playing instantly later.

Also, for the use case of big story templates, you might add an additional option to make this context read-only, as in https://github.com/ggerganov/llama.cpp/pull/1640

aleksusklim commented 9 months ago

Follow-up: is it technically possible to "download" the current context history from the koboldcpp server? As text (rendered server-side, if it is stored as tokens in memory).

If so, can we have an action (somewhere in the settings window of Lite) to replace the current history in the browser with its fresh copy from the server? This is needed for three cases:

  1. If the browser lagged badly and somehow destroyed the history (for example, when the tab gets discarded due to low RAM; I saw that myself! The history was blank after the tab unexpectedly reloaded mid-generation; then I grabbed whatever was printed on the console and stuffed it into the conversation .json file; that worked, and the context cache was not regenerated).
  2. If the browser was disconnected before the result could be sent. The current way to recover the model response is to copy it from the console manually. Simple case: a tab reload (due to low RAM again) or a browser restart mid-generation.
  3. If the user accidentally messed up the history, which would lead to a cached context mismatch and full regeneration. Downloading the actual history ensures that the next submit won't lead to BLAS re-evaluation. This might also be useful when changing modes (from Story to Instruct and back), to be sure that no stray non-stripped spaces lead to a context mismatch again.

Cautions:

Overall, this would be super useful together with my original on-topic suggestion about caching contexts to files: you could load a binary context from a file when starting the server, and then get its actual text in the browser to continue flawlessly!

LostRuins commented 9 months ago

This is probably not very suitable for KoboldCpp - the context would be massive (few gigabytes) and the cost of writing it to disk from memory or storing it somewhere will be prohibitive and slow, for something that becomes useless shortly after. Have you tried using GPU acceleration? Processing speeds increase greatly when using Cublas or CLBlast compared to without.
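For a rough sense of scale, here is a back-of-the-envelope sketch of how big a single fp16 KV-cache snapshot is; the layer/head numbers are assumptions taken from the 7B model log further down this thread, not a measurement of any particular setup:

```js
// Rough size of one KV-cache state: 2 tensors (K and V) per layer, fp16 values.
function kvCacheBytes(nLayer, nCtx, nHeadKv, headDim, bytesPerValue = 2) {
  return 2 * nLayer * nCtx * nHeadKv * headDim * bytesPerValue;
}

const bytes = kvCacheBytes(32, 4096, 8, 128); // 32 layers, 4096 ctx, 8 KV heads, head dim 128
console.log((bytes / (1024 * 1024)).toFixed(0) + ' MiB'); // ~512 MiB for this 7B configuration
```

That matches the kv self size = 512.00 MiB reported in the 7B log below; larger models and longer contexts scale the same formula into the multi-gigabyte range mentioned above.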

YulienPohl commented 9 months ago

This could also be beneficial for group chats where all tokens are reprocessed every round. Or am I doing something wrong?

aleksusklim commented 9 months ago

the context would be massive (few gigabytes) and the cost of writing it to disk from memory or storing it somewhere will be prohibitive and slow

Then it should be rewritten partially, so that only the last response is written at the end of the file from the proper offset… Now I understand that it would require quite a massive amount of code to implement properly, along with a custom file format.

You think it is not worth it?

(Yes, I know about CLBlast, but I have different machines, including a weak one with 16 GB RAM + 4 GB VRAM where full re-evaluation is a pain!)

Vladonai commented 8 months ago

I too think this feature could be very useful for huge models. The implementation could be like this: a -usesavecache parameter at startup = save cache (by current model name) at exit (by Ctrl+C). When restarting the program, check by model name and if cache and model names coincide, then load the saved cache. This would not load the system on each request, but the speed of the first response would SIGNIFICANTLY increase.

aleksusklim commented 8 months ago

save cache (by current model name) at exit (by Ctrl+C).

This won't help with a system crash: the cache won't be saved. I'd rather opt in to saving the cache before each generation, unless nothing has changed (e.g. regeneration of the last query).

check by model name and if cache and model names coincide, then load the saved cache

It is not enough to check only the name; rather, metadata evaluation would be required (e.g. a mismatched koboldcpp version if the cache was left over from an old one). And still, the file format would have to be defined and implemented, which is not trivial.

I've checked LM Studio (https://lmstudio.ai/), and it looks like they've implemented full cache saving at runtime for each dialog separately, printing the used disk space.

Vladonai commented 8 months ago

save cache (by current model name) at exit (by Ctrl+C).

This won't help with a system crash: the cache won't be saved. I'd rather opt in to saving the cache before each generation, unless nothing has changed (e.g. regeneration of the last query).

Yes, but to be honest my koboldcpp doesn't crash most of the time. It's a very stable program. So some light tuning is certainly desirable, but writing the cache after each request would cost too much IMHO.

check by model name and if cache and model names coincide, then load the saved cache

It is not enough to check only the name; rather, metadata evaluation would be required (e.g. a mismatched koboldcpp version if the cache was left over from an old one). And still, the file format would have to be defined and implemented, which is not trivial.

Well, yes, but the details of implementation are best left to the author :)

aleksusklim commented 8 months ago

I didn't say that it's koboldcpp that is crashing, but my system! On one particular low-RAM machine I experience hangs and crashes after using koboldcpp (or any other engine that runs GGML models).

For example, I can load a 13B model with 16 GB of RAM just fine (taking, for example, 15.7/16.0 GB used in total by all processes), then close koboldcpp, and then get a system hang after opening Firefox because something in memory was already corrupted! (I already tried changing my RAM, booting from another Windows disk, using Safe mode, etc.; previously it happened with 7B models on 8 GB of RAM.) Or it could crash after my workstation locks due to inactivity, unable to log in again while the model is loaded in RAM.

I know that this is a completely different problem, but having the context cached would make those crashes less painful.

On the other hand, on my high-RAM machine (64 GB RAM plus 12 GB VRAM) not only does it not crash at all, but the full context re-evaluation is pretty fast. So, yeah… "Just get a higher-tier PC" is the solution here too.

Vladonai commented 8 months ago

On the other hand, on my high-RAM machine (64 GB RAM plus 12 GB VRAM) not only does it not crash at all, but the full context re-evaluation is pretty fast. So, yeah… "Just get a higher-tier PC" is the solution here too.

Starting with the 70B models, even on a "higher-tier PC" the first response at a context size of 4K takes too long to wait for. And that is not the limit: GGUF models support context sizes up to 16K, and such a context may require some solution even for a 13B model. In general, we must admit that the problem does exist.

e576082c commented 8 months ago

From the perspective of a simple user, it "feels like" a lot more time is being spent on "prompt ingestion", "prompt processing", or whatever the first step before the actual text generation is.

The actual text generation speed, with streaming enabled, never "feels like" it's slow, because I can spend time actively reading the text while it is being generated.

However, if I change tasks, scenarios, or characters, or make any edit in the chat history, then the next generation always has a very noticeable delay, as if the process were hanging and frozen; it always "feels like" it's slow, even with cuBLAS, OpenBLAS, CLBlast, or whatever magical accelerator backend I choose. If the context or history is changed, the beginning of the next generation will always "feel slow".

I don't know how this could be solved or improved, but it would be good if some miracle happened. lol

Might be related: [Enhancement] Multiple instances of koboldcpp at a time using the same model, also using different models simultaneously. #163

[Question] Confused about how / why full BLAS processing happens #386

When editing response, reprocessing prompt length varies randomly #385

LostRuins commented 8 months ago

In the next version, there will be a brand new feature for context shifting that should allow you to significantly reduce processing time.

e576082c commented 8 months ago

In the next version, there will be a brand new feature for context shifting that should allow you to significantly reduce processing time.

Thank you very much! That would be great. :)

aleksusklim commented 8 months ago

I can reply to this from what I know. (Let's pretend that the new "context shifting" does not exist yet; still, everything I'll say will be useful!)

  1. Any model has a "context window" measured in tokens. For Llama 1 it was 2048; for Llama 2 it is 4096.
  2. The model physically cannot process anything beyond the pre-allocated context memory. To claim more space, you have to allocate more from the start.
  3. Most models do not perform well when their context window is enlarged beyond their native length. The quality degrades quickly.
  4. Several methods of extending the context length have been proposed, based on "RoPE" scaling. One of its variants requires fine-tuning models on longer contexts before it can be used.
  5. Luckily, for GGUF models the correct parameters for extending the context are determined automatically; that's why in koboldcpp it is as simple as changing your desired context length when starting the server.

Notes:

  1. You cannot easily see how long your current context is. (Why? What is the "token budget" displayed in the corner? Isn't it better to display the actual token counter and limit there, since they are already exposed in the API? See the small sketch after these notes.) Good news: recent versions of koboldcpp print this after each generation as ContextLimit: <used>/<total>
  2. Your client (browser) knows the context length too, and can adjust it. For example, your server may be running 4096 while your client is still on 2048 tokens. (Why? Isn't it better to always fall back to the server value? What is the point of having a smaller context size in Lite?) You should always open Settings and move Max Ctx. Tokens to the right, making sure the numbers above and below its right part are equal.
  3. The slider Amount to Gen. is actually paid out of the total context size, because it reserves room for the model to answer. For example, if your context is 2048 but you set Amount to Gen. = 512, then your effective context for history will be 1536, even if the model stops early and never writes that much text.
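Since the exact count is exposed over the API, here is a minimal sketch of querying it yourself (it uses the /api/extra/tokencount endpoint that the test script later in this thread also relies on, and assumes the default port 5001):

```js
// Minimal sketch: ask the running koboldcpp server how many tokens a piece of text occupies.
// Assumes koboldcpp is listening on the default port 5001.
async function tokenCount(text) {
  const res = await fetch('http://localhost:5001/api/extra/tokencount', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ prompt: text }),
  });
  return (await res.json()).value; // token count according to the server's own tokenizer
}

// Example: check how much of the context window a piece of history occupies.
tokenCount('Once upon a time in a Wonderland...').then(n => console.log('tokens used:', n));
```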

What will happen if you run out of context?

  1. Each normal turn consists of two stages: processing your prompt ("BLAS") and generating the response ("streaming"). Everything that was already processed is cached in memory, so when you regenerate the last output without changing anything, there is no processing/BLAS part (it shows up as 1 token), only generation right away.
  2. Processing takes a long time and is invisible in the browser. The longer the text of your turn, the longer you'll have to wait for generation to start (but various BLAS optimizations can make this pretty quick, especially with a good GPU).
  3. When the context has no room for the next line, koboldcpp will trim the history from the beginning. This means that some of your earlier turns will effectively disappear from memory completely. The model will never see them from now on.
  4. If you have anything in the Memory tab, it will be preserved as the beginning of the history. Actually, Memory is just a way to split the history into "keep this always" and "delete if needed" parts, so whenever you run out of context only a few of the topmost lines are deleted (so there will be a gap between "memory" and "dialogue", as if a book were missing a page right after the introduction of the characters).
  5. Since the context was trimmed and recreated, it has to be BLAS-processed from the start. There is no way for the model to magically shift it without re-evaluation (YET!?). And since your context was already full, this will take the longest possible time, just as if you had restarted the server. It is more than 20 times slower than your regular play speed, given that normal turns are about 100-150 tokens while the history is somewhere around 2000-3000 and must be processed from scratch.

What can you do about it?

  1. You may continue playing as-is. Each of your turns will trigger a full context re-evaluation, but you might be fine with that, especially if the model fits in the GPU entirely. Regenerations of the last reply will still be instant; only your own turn will become slow.
  2. You can manually delete something from the beginning of your story. For example, you can describe half of the actions as an extended introduction and remove the redundant lines. This way you can free a lot of context room yourself, in full control of what the model will know.
  3. You can enable Use SmartContext when starting the server. It will split the history for you the next time you run out of context (and otherwise won't hurt if you never overflow). Actually, it will simply cut the history in two, deleting the first half (and preserving the Memory), so that each time you will continue from a half-empty context window. This way you can more or less play normally, but experience full evaluations only occasionally. Just be warned that the model might forget many of the older events.
  4. You can restart the server and increase Context Size. Personally, I see no quality degradation for Llama 2 models if I run them with 8192 (no Custom RoPE Config!) instead of the native 4096 right from the beginning. But if you do, fine: play with 4096 and restart at 8192 when you run out of context. This way you will double your window for free, and the model will still write coherent text because it has a good history to mimic. If you hit 8192, increase to 12288; it will still work, as long as you continue playing and don't start a new game (where it would be noticeably worse). If you hit 12288, you can try going to 16384, but unfortunately you will have to regenerate a lot, since the degradation becomes obvious.

To protect yourself from an accidental context reevaluation, you can:

  1. Keep your "amount to generate" low in the browser (I prefer 96). Even if the model sometimes won't have a chance to finish its last sentence, you can just let it continue again; just make sure Trim Sentences is unchecked in the Advanced tab.
  2. Do not use "Author's Note" (in Memory). Instead, explain your intent to the model yourself; for example, some models like Mythalion respect <|system|> messages, which you can embed at will (for example, by adding them to the end of the last output) and delete when they are not needed anymore (this will re-evaluate everything below your edit, of course).
  3. Look at the console when you feel the end is close. Maybe edit out some unnecessary dialogue lines whose intent was already settled and which won't literally affect your planned story later.
  4. Be careful not to edit anything in the textbox unintentionally, and always undo your accidental changes.

As for my own confusion: LostRuins, can you explain why we see a rough token budget instead of a precise token counter and context size? And also, what's the point of adjusting Max Ctx. Tokens in Lite, if the server will ignore a too-large value, and there is no practical point in having it smaller than what is currently available in koboldcpp? Or is there?

Vladonai commented 8 months ago

3. You can enable Use SmartContext when starting the server. It will split the history for you the next time you run out of context (and otherwise won't hurt if you never overflow). Actually, it will simply cut the history in two, deleting the first half (and preserving the Memory), so that each time you will continue from a half-empty context window. This way you can more or less play normally, but experience full evaluations only occasionally. Just be warned that the model might forget many of the older events.

I'm interested in how this mechanism works now. Because character descriptions are usually at the beginning of the context, and if the --smartcontext parameter cuts off the upper half of the context and writes the lower half there, it turns out that the model doesn't know anything about the characters. (In reality, this is not quite true). I would just like to understand how the mechanism works - and the new one as well.

aleksusklim commented 8 months ago
  1. Everything you put in Memory will stay at the top. This is implicitly used when you load a character JSON file (just check the memory and you'll see it).
  2. The model is quite capable of continuing whatever dialogue you have so far. Even without any explicit description of your characters, the model can figure out their names, gender and species, approximate age, personality and writing style. It is trained to continue no matter what, without asking "Hey, who is that guy and how did he appear in this story?"

When you have an already-established good history, you can switch to a completely different model, and most likely it will continue in the same spirit as before, because it adjusts and tries to mimic, even if it doesn't know the correct chat/instruction format that you've used.

Vladonai commented 8 months ago
  1. Everything you put in Memory will stay at the top. This is implicitly used when you load a character JSON file (just check the memory and you'll see it).

In the Kobold Lite client - maybe, but what about developers of third-party applications that use API access? Here is the structure I use:

            var parameters = new
            {                
                n = 1,
                max_context_length = maxTokenCount,
                max_length = replyTokenCount, // 300 
                rep_pen = 1.19,
                temperature = 1.1,
                top_p = 1.0, // disabled
                top_k = 0,
                top_a = 0,
                typical = 1.0, // disabled
                tfs = 0.95,
                mirostat = 2, // type
                mirostat_tau = 0.5,
                mirostat_eta = 0.1,
                rep_pen_range = maxTokenCount,
                rep_pen_slope = 1.10,
                sampler_order = new int[] { 6, 0, 1, 2, 3, 4, 5 }, // default [6,0,1,3,4,2,5]
                prompt = myPrompt,
                quiet = true,
                stop_sequence = stop_seq
            };

The application is passed the whole prompt as it is ( prompt = myPrompt ), and then it works with it as it wants. I can double the context window and insert character descriptions after the first half, so that's definitely handled by the model. But I'd still like to know exactly how --smartcontext works :)

aleksusklim commented 8 months ago

I think it's in https://github.com/LostRuins/koboldcpp/blob/6a4d9c26e1eeb2119171b3ea21444d940c7a8a14/model_adapter.cpp#L393 and below:

…
const int SCCtxLenThreshold = nctx * 0.8; //how much context length must be reach to trigger smartcontext
const int SCInpLenThreshold = nctx * 0.6; //how big must the input array be to trigger smartcontext
const int SCPastLenThreshold = nctx * 0.5; //how wide of a gap between the fast forwarded past and the present to trigger smart context
const float SCTruncationRatio = 0.5; //ratio for how many tokens to fast forward
const int SCTokThreshold = 32 + (nctx*0.05); //how many tokens of similarity triggers smartcontext
…
//smart context mode, detect if we have a shifted context at max length
//requirement: previous context was at least nctx/2 longer than current,
//mode is on, and current context already maxed.
…
//determine longest common substring after removing start part
int shiftamt = embd_inp.size() * SCTruncationRatio;
smartcontext = std::vector<int>(embd_inp.begin() + shiftamt, embd_inp.end());
printf("\n[New Smart Context Triggered! Buffered Token Allowance: %d]",shiftamt);
…
//if max ctx length is exceeded, chop the prompt in half after the start part, and memorize it. The memorized part becomes LCS marker.
//when a future prompt comes in, find the LCS again. If LCS > a length and LCS starts with memorized LCS
//remove all tokens between start part and start of LCS in new prompt, thus avoiding shift
//if LCS not found or mismatched, regenerate. chop new prompt and repeat from step B
…
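To make those constants concrete, here is a small sketch that just plugs in an example context size (4096 is an assumed value for illustration; Math.floor mirrors the C++ int conversions):

```js
// Evaluate the SmartContext thresholds quoted above for an example context size.
const nctx = 4096; // assumed context size, for illustration only
const SCCtxLenThreshold  = Math.floor(nctx * 0.8);       // 3276: context length needed before SmartContext can trigger
const SCInpLenThreshold  = Math.floor(nctx * 0.6);       // 2457: input size needed to trigger
const SCPastLenThreshold = Math.floor(nctx * 0.5);       // 2048: required gap between the fast-forwarded past and the present
const SCTruncationRatio  = 0.5;                          // fast forward (memorize) half of the input
const SCTokThreshold     = Math.floor(32 + nctx * 0.05); // 236: similarity tokens that trigger smartcontext
console.log({ SCCtxLenThreshold, SCInpLenThreshold, SCPastLenThreshold, SCTruncationRatio, SCTokThreshold });
```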
e576082c commented 8 months ago

3. When the context has no room for the next line, koboldcpp will trim the history from the beginning. This means that some of your earlier turns will effectively disappear from memory completely. The model will never see them from now on.

4. If you have anything in the Memory tab, it will be preserved as the beginning of the history. Actually, Memory is just a way to split the history into "keep this always" and "delete if needed" parts, so whenever you run out of context only a few of the topmost lines are deleted (so there will be a gap between "memory" and "dialogue", as if a book were missing a page right after the introduction of the characters).

3. You can enable Use SmartContext when starting the server. It will split the history for you the next time you run out of context (and otherwise won't hurt if you never overflow). Actually, it will simply cut the history in two, deleting the first half (and preserving the Memory), so that each time you will continue from a half-empty context window. This way you can more or less play normally, but experience full evaluations only occasionally. Just be warned that the model might forget many of the older events.

I'm interested in how this mechanism works now. Because character descriptions are usually at the beginning of the context, and if the --smartcontext parameter cuts off the upper half of the context and writes the lower half there, it turns out that the model doesn't know anything about the characters. (In reality, this is not quite true). I would just like to understand how the mechanism works - and the new one as well.

Thanks for the super detailed explanation! @aleksusklim

If "Memory" and SmartContext work in this way with the default koboldAILite UI, then as a simple user, I would very much like to know, what can I do to use this feature in SillyTavern. I mean, without coding, can I do anything to communicate to koboldcpp to treat my system prompt and characters as "Memory", and indicate somehow, when the actual chat log starts? Maybe with a modded template?

Is there any simple specific character that indicates the end of the "Memory" field for koboldcpp? Like, for example, the first |> or the first ], or anything like that? @LostRuins

So far I have always avoided using SmartContext, because I was very unsure about what it does, and I absolutely wanted to be sure that the start of my prompt (including the system task, world lore, and character descriptions) would remain intact in place at the start of my prompt.

When it comes to preventing characters from showing Alzheimer's-like symptoms and forgetting their own personality, it is super important to preserve the system prompt and character descriptions, especially if the characters are originally made up by my imagination and not well known to the language model being used.

2. The model is quite capable of continuing whatever dialogue you have so far. Even without any explicit description of your characters, the model can figure out their names, gender and species, approximate age, personality and writing style. It is trained to continue no matter what, without asking "Hey, who is that guy and how did he appear in this story?"

Yes, of course it can just go on and continue whatever is thrown at it (as an overly smart text auto-completion trick), but for an RPG it is more important that the system task and the character descriptions are kept in place, so nothing acts out of character (even if past events/turns may get forgotten).

aleksusklim commented 8 months ago

In the next version, there will be a brand new feature for context shifting that should allow you to significantly reduce processing time.

I've just read https://github.com/mit-han-lab/streaming-llm, and this is indeed very promising! Much better than SmartContext and plays nicely with Memory too.

It's a game changer both for this Issue (theoretical context caching) and that Issue: https://github.com/LostRuins/koboldcpp/issues/492 (having too long history in browser).

LostRuins commented 8 months ago

ContextShifting is now live in v1.48. Please try it.

Vladonai commented 8 months ago

ContextShifting is now live in v1.48. Please try it.

At first impression, the result is much more efficient than SmartContext. But it doesn't solve the "first response" problem - the existing context still has to be processed completely, which takes time. A context cache could solve this problem. Regarding "[Context Shifting: Erased 25 tokens at position 2]": I would like to be able to set the starting position, to be able to save a custom "Memory". Maybe we should introduce a special tag "[KobMemEnd]" to define such a position? The user, having inserted it, will be sure that everything set before this tag will not be erased.

aleksusklim commented 8 months ago

it doesn't solve the "first response" problem

to save a custom "Memory". Maybe we should introduce a special tag "[KobMemEnd]"

Ultimately, this will not resolve the slow initial load problem even if we cache everything in Memory, because, generally speaking, the memory is not as large as the main history/dialogue content.

Caching should support arbitrary lengths, rolled or not. By the way, the upstream main.exe has its own prompt cache. Maybe its format can be adopted as-is, by copy-pasting the relevant code? (I didn't look at it myself, but I've used those cache files when the --prompt-cache option was first introduced.)

The user, having inserted it, will be sure that everything set before this tag will not be erased.

Having a tag {{[MEMORY]}} (in line with {{[INPUT]}} and {{[OUTPUT]}}) would indeed help third-party clients easily support the memory feature without defining it explicitly.

aleksusklim commented 8 months ago

ContextShifting is now live in v1.48. Please try it.

@LostRuins, I tried 1.48.1 and it looks like I don't understand something.

  1. I've set Context Size in GUI to 256 (minimum available). I've set Max Ctx. Tokens in Lite to 256 as well. And I've set Amount to Gen. to 64.
  2. I've started a new game in Adventure Mode without the Adventure Prompt, and have put a MEM string into the Memory tab.
  3. I keep sending > test as actions and watching the console.

I see that:

  1. ContextShift is not kicking in! It prints Processing Prompt [BLAS] after an overflow. But I'm sure it said SmartContext: False, ContextShift: True during initialization.
  2. The Memory context is sent literally in the "prompt": "MEM\n\n…" field, without marking it as an explicit immutable part. How is it supposed to work?
  3. After an overflow, the client sends already-truncated input, split somewhere in the middle of a line, not at a newline! I'm afraid that even if it worked, it could produce the wrong format for the model (for example, the user might want each line to start with a special mark, but the very first line after the memory will start at a random word most of the time).

Doesn't the server need either of:

As for truncation, I think the text processing should be something like:

  1. Cut the topmost line ending with \n; if not enough, repeat; if too much (a heuristic value? For example, "the line was longer than 5% of the history"), undo and go to the next step.
  2. Cut the topmost sentence ending with proper punctuation (I think you already have a sentence-detection function). If not enough, repeat; if too much (the same threshold again, relative to the total amount of discarded text), undo and go to the next step.
  3. Cut the topmost word at the next space character. Same rules again.
  4. Cut by individual tokens, as many as needed. (Most of the time point 1 will be enough; if there are no newlines, point 2 will do what it can; if for some very weird reason there is no correct punctuation, point 3 will cut exactly as it does right now; and if there are no spaces, just to be sure that we don't break, cut by tokens in point 4.)

By the way, for me the Trim Sentences checkbox in Lite is still trimming the incomplete model output even when unchecked! (The console shows more text than gets into the browser.) Or what is it supposed to do?

Since this post contains several semi-unrelated observations, I can just as well open a new Issue for each of them if you want, providing screenshots or console outputs.

LostRuins commented 8 months ago
  1. Context Shifting has a minimum required substring length, which is currently set to 256 tokens, therefore it won't kick in when your max context is 256. It also doesn't kick in if the estimated processing length is < 256 tokens. Try conducting your experiments with 512 context length instead.
  2. The memory is determined automatically by testing both the old and new contexts for matching tokens at the start, and fast forwarding that portion (considered as memory). Thereafter, there's a round of context shifting to remove discarded tokens from the old context, followed by a second fast forwarding to skip past the remainder of the matched similarity tokens. Please refer to this function: https://github.com/LostRuins/koboldcpp/blob/concedo/gpttype_adapter.cpp#L592
  3. Token truncation is done at word and sentence boundaries. Please refer to this function here: https://github.com/LostRuins/lite.koboldai.net/blob/main/index.html#L7620 - this seems to provide adequate results in most cases, and it's format-independent (very important). For example, the user might be trying to generate a giant paragraph of text - in that case there may not even be any newlines or punctuation inside.
LostRuins commented 8 months ago

As for Trim Sentences, make sure you're in story mode to receive the raw output. Adventure mode will always format to sentences.

Vladonai commented 8 months ago

2. The memory is determined automatically by testing both the old and new contexts for matching tokens at the start, and fast forwarding that portion (considered as memory). Thereafter, there's a round of context shifting to remove discarded tokens from the old context, followed by a second fast forwarding to skip past the remainder of the matched similarity tokens. Please refer to this function: https://github.com/LostRuins/koboldcpp/blob/concedo/gpttype_adapter.cpp#L592

That's a great idea, except it doesn't work for some reason: it's always [Context Shifting: Erased (num) tokens at position 2]. And furthermore: what happens if I want to change the memory? Add a new character, or change something about an old one? The memory area should be specified explicitly, IMHO.

aleksusklim commented 8 months ago

Token truncation is done at word and sentence boundaries.

I see. I tried to propose iterative attempts to cut: given a safe length, first try at newlines, then at sentence boundaries, and finally at words or characters.

For example, substring_to_boundary can be implemented like this:

function substring_to_boundary(input_string, maxlen)
{
    if(input_string.length <= maxlen)
    {
        return input_string;
    }
    else
    {
        let cutoff = input_string.length - maxlen;
        let trim = input_string.substring(cutoff);
        let safe_limit = Math.min(300, Math.floor(maxlen/20)); // no more than 300 chars or 5% of max length
        let match_lines = trim.match(/^.*?\n+\s*/);
        if(match_lines && match_lines[0].length <= safe_limit)
        {
            trim = trim.substring(match_lines[0].length-1); // always do -1 to include leading token!
        }
        else
        {
            let match_sentences = trim.match(/^.*?([.!?*")}`\];]+\s*)/);
            if(match_sentences && match_sentences[0].length <= safe_limit)
            {
                trim = trim.substring(match_sentences[0].length-1);
            }
            else
            {
                let match_words = trim.match(/^.*?[,\s]+\s*/);
                if(match_words && match_words[0].length <= safe_limit)
                {
                    trim = trim.substring(match_words[0].length-1);
                } // if unable to trim safely, do not trim
            }
        }
        return trim;
    }
}

This version makes up to three regexp matches (if necessary). Each one matches everything from the start of the string up to (and including) the first found "separator". The first time, the separator set contains only newlines; the second time the set is the same as yours but without spaces; and finally, it is a space or a comma. Consecutive separators are matched together, along with any occasional spaces directly after them. (Your version always splits at the first found separator.)

Since you explicitly return the separator itself as part of the string, I take the substring at "matched length minus one", since all of those separators and spaces are one character long.

You set "trim safely" threshold to be only 20 characters, which is too low for full-line matching! I propose something as high as 300, but capping it relative to the maxlen allowed, for example at 5% – so it would be min(300, maxlen/20)

Example results:


var str = 'Once upon a time... in a Wonderland...\n\nThere was a noble warrior. He fought the darkness... Never giving up and always protecting innocent people in the kingdom of '+'X'.repeat(800);
console.log(str.length); // 965

substring_to_boundary(str,930);
'\nThere was a noble warrior. He fought <…>'

substring_to_boundary(str,890);
' Never giving up and always protecting <…>'

substring_to_boundary(str,850);
' protecting innocent people in the kingdom <…>'

substring_to_boundary(str,800);
'XXXXXXXXXXX<…>'

I don't want to make a PR; just take my code fully or partially - if you agree with me, of course.

By the way, what if such "text-level" trimming (yours or mine) actually tokenizes differently than the original full text? Wouldn't that cause trouble?

LostRuins commented 8 months ago

@Vladonai it's actually working, but I think the problem is my token estimator is off, and the memory itself is being partially truncated too, causing poor results. I'll reduce the tokenizer limit and that should fix the issue. If you wanna test again, connect to koboldcpp as a custom endpoint from https://lite.koboldai.net and see if it works now (you can use the same koboldcpp)

@aleksusklim I think one problem with that approach is that it's very inconsistent. The amount that gets removed will vary wildly, and this may lead to even more unpredictable results. It's something to keep in mind, but I'm not sure it's the best approach.

And yes, if the trimming ever causes different tokenizing results, then everything will have to be reprocessed from scratch. Trimming at string boundaries is pretty safe though, since llama never has a space in the middle of a token.

aleksusklim commented 8 months ago

It also doesn't kick in if the estimated processing length is < 256 tokens. Try conducting your experiments with 512 context length instead.

All right, I've repeated with 512, and then with 4096 to be double-sure.

So far, this is what I see:

So it can't backtrack from the end? This means the user cannot edit his last turns reliably, especially the model's previous reply after the player has already taken his next turn.

If you want, I can prepare exact steps to reproduce, if it's needed.

P.S. Why is it that I've set 4096 in both the server and Lite, and Amount to Gen. to 512, which should give 3584 for the history, but in the console I see BLAS for 2306 after regeneration? Are you estimating tokens!? As if Lite doesn't know the actual tokens, estimates them pessimistically, cuts the text and sends it to the server? In that case, it would indeed be better to have a dedicated field for Memory (or a textual mark~) so that Lite could send an optimistic cut (larger than the context length) and koboldcpp could cut it properly token-wise. (I tried naively increasing Max Ctx. Tokens in Lite to 8096: it sent 5920 for BLAS, which took its time to process, but then Failed to predict! Check your context buffer sizes! - but why BLAS it in the first place? At best it should shift the context as I just described; at worst, refuse to even try.)

Vladonai commented 8 months ago

@Vladonai it's actually working, but I think the problem is my token estimator is off, and the memory itself is being partially truncated too, causing poor results. I'll reduce the tokenizer limit and that should fix the issue. If you wanna test again, connect to koboldcpp as a custom endpoint from https://lite.koboldai.net and see if it works now (you can use the same koboldcpp)

I tried it and now it works. But the question of modifying the history remains open. In the current implementation you can't add anything new to it and you can't modify it either - it will lead to wrong results.

aleksusklim commented 8 months ago

Lite should somehow highlight the parts that were cut (when in editable mode), so the user understands what the model has already forgotten.

It looks like this is not trivial, because it is koboldcpp that shifts the context silently for Lite.

(Editing Memory would either give wrong results or should trigger re-evaluation, since there is no way to change something in the middle of the context. Well, truncating Memory from the end probably ought to work though; it might be shifted in with the new context.)

LostRuins commented 8 months ago

In the next version I plan to do that - submit memory and prompt separately so it'll be easier to match against.

For now if you want to see context, use --debugmode

BarfingLemurs commented 7 months ago

(I couldn't follow the whole conversation), but amen to this; it would allow CPU users to switch characters instantly rather than in 10 minutes.

aleksusklim commented 7 months ago

Where do you switch characters? Wherever it is, does it rewrite the top of the history?

So, something like Context = System prompt + Character1 + History + Last reply + New response, and you would switch this to Character2 of a hypothetical second context, which would re-evaluate only the last two responses instead of the whole history?

P.S. The ability to keep more than one context simultaneously to flawlessly switch between them is a different feature; for now you probably could try to abuse multiuser mode?

BarfingLemurs commented 7 months ago

I'm sorry, I meant something simple: could we save the current cache state here, in this menu:

From the demo in llama.cpp, loading the prompt cache takes a few seconds and one can continue where they left off in the conversation. When running this on mobile, reprocessing takes time :)

LostRuins commented 7 months ago

@aleksusklim if character2 had a different memory, yes everything would be reprocessed.

@BarfingLemurs sorry, loading cached prompt from disk is not supported at this time.

aleksusklim commented 7 months ago

if character2 had a different memory, yes everything would be reprocessed.

Even in multiuser mode? (I haven't used it yet, but I assumed that different "users" would have their own contexts in RAM. Do they not?)

Vladonai commented 7 months ago

It's all working. The combination of "memory" and "context shift" worked out very well. Of course, models tend to ignore what comes at the beginning of the context, but I'm very happy that they are now guaranteed to get that information.

I wonder how the new "authorsnote" field mechanism works (if it is already implemented). As far as I understand, this information is inserted at the end of the context. But at the very end there should be the name of the character who is supposed to respond. And while everything is simple with the "memory" field, it's not so clear with "authorsnote".

aleksusklim commented 7 months ago

While testing 13B-Psyfighter2 from KoboldAI I decided to also test ContextShift on a real task.

I've extracted the text from a "visual novel" game and gave it to the model, asking it to choose the next action from a list of possibilities each time (I used the Instruction+Input+Response format, where the "input" denotes the list of actions).

I put the introduction of the game into Memory, ending it with something like ### SYSTEM: Some of your past actions here are skipped due to memory limit. so that the model knows about it explicitly.

Each time the model reaches one of the endings, I tell it that in order to restart and continue playing, it needs to summarize the game plot, recap its past actions, and explain its future plans for how to win the game. (So that it could at least see that here when the real context gets shifted away.)

It worked fairly well, with ContextShift rolling the 8k context perfectly! But I see two reasons why the player might want to edit his previous turn (ultimately leading to full re-evaluation):

  1. Input may be erroneous with respect to the received output, especially with a short "amount to generate". Meaning: when I asked the model to recap, it started to talk. It generated my requested amount of tokens but hadn't finished its thought. I let it continue, and now I realize that I don't like what it said. From this point I want to regenerate the whole attempt (maybe emphasizing what I really want to hear), not just the last continuation of it! But the context has already shifted.
  2. Sometimes I want to be sure that the model is comprehending the story and not just blindly copying its older replies (which look identical, because after "restarting" the "game" I give it the same scenes with the same text). To do this, I can add another kind of "system prompt" asking the model to stop playing (as I did with <|system|>Exit RP mode. Now … with Mythalion earlier) and instead explain its last action. This works as expected (I see that the model indeed takes its actions on purpose, or at least pretends to), but since this is just a side test, I want to remove it from the story to continue the game as usual. But the context has shifted!

@LostRuins, I see two methods by which you could attempt to address this. The first is an explicit "save my context to this file" local command somehow directed at koboldcpp. This way the player would be able to save as many checkpoints as he wants.

The second is multiple context checkpoints in memory. For example, there can be three contexts: A, B, C. From the start, only context A is used. Just before ContextShift kicks in, context A is forked to slot B. The shifting continues on B until a fair amount of new tokens has been generated (either set by the user in the server settings, or somehow estimated like the SmartContext thresholds). Then context B forks to C, and from then on, each time the same threshold triggers, the oldest context is replaced by a copy of the newest one.

When there is a context mismatch and your code decides to regenerate everything, you check the other two contexts (from newest to oldest) to see whether one of them would still be useful. If so, the chosen one replaces the newest copy, evaluates everything that is new in the prompt (including the beginning of "what has already been BLASed, but in a different copy") and shifts it properly. Because it is better to re-evaluate e.g. 1024 tokens than almost 8192.

The reason for having three contexts instead of two is that with just two you cannot have a smooth transition: the change would be abrupt, and the user could still break both of them by editing right after the switch occurred.
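Here is a rough sketch of the rotation described above; forkContext, canContinueFrom and FORK_THRESHOLD are hypothetical placeholders for illustration, not existing koboldcpp functions or settings:

```js
// Rough sketch of the proposed three-slot checkpoint rotation.
// forkContext(), canContinueFrom() and FORK_THRESHOLD are hypothetical placeholders.
const FORK_THRESHOLD = 1024; // assumed: new tokens generated before the next checkpoint is taken
let slots = [];              // up to three context snapshots, oldest first
let tokensSinceFork = 0;

function afterGeneration(liveContext, newTokens) {
  tokensSinceFork += newTokens;
  if (tokensSinceFork >= FORK_THRESHOLD) {
    if (slots.length === 3) slots.shift(); // the oldest snapshot is dropped...
    slots.push(forkContext(liveContext));  // ...and replaced by a copy of the newest state
    tokensSinceFork = 0;
  }
}

function onPromptMismatch(newPrompt) {
  // A full re-evaluation is about to happen: check the snapshots from newest to oldest.
  for (let i = slots.length - 1; i >= 0; i--) {
    if (canContinueFrom(slots[i], newPrompt)) {
      return slots[i]; // re-evaluate only the differing tail (e.g. ~1024 tokens) instead of the whole history
    }
  }
  return null; // nothing usable: fall back to processing from scratch, as today
}
```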

Personally, I still think that saving to a file is much better than rotating contexts in memory, because:

You can include the JSON of the story in the context file too, so it could be loaded with the whole synchronized text! From the user's point of view, it could be understood as a "special copy of the history, which is very large but can be continued instantly when loaded with the same model" (meaning you can probe the file against the current model and koboldcpp version, and in case of a mismatch, grab only the JSON from it, clearly showing a warning to the user).

Yes, I know, the binary format is yet to be defined/implemented. But I think context shifting itself has changed something in the memory representation of the context in llama.cpp (for example, decoupling RoPE transformations from the data itself), right?

Isn't there a new way to swap/grab/change the context, so it can be "just" extracted as a blob?

LostRuins commented 7 months ago

Llama.cpp has actually explored some methods to snapshot the context and save it to disk. However, they are not implemented in KoboldCpp for multiple reasons, mainly that it doesn't work well with the existing shifting and fast-forwarding methods, and also requires massive amounts of disk space - on the order of gigabytes, just to store one state. Writing to and loading from disk is not as fast as you'd imagine; even on an SSD it will take some time for such large amounts of data to be written and read.

aleksusklim commented 7 months ago

I disagree with both points.

Firstly, the user has ALREADY "paid" a lot of disk space to run models: each 13B model takes 7-9 GB, and one model is never enough because every week we get new and better models! Anybody who seriously decides to follow the LLM or diffusion world quickly understands that everything will take up disk space. Personally, I've started buying Samsung 2TB NVMe (or SATA SSD) drives to fill all my available machines, both PCs and laptops. (This is the most cost-effective choice, considering the quality, price and size.)

Secondly, reading back a file that has just been written or read is almost as fast as copying memory, because it is still in the file system cache. (As long as the application reads in large chunks and not by individual bytes, of course.)

For example, on a rather weak notebook I can benchmark like this (on a 4.77 GB file): 7z h -bt synthia-7b-v1.2.Q5_K_M.gguf prints 12.305 (seconds) the first time; if I run it again, it is 5.159. Copying the file without compression, 7z a test.7z -mx=0 -bt synthia-7b-v1.2.Q5_K_M.gguf, took 10.815. To confirm the "read after write" case, 7z h -bt test.7z gives 5.066.

The only reason why the FS cache wouldn't work is a lack of free RAM, which is rather likely since the user has an LLM loaded into memory. But it all comes down to how much RAM you have. On a machine with just 16 GB I can barely fit a 13B model. And if I had 32 GB, there would be 16 free gigabytes for the context cache, even when it is stored in files! How much context can you fit into that? I believe it's a whole 2 separate contexts.

On my fast PC the same benchmark as above gives: first hashing = 3.226, second hashing = 1.996, archiving = 2.902, hashing the archive = 2.019.

The irony is, when you have a lot of memory, you probably have a good processor too, and maybe a GPU, so a full re-evaluation is not as slow as on a low-tier computer, for which context saving would be more beneficial.

But still, I don't believe that context export and re-import later (assuming a system crash, for example) would take more time than a full BLAS pass, even when the user doesn't have free memory to implicitly keep the FS cache.

aleksusklim commented 7 months ago

I've read the pingback from https://github.com/SillyTavern/SillyTavern/issues/1316 (basically, they want WorldInfo to be injected "somewhere near the end" of the history, but this is "an edit of the history above the last model reply", which triggers the full re-evaluation).

@LostRuins, can ContextShift be fixed so that it won't regenerate on large edits of the history?

I think there is a threshold that aims to determine, "is this a continuation of the previous story, or a completely new one?", and this threshold is too pessimistic. It should allow as much as 50% of edits down the line (for example, with a 4096 context, when the story is already 8096 tokens long, check for matching 6144 characters but not more). The longer the inherent context window, the better this would play.

LostRuins commented 7 months ago

So long as it's able to trim out the start part and match enough text in the middle part, contextshift won't require reprocessing. The amount that's required to match is actually very small.

aleksusklim commented 7 months ago

So long as it's able to trim out the start part and match enough text in the middle part, contextshift won't require reprocessing.

Okay, I decided to test it via the API. I've made a piece of JavaScript that sends generation requests right from the browser. To run it, navigate to http://localhost:5001/api and open the developer console to paste the script there.

Here is the code!

```js
var worldinfo_lines = 5;
var target_context = 3800;

async function token_count(text){
  var res = await fetch("http://localhost:5001/api/extra/tokencount", {
    "headers": {
      "accept": "*/*",
      "content-type": "application/json",
    },
    "body": JSON.stringify({
      "prompt": text,
    }),
    "method": "POST"
  }).then(function(r){ return r.json(); });
  return +res.value;
};

async function last_perf(){
  return await fetch("http://localhost:5001/api/extra/perf").then(function(r){ return r.json(); });
};

async function predict(memory,context){
  var res = await fetch("http://localhost:5001/api/v1/generate", {
    "headers": {
      "accept": "*/*",
      "content-type": "application/json",
    },
    "body": JSON.stringify({
      "memory": memory,
      "prompt": context,
      "stop_sequence": ["\n"],
      "genkey": "KCPP2261",
      "max_context_length": 4096,
      "max_length": 16,
      "sampler_order": [6,0,1,3,4,2,5],
      "rep_pen_range": 1024,
      "rep_pen_slope": 0.7,
      "n": 1,
      "temperature": 0.85,
      "min_p": 0.25,
      "rep_pen": 1.1,
      "top_p": 1,
      "top_k": 0,
      "top_a": 0,
      "typical": 1,
      "tfs": 1,
      "quiet": true,
      "use_default_badwordsids": false
    }),
    "method": "POST"
  }).then(function(r){ return r.json(); });
  return res.results[0].text;
};

function get_random_int(){
  return ('1'+Math.random()).split('.')[1].split('').reverse().join('').replace(/0+/,'');
};

async function rotate_history(context,reply){
  reply = reply.match(/\d+/)[0];
  var lines = context.replace(/\n?\[.*\]\n?/g,'\n').trim().split('\n');
  if(lines.length>worldinfo_lines){
    lines.splice(lines.length-worldinfo_lines,0,'['+get_random_int()+']');
  }
  context = lines.join('\n')+reply;
  var len = await token_count(context);
  while(len>target_context){
    context = context.replace(/^.*\n/,'');
    len = await token_count(context);
    console.log(len);
  }
  return context;
};

async function main(){
  var memory = 'SYSTEM: The user will give you a number and you should reply with any other number that is not present anywhere in this conversation. Ignore anything in square brackets.\n';
  var mem_count = await token_count(memory);
  console.log('MEM:',mem_count);
  var context = 'USER: '+get_random_int()+'\nMODEL: '+get_random_int()+'\nUSER: '+get_random_int()+'\nMODEL: ';
  var len = await token_count(context);
  while(len
```

What this code does:

  1. Assigns to the memory field a system prompt that asks the model to output random numbers and ignore everything in square brackets. (It does not matter what numbers the model actually prints.)
  2. Concatenates a faked USER+MODEL conversation until its token count becomes larger than the threshold value (currently 3800, not counting the memory length, which is 38).
  3. Asks the model to predict from the current text. Max length is set to 16 tokens, and only the integer number is parsed back from the output.
  4. Concatenates the response to the old context as a new line, along with the new turn prompt.
  5. Injects a number in square brackets several lines from the end (currently 5 lines), removing the existing line in square brackets if present.
  6. Checks the length of the new context and removes the first line iteratively until the full length in tokens becomes less than the threshold (the same value, 3800).
  7. Loops to point 3 forever (that is, predicting the next line again).

I checked this with koboldcpp-1.50.1 on the model sciphi-mistral-7b-32k.Q5_K_M.gguf, but I don't think the model matters here. I chose a 7B just to make predictions faster. The main context length should be set to 4096 as a reasonable real-world number.

And you know what? Sometimes BLAS reprocesses everything! But most of the time it does not. It is really random; I ran it about 10 times to make sure it always fails sooner or later.

It keeps printing [Context Shifting: …] along with Processing Prompt [BLAS] (175 / 175 tokens) (often between 170 and 190), but then suddenly, boom - BLAS shows 3823 tokens. The experiment continues, with everything shifting correctly and reliably. Until it breaks again! (You should wait for 5-50 generations, but this is fast.)

Why does context shifting fail sometimes? Have I missed something in my code? How should I trim the text properly?

Actually, I was preparing deeper experiments, but even this simple loop fails with some probability! I don't see the point in experimenting further until I understand what's going on here.

LostRuins commented 7 months ago

my guess is that your rotate history function is not behaving correctly. There may be instances where it is off by some amount, and tokens that should have been excluded remain or vice versa.

Try simplifying your example, for example just start by appending to the string instead of rotating it first.

if you want to see how context shift works under the hood, you can check out the code here https://github.com/LostRuins/koboldcpp/blob/concedo/gpttype_adapter.cpp#L593

aleksusklim commented 7 months ago

There may be instances where it is off by some amount, and tokens that should have been excluded remain or vice versa.

I'm sure I am just removing the first line until it fits (.* does not match line breaks in JS):

  while(len>target_context){
    context = context.replace(/^.*\n/,'');
    len = await token_count(context);
    console.log(len);
  }

Since you said that \n is safe to split at, as it never tokenizes differently, I should be safe too?

start by appending to the string instead of rotating it first.

I tried! Then koboldcpp said something like "It's failed, check your buffers!" and I assumed it refuses to process anything that tokenizes to more than the allowed amount.

Am I wrong?

LostRuins commented 7 months ago

Yes, you are doing something wrong.

You need to make sure that the --contextsize parameter that you launch koboldcpp with is greater than or equal to the max_context_length that is sent over the API request. This will show up as a warning in the console if not configured correctly.

If configured correctly, you can send prompts as long as you like. The truncation will happen automatically and nothing will ever overflow or fail to run.

aleksusklim commented 7 months ago

If configured correctly, you can send prompts as long as you like.

Um-m… okay. I tested it again, most simply and directly. And it worked! Just as you said: no matter how long the context is, it trims it internally on its own. So, what's the problem? Maybe I made a mistake when testing?

I've tested it again under different conditions (BLAS device, length of Memory…) and managed to get the error Failed to predict! Check your context buffer sizes! again. I believe my context size is 4096 and the request specifies the same value too.

Well, assuming you are right and I am wrong, I ran my previous code, but with:

var worldinfo_lines = 10;
var target_context = 5000;

This will trim the context to 5000 tokens while specifying 4096, just like the server's context size. I should either get correct shifting, or a full re-evaluation (as before), right?

But I get the error about the buffer size! Here are the full logs for your examination:

koboldcpp-1.50.1.exe --model sciphi-mistral-7b-32k.Q5_K_M.gguf --port 5001 --host 127.0.0.1 --launch --threads 8 --contextsize 4096 --blasbatchsize 256 --skiplauncher

Console output ``` >koboldcpp-1.50.1.exe --model sciphi-mistral-7b-32k.Q5_K_M.gguf --port 5001 --host 127.0.0.1 --launch --threads 8 --contextsize 4096 --blasbatchsize 256 –skiplauncher *** Welcome to KoboldCpp - Version 1.50.1 Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required. Initializing dynamic library: koboldcpp_openblas.dll ========== Namespace(bantokens=None, blasbatchsize=256, blasthreads=8, config=None, contextsize=4096, debugmode=0, forceversion=0, foreground=False, gpulayers=0, highpriority=False, hordeconfig=None, host='127.0.0.1', launch=True, lora=None, model='sciphi-mistral-7b-32k.Q5_K_M.gguf', model_param='sciphi-mistral-7b-32k.Q5_K_M.gguf', multiuser=False, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=5001, preloadstory='', remotetunnel=False, ropeconfig=[0.0, 10000.0], skiplauncher=True, smartcontext=False, tensor_split=None, threads=8, useclblast=None, usecublas=None, usemlock=False) ========== Loading model: C:\NN\GPT\sciphi-mistral-7b-32k.Q5_K_M.gguf [Threads: 8, BlasThreads: 8, SmartContext: False, ContextShift: True] --- Identified as LLAMA model: (ver 6) Attempting to Load... --- Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead! System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from C:\NN\GPT\sciphi-mistral-7b-32k.Q5_K_M.gguf (version GGUF V3 (latest)) llm_load_vocab: special tokens definition check successful ( 259/32000 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = unknown, may not work llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 4.78 GiB (5.67 BPW) llm_load_print_meta: general.name = sciphi_sciphi-mistral-7b-32k llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.11 MiB llm_load_tensors: mem required = 4893.10 MiB .................................................................................................. Automatic RoPE Scaling: Using (scale:1.000, base:10000.0). 
llama_new_context_with_model: n_ctx = 4096 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_new_context_with_model: kv self size = 512.00 MiB llama_build_graph: non-view tensors processed: 740/740 llama_new_context_with_model: compute buffer total size = 147.06 MiB Load Model OK: True Embedded Kobold Lite loaded. Starting Kobold HTTP Server on port 5001 Please connect to custom endpoint at http://127.0.0.1:5001 Input: {"memory": "SYSTEM: The user will give you a number and you should reply with any other number that is not present anywhere in this conversation. Ignore anything in square brackets.\n", "prompt": "USER: 6589112875847360\nMODEL: 5221120297854\nUSER: 55737552258767924\nMODEL: 171147791412787\nUSER: 6473628829923952\nMODEL: 113445676260994\nUSER: 754687669217837\nMODEL: 748415888935305\nUSER: 2943794556524911\nMODEL: 64963611977926924\nUSER: 749251432169676\nMODEL: 284915117750771\nUSER: 23762778361960\nMODEL: 726973414288314\nUSER: 6294815116415302\nMODEL: 831728945258813\nUSER: 2416355559566672\nMODEL: 769856832167515\nUSER: 833505019083878\nMODEL: 428935819673835\nUSER: 196433518850818\nMODEL: 694466406946425\nUSER: 6693492952250681\nMODEL: 9886538762579598\nUSER: 31153191377989\nMODEL: 528992611081949\nUSER: 4948976292578274\nMODEL: 641324003629054\nUSER: 647151845961076\nMODEL: 473586293516108\nUSER: 694665244326346\nMODEL: 325243656438017\nUSER: 89149614037735700\nMODEL: 172925058341368\nUSER: 2584934113846547\nMODEL: 535438810876705\nUSER: 1489972515969589\nMODEL: 61313015685095210\nUSER: 7145163881248354\nMODEL: 912529620891338\nUSER: 1986969199340750\nMODEL: 392667333300659\nUSER: 66362586577268881\nMODEL: 745394151580354\nUSER: 5259222947359748\nMODEL: 924666859391499\nUSER: 4733666166492483\nMODEL: 62397363865373824\nUSER: 9752783678768986\nMODEL: 819016930961027\nUSER: 699319256229545\nMODEL: 152478105285539\nUSER: 258776952588553\nMODEL: 4275479082985044\nUSER: 243631276329791\nMODEL: 7736295332426448\nUSER: 2988314542928845\nMODEL: 563196513898948\nUSER: 8389394339524729\nMODEL: 594815822316908\nUSER: 418656346732644\nMODEL: 4587852316336852\nUSER: 549984151420547\nMODEL: 4697674727525023\nUSER: 654313914600816\nMODEL: 4742493919950903\nUSER: 1621996554049821\nMODEL: 6318661835040562\nUSER: 524955549679593\nMODEL: 5922217804767002\nUSER: 4656781654151793\nMODEL: 427839920435515\nUSER: 319211200347996\nMODEL: 192359974204076\nUSER: 4272035744957720\nMODEL: 124197499768913\nUSER: 189183048487357\nMODEL: 49136154247452141\nUSER: 16864237489539210\nMODEL: 515625417462163\nUSER: 572945385808785\nMODEL: 4481827414978663\nUSER: 3952258861878379\nMODEL: 486910226996774\nUSER: 5876569231906082\nMODEL: 698476162578063\nUSER: 251882928198813\nMODEL: 364671043585596\nUSER: 16975418640787\nMODEL: 4912821759117671\nUSER: 83272557195384\nMODEL: 9295745573112777\nUSER: 426978695244746\nMODEL: 961981996398428\nUSER: 2132463855293933\nMODEL: 5965897461137514\nUSER: 926999987766368\nMODEL: 442597274669536\nUSER: 815198772437867\nMODEL: 71233591097281\nUSER: 3153265826758165\nMODEL: 7384456628567550\nUSER: 6677311435351158\nMODEL: 734859195573746\nUSER: 64583116598439864\nMODEL: 256247833009836\nUSER: 589393213265962\nMODEL: 34739236277631673\nUSER: 434427170537359\nMODEL: 872904795759777\nUSER: 757893887877249\nMODEL: 18523307781765\nUSER: 897775123887549\nMODEL: 887517159447797\nUSER: 63582277633003\nMODEL: 876231948497173\nUSER: 2364421836534466\nMODEL: 255311527852424\nUSER: 56339479702334550\nMODEL: 534151378751267\nUSER: 
33034355770796210\nMODEL: 733691172768323\nUSER: 16211564648985\nMODEL: 47695153174872120\nUSER: 112631723916226\nMODEL: 1784638711295587\nUSER: 4129721463371851\nMODEL: 448679222951488\nUSER: 645333782391216\nMODEL: 155467133819039\nUSER: 8281313719284649\nMODEL: 3859701333774090\nUSER: 331125178101999\nMODEL: 92338960080935\nUSER: 651287714116996\nMODEL: 293788175446379\nUSER: 2854464637588240\nMODEL: 5924652888685462\nUSER: 724743617987274\nMODEL: 4916578932369290\nUSER: 49851798134521550\nMODEL: 527831739468322\nUSER: 1224278690343970\nMODEL: 5296913224356797\nUSER: 4258051597604860\nMODEL: 6469508497476612\nUSER: 754329054551748\nMODEL: 7158694381424794\nUSER: 751721111978474\nMODEL: 243816249188662\nUSER: 89119889554712\nMODEL: 848360204984756\nUSER: 4857527978860032\nMODEL: 6652830021988721\nUSER: 155323414839053\nMODEL: 839598492654557\nUSER: 5415381506664400\nMODEL: 643866384927243\nUSER: 749562904644214\nMODEL: 232050713935745\nUSER: 566816127332784\nMODEL: 4782566244729842\nUSER: 777657102314182\nMODEL: 45393179748428734\nUSER: 8171347836675464\nMODEL: 33282797279561\nUSER: 446588374499466\nMODEL: 945885207228935\nUSER: 4324566637175341\nMODEL: 782475600189972\nUSER: 999678232734222\nMODEL: 559986956532762\nUSER: 3578955240656490\nMODEL: 7367661787229494\nUSER: 6182794971652132\nMODEL: 5497383639689334\nUSER: 184188481168328\nMODEL: 684898728028691\nUSER: 710066863496316\nMODEL: 178561892522103\nUSER: 721286538628646\nMODEL: 6153492475927843\nUSER: 748671916689253\nMODEL: 57737673390369\nUSER: 2166581726046430\nMODEL: 914341311488645\nUSER: 767175911550435\nMODEL: 4862471794581228\nUSER: 4816153619387333\nMODEL: 6233658331291324\nUSER: 181463757661364\nMODEL: 31696266683155243\nUSER: 164894688181397\nMODEL: 313737856393608\nUSER: 26928122666465\nMODEL: 89985583667815\nUSER: 5957731164995686\nMODEL: 5188397246239\nUSER: 57542406129455\nMODEL: 586442737560384\nUSER: 133382969555969\nMODEL: 39754834854476331\nUSER: 768558798693312\nMODEL: 4356168414926503\nUSER: 4944886520099324\nMODEL: 5545782699010750\nUSER: 496868161177635\nMODEL: 21253571473618\nUSER: 852538164809696\nMODEL: 5768991864419517\nUSER: 71337773819459\nMODEL: 4841001018254991\nUSER: 766677272745725\nMODEL: 868253784796224\nUSER: 6919415487587923\nMODEL: 3835170330395410\nUSER: 4217188443239073\nMODEL: 7525358295705790\nUSER: 5751539646501820\nMODEL: 698562246650857\nUSER: 3381148171068\nMODEL: 939232606281694\nUSER: 4135792889197922\nMODEL: 96245578293568\nUSER: 26285978096562\nMODEL: 919733026233356\nUSER: 55287747514393\nMODEL: 691967756254688\nUSER: 4471579437738593\nMODEL: 762385496636856\nUSER: 923589972644988\nMODEL: 65959528338256\nUSER: 5477036817124270\nMODEL: 9475999990727340\nUSER: 696516972887026\nMODEL: 519449351602881\nUSER: 153697233971599\nMODEL: 115114517353716\nUSER: 754624618161429\nMODEL: 8329643672030112\nUSER: 749334762278195\nMODEL: 994992150647051\nUSER: 225233472858199\nMODEL: 543055627760971\nUSER: 67568833806256\nMODEL: 647739288869667\nUSER: 835826771426138\nMODEL: 94562158357263\nUSER: 695671316439557\nMODEL: 994539185333055\nUSER: 61313977567118212\nMODEL: 455090158550265\nUSER: 286579237057171\nMODEL: ", "stop_sequence": ["\n"], "genkey": "KCPP2261", "max_context_length": 4096, "max_length": 16, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "rep_pen_range": 1024, "rep_pen_slope": 0.7, "n": 1, "temperature": 0.85, "min_p": 0.25, "rep_pen": 1.1, "top_p": 1, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "quiet": true, "use_default_badwordsids": false} Processing Prompt [BLAS] (4080 / 4080 
tokens) Generating (15 / 16 tokens) (Stop sequence triggered: \n) ContextLimit: 4095/4096, Processing:338.39s (82.9ms/T), Generation:2.23s (148.8ms/T), Total:340.63s (0.04T/s) Output: 75843921656179 Input: {"memory": "SYSTEM: The user will give you a number and you should reply with any other number that is not present anywhere in this conversation. Ignore anything in square brackets.\n", "prompt": "MODEL: 171147791412787\nUSER: 6473628829923952\nMODEL: 113445676260994\nUSER: 754687669217837\nMODEL: 748415888935305\nUSER: 2943794556524911\nMODEL: 64963611977926924\nUSER: 749251432169676\nMODEL: 284915117750771\nUSER: 23762778361960\nMODEL: 726973414288314\nUSER: 6294815116415302\nMODEL: 831728945258813\nUSER: 2416355559566672\nMODEL: 769856832167515\nUSER: 833505019083878\nMODEL: 428935819673835\nUSER: 196433518850818\nMODEL: 694466406946425\nUSER: 6693492952250681\nMODEL: 9886538762579598\nUSER: 31153191377989\nMODEL: 528992611081949\nUSER: 4948976292578274\nMODEL: 641324003629054\nUSER: 647151845961076\nMODEL: 473586293516108\nUSER: 694665244326346\nMODEL: 325243656438017\nUSER: 89149614037735700\nMODEL: 172925058341368\nUSER: 2584934113846547\nMODEL: 535438810876705\nUSER: 1489972515969589\nMODEL: 61313015685095210\nUSER: 7145163881248354\nMODEL: 912529620891338\nUSER: 1986969199340750\nMODEL: 392667333300659\nUSER: 66362586577268881\nMODEL: 745394151580354\nUSER: 5259222947359748\nMODEL: 924666859391499\nUSER: 4733666166492483\nMODEL: 62397363865373824\nUSER: 9752783678768986\nMODEL: 819016930961027\nUSER: 699319256229545\nMODEL: 152478105285539\nUSER: 258776952588553\nMODEL: 4275479082985044\nUSER: 243631276329791\nMODEL: 7736295332426448\nUSER: 2988314542928845\nMODEL: 563196513898948\nUSER: 8389394339524729\nMODEL: 594815822316908\nUSER: 418656346732644\nMODEL: 4587852316336852\nUSER: 549984151420547\nMODEL: 4697674727525023\nUSER: 654313914600816\nMODEL: 4742493919950903\nUSER: 1621996554049821\nMODEL: 6318661835040562\nUSER: 524955549679593\nMODEL: 5922217804767002\nUSER: 4656781654151793\nMODEL: 427839920435515\nUSER: 319211200347996\nMODEL: 192359974204076\nUSER: 4272035744957720\nMODEL: 124197499768913\nUSER: 189183048487357\nMODEL: 49136154247452141\nUSER: 16864237489539210\nMODEL: 515625417462163\nUSER: 572945385808785\nMODEL: 4481827414978663\nUSER: 3952258861878379\nMODEL: 486910226996774\nUSER: 5876569231906082\nMODEL: 698476162578063\nUSER: 251882928198813\nMODEL: 364671043585596\nUSER: 16975418640787\nMODEL: 4912821759117671\nUSER: 83272557195384\nMODEL: 9295745573112777\nUSER: 426978695244746\nMODEL: 961981996398428\nUSER: 2132463855293933\nMODEL: 5965897461137514\nUSER: 926999987766368\nMODEL: 442597274669536\nUSER: 815198772437867\nMODEL: 71233591097281\nUSER: 3153265826758165\nMODEL: 7384456628567550\nUSER: 6677311435351158\nMODEL: 734859195573746\nUSER: 64583116598439864\nMODEL: 256247833009836\nUSER: 589393213265962\nMODEL: 34739236277631673\nUSER: 434427170537359\nMODEL: 872904795759777\nUSER: 757893887877249\nMODEL: 18523307781765\nUSER: 897775123887549\nMODEL: 887517159447797\nUSER: 63582277633003\nMODEL: 876231948497173\nUSER: 2364421836534466\nMODEL: 255311527852424\nUSER: 56339479702334550\nMODEL: 534151378751267\nUSER: 33034355770796210\nMODEL: 733691172768323\nUSER: 16211564648985\nMODEL: 47695153174872120\nUSER: 112631723916226\nMODEL: 1784638711295587\nUSER: 4129721463371851\nMODEL: 448679222951488\nUSER: 645333782391216\nMODEL: 155467133819039\nUSER: 8281313719284649\nMODEL: 3859701333774090\nUSER: 331125178101999\nMODEL: 92338960080935\nUSER: 
651287714116996\nMODEL: 293788175446379\nUSER: 2854464637588240\nMODEL: 5924652888685462\nUSER: 724743617987274\nMODEL: 4916578932369290\nUSER: 49851798134521550\nMODEL: 527831739468322\nUSER: 1224278690343970\nMODEL: 5296913224356797\nUSER: 4258051597604860\nMODEL: 6469508497476612\nUSER: 754329054551748\nMODEL: 7158694381424794\nUSER: 751721111978474\nMODEL: 243816249188662\nUSER: 89119889554712\nMODEL: 848360204984756\nUSER: 4857527978860032\nMODEL: 6652830021988721\nUSER: 155323414839053\nMODEL: 839598492654557\nUSER: 5415381506664400\nMODEL: 643866384927243\nUSER: 749562904644214\nMODEL: 232050713935745\nUSER: 566816127332784\nMODEL: 4782566244729842\nUSER: 777657102314182\nMODEL: 45393179748428734\nUSER: 8171347836675464\nMODEL: 33282797279561\nUSER: 446588374499466\nMODEL: 945885207228935\nUSER: 4324566637175341\nMODEL: 782475600189972\nUSER: 999678232734222\nMODEL: 559986956532762\nUSER: 3578955240656490\nMODEL: 7367661787229494\nUSER: 6182794971652132\nMODEL: 5497383639689334\nUSER: 184188481168328\nMODEL: 684898728028691\nUSER: 710066863496316\nMODEL: 178561892522103\nUSER: 721286538628646\nMODEL: 6153492475927843\nUSER: 748671916689253\nMODEL: 57737673390369\nUSER: 2166581726046430\nMODEL: 914341311488645\nUSER: 767175911550435\nMODEL: 4862471794581228\nUSER: 4816153619387333\nMODEL: 6233658331291324\nUSER: 181463757661364\nMODEL: 31696266683155243\nUSER: 164894688181397\nMODEL: 313737856393608\nUSER: 26928122666465\nMODEL: 89985583667815\nUSER: 5957731164995686\nMODEL: 5188397246239\nUSER: 57542406129455\nMODEL: 586442737560384\nUSER: 133382969555969\nMODEL: 39754834854476331\nUSER: 768558798693312\nMODEL: 4356168414926503\nUSER: 4944886520099324\nMODEL: 5545782699010750\nUSER: 496868161177635\nMODEL: 21253571473618\nUSER: 852538164809696\nMODEL: 5768991864419517\nUSER: 71337773819459\nMODEL: 4841001018254991\nUSER: 766677272745725\nMODEL: 868253784796224\nUSER: 6919415487587923\nMODEL: 3835170330395410\nUSER: 4217188443239073\nMODEL: 7525358295705790\nUSER: 5751539646501820\nMODEL: 698562246650857\nUSER: 3381148171068\nMODEL: 939232606281694\nUSER: 4135792889197922\nMODEL: 96245578293568\nUSER: 26285978096562\nMODEL: 919733026233356\nUSER: 55287747514393\nMODEL: 691967756254688\nUSER: 4471579437738593\nMODEL: 762385496636856\nUSER: 923589972644988\nMODEL: 65959528338256\nUSER: 5477036817124270\nMODEL: 9475999990727340\nUSER: 696516972887026\nMODEL: 519449351602881\nUSER: 153697233971599\nMODEL: 115114517353716\nUSER: 754624618161429\nMODEL: 8329643672030112\nUSER: 749334762278195\nMODEL: 994992150647051\nUSER: 225233472858199\nMODEL: 543055627760971\n[3253185356215361]\nUSER: 67568833806256\nMODEL: 647739288869667\nUSER: 835826771426138\nMODEL: 94562158357263\nUSER: 695671316439557\nMODEL: 994539185333055\nUSER: 61313977567118212\nMODEL: 455090158550265\nUSER: 286579237057171\nMODEL:75843921656179\nUSER: 47657703128506820\nMODEL: ", "stop_sequence": ["\n"], "genkey": "KCPP2261", "max_context_length": 4096, "max_length": 16, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "rep_pen_range": 1024, "rep_pen_slope": 0.7, "n": 1, "temperature": 0.85, "min_p": 0.25, "rep_pen": 1.1, "top_p": 1, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "quiet": true, "use_default_badwordsids": false} [Context Shifting: Erased 58 tokens at position 39] Processing Prompt [BLAS] (237 / 237 tokens) Failed to predict! Check your context buffer sizes! Output: ```
LostRuins commented 7 months ago

Hi @aleksusklim, yes indeed I can repro your issue. It is actually a bit of an interesting situation.

This is the state of the KV cache from your example. kv_cache.txt

Notice the lines with pos = -1, which represent empty (unused) cells. Although there are enough of them for a batch of size 237, the batch cannot fit into the context because the empty space is fragmented across two places. This happened because you manually removed a different amount of tokens from the start of the prompt yourself instead of allowing the fast forwarder to do it for you.

But yes, this is kind of undesirable behavior which doesn't really have a very good solution. What I would recommend is to set your max_context_length to less than 4096 while keeping the --contextsize at 4096. Perhaps setting max_context_length to about 3700 should work.

What I might do to reduce such occurrences in future would be to automatically allocate a bit of additional KV cache space, so it's more resilient against fragmentation.
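
In other words, something like this on the client side (just a sketch; the numbers are the ones from this thread):

```js
// Sketch of the workaround: keep the server at --contextsize 4096, but have
// the client request a slightly smaller window, so a few hundred KV cells
// stay free as slack against fragmentation after context shifting.
function buildPayload(prompt) {
  return {
    prompt,                    // the (possibly rotated) history text
    max_context_length: 3700,  // below the server's --contextsize of 4096
    max_length: 16,
  };
}
```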

aleksusklim commented 7 months ago

yes indeed I can repro your issue

Oh, thank you! I was afraid that I had already become annoying to you ))

Although there are enough for the batch of size 237, the batch cannot fit

What is "batch" here, the amout of tokens it has to process from my prompt? Can it do that twice? Or for as much as it could, fitting each minibatch to the next continuous stride.

Isn't that the problem with llama.cpp upstream too?

This happened because you manually removed a different amount of tokens from the start of the prompt yourself

Are you sure this is the reason, and not that my code also inserts a dummy line near the end of the history, simulating "editing of the previous turn"?

instead of allowing the fast forwarder to do it for you.

If you are matching the cached context against the user's new prompt, then it should not matter whether the prompt was truncated (from the beginning) or not: even if it was truncated, as long as a true match is found, your code should behave just as if it had truncated it on its own. Why is it different?

I mean, suppose the history was MEM 123 abc 456 def 789 and the new prompt is MEM 456 def 000 xyz; then the algorithm is:

  1. Match the maximal amount of identical tokens from the start as the Memory (here, MEM); but if the dedicated field was set, use only that without guessing.
  2. Search for the largest prefix of the rest of the new prompt inside the old history (here, 456 def).
  3. Shift out everything between the memory and the found offset (here, 123 abc).
  4. Discard everything after the matched region in the old history (here, 789).
  5. Process the rest of the new prompt (here, 000 xyz).

Is this how it works? Or are there other tricky cases that make one of these steps impossible? To me, this algorithm looks very robust.
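
Roughly how I picture it, on plain arrays of token ids (just my mental model with made-up names, not koboldcpp's actual code):

```js
// Sketch: given the old cached token array and the new prompt's token array,
// figure out what can be reused and what must be reprocessed.
function planReuse(oldCtx, newPrompt) {
  // 1. Longest common prefix = the "memory" part that stays untouched.
  let mem = 0;
  while (mem < oldCtx.length && mem < newPrompt.length &&
         oldCtx[mem] === newPrompt[mem]) mem++;

  // 2. Largest chunk of the remaining new prompt that already exists
  //    somewhere in the old history after the memory.
  let bestStart = mem, bestLen = 0;
  for (let start = mem; start < oldCtx.length; start++) {
    let len = 0;
    while (start + len < oldCtx.length && mem + len < newPrompt.length &&
           oldCtx[start + len] === newPrompt[mem + len]) len++;
    if (len > bestLen) { bestLen = len; bestStart = start; }
  }

  return {
    keepMemory: mem,                  // step 1: tokens 0..mem-1 stay as-is
    shiftOut: [mem, bestStart],       // step 3: erase this old range from the cache
    discardFrom: bestStart + bestLen, // step 4: drop the old tail after the match
    reprocessFrom: mem + bestLen,     // step 5: only newPrompt[mem+bestLen..] needs BLAS
  };
}
```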

In reality, I still don't understand what the model feels when its memory is shifted. Does it "see" the gap? Imagine a 16k context with 1k of memory and 3k of active history at the end: will the model "understand" that there were 12k of "something" it can no longer comprehend, or would it see just 4k as a direct concatenation? (Testing this directly is hard since we don't have a way to save/load contexts yet…)

Perhaps setting max_context_length to about 3700 should work.

What does it even do? How is it used in the calculations, from a logical point of view? I thought it was only for trimming the text in Lite (which is inherently inaccurate, since Lite cannot tokenize properly when trimming). The server ignores the value when it is larger (no extra memory is allocated), right? But what happens if it is less?

I had asked you about it in this very conversation earlier:

Your client (browser) knows the context length too, and can adjust it. For example, your server may be running 4096 while your client is still on 2048 tokens. (Why? Isn't it better to always resort to the server value? What is the point in having less context size in Lite?) You should always open Settings and move the Max Ctx. Tokens to the right, making sure numbers above and below its right part are equal.