Open aleksusklim opened 9 months ago
Follow-up: is it technically possible to "download" the current context history from the koboldcpp server? As text (rendering happens server-side, if the history is stored as tokens in memory).
If so, can we have an action (somewhere in the settings window of Lite) to replace the current history in the browser with a fresh copy from the server? This is needed for three cases:
1) If the browser lagged badly and somehow destroyed the history (for example, when the tab gets discarded due to low-RAM conditions; I saw that myself! The history was blank after an unexpected tab reload mid-generation; then I grabbed whatever was printed on the console and stuffed it into the conversation .json file; that worked, and the context cache was not regenerated).
2) If the browser was disconnected before the result could be sent. The current way to recover the model response is to copy it from the console manually. Simple case: a tab reload (due to low RAM again) or a browser restart mid-generation.
3) If the user accidentally messed up the history, which would lead to a cached-context mismatch and full regeneration. Downloading the actual history ensures that the next submit won't lead to BLAS re-evaluation. This might also be useful when changing modes (from Story to Instruct and back), to be sure that no stray non-stripped spaces lead to a context mismatch again.
Cautions:
Overall, this would be super-useful together with my original on-topic suggestion about caching contexts to files: then you could load a binary context from a file when starting the server, and then get its actual text in the browser to continue flawlessly!
This is probably not very suitable for KoboldCpp: the context would be massive (a few gigabytes), and the cost of writing it to disk from memory, or storing it somewhere, would be prohibitive and slow for something that becomes useless shortly after. Have you tried using GPU acceleration? Processing speeds increase greatly when using CuBLAS or CLBlast compared to without.
This could also be beneficial for group chats where all tokens are reprocessed every round. Or am I doing something wrong?
the context would be massive (few gigabytes) and the cost of writing it to disk from memory or storing it somewhere will be prohibitive and slow
Then it could be rewritten partially, so that only the last response is written at the end of the file from the proper offset… Now I understand that it would require quite a massive amount of code to implement properly, along with a custom file format.
You think it is not worth it?
(Yes, I know about CLBlast, but I have different machines, including weak 16 Gb RAM + 4 Gb VRAM where full re-evaluation is a pain!)
I too think this feature could be very useful for huge models. The implementation could be like this: a -usesavecache parameter at startup = save the cache (keyed by the current model name) at exit (by Ctrl+C). When restarting the program, check by model name, and if the cache and model names coincide, load the saved cache. This would not load the system at each request, but the speed of the first response would increase SIGNIFICANTLY.
save cache (by current model name) at exit (by Ctrl+C).
This won't help with a system crash: the cache won't be saved. I'd rather opt in to saving the cache before each generation, unless nothing has changed (e.g. regeneration of the last query).
check by model name and if cache and model names coincide, then load the saved cache
It is not enough to check only the name; rather, metadata evaluation would be required (e.g. mismatched koboldcpp versions if the cache was left over from an old one). And still, the file format would have to be defined and implemented, which is not trivial.
I've checked LM Studio (https://lmstudio.ai/), and it looks like they've implemented full cache saving at runtime for each dialog separately, displaying the used disk space.
save cache (by current model name) at exit (by Ctrl+C).
This won't help with a system crash: the cache won't be saved. I'd rather opt in to saving the cache before each generation, unless nothing has changed (e.g. regeneration of the last query).
Yes, but to be honest, my KoboldCpp doesn't crash most of the time. It's a very stable program. So some light tuning is certainly desirable, but writing the cache after each request would cost too much, IMHO.
check by model name and if cache and model names coincide, then load the saved cache
It is not enough to check only the name; rather, metadata evaluation would be required (e.g. mismatched koboldcpp versions if the cache was left over from an old one). And still, the file format would have to be defined and implemented, which is not trivial.
Well, yes, but the details of implementation are best left to the author :)
I didn't say that it's koboldcpp that is crashing, but my system! On one particular low-RAM machine I experience hangs and crashes after using koboldcpp (or any other engine that runs GGML models).
For example, I can load a 13B model with 16 GB of RAM just fine (taking, say, 15.7/16.0 GB used in total by all processes), then close koboldcpp, and then get a system hang after opening Firefox, because something in memory was already corrupted! (I've already tried changing my RAM, booting from another Windows disk, using Safe Mode, etc.; previously it happened with 7B models on 8 GB of RAM.) Or it can crash after my workstation locks due to inactivity, being unable to log in again while the model is loaded in RAM.
I know that this is a completely different problem, but having the context cached would make those crashes not so painful.
On the other hand, on my high-RAM machine (64 GB plus 12 GB VRAM), not only does it not crash whatsoever, but the full context re-evaluation is pretty fast. So, yeah… "Just get a higher-tier PC" is the solution here too.
On the other hand, on my high-RAM machine (64 GB plus 12 GB VRAM), not only does it not crash whatsoever, but the full context re-evaluation is pretty fast. So, yeah… "Just get a higher-tier PC" is the solution here too.
Starting with the 70B models, even on a "higher-tier PC" the first response at a context size of 4K takes too long. And that's not the limit: GGUF models support context sizes up to 16K, and such a context may require some solution even for a 13B model. In general, we must admit that the problem does exist.
From the perspective of a simple user, it "feels like" a lot more time is being spent on "prompt ingestion", "prompt processing", or whatever the first step is before the actual text generation begins.
The actual text generation speed, with streaming enabled, never "feels like" it's slow, because I can spend time actively reading the text while it is being generated.
However, if I change tasks, scenarios, or characters, or make any edit in the chat history, then the next generation always has a very noticeable delay, as if the process were hanging and frozen. It always "feels slow", even with cuBLAS, OpenBLAS, CLBlast, or whatever magical accelerator backend I choose. If the context or history is changed, the beginning of the next generation will always "feel slow".
I don't know how this could be solved or improved, but it would be good if some miracle happened. lol
Might be related: [Enhancement] Multiple instances of koboldcpp at a time using the same model, also using different models simultaneously. #163
[Question] Confused about how / why full BLAS processing happens #386
When editing response, reprocessing prompt length varies randomly #385
In the next version, there will be a brand new feature for context shifting that should allow you to significantly reduce processing time.
In the next version, there will be a brand new feature for context shifting that should allow you to significantly reduce processing time.
Thank you very much! That would be great. :)
I can reply to this from what I know. (Let's pretend that the new "context shifting" does not exist yet; still, everything I'll say will be useful!)
Notes:

1. The console shows your actual usage as ContextLimit: <used>/<total>. In Lite, drag Max Ctx. Tokens to the right, making sure the numbers above and below its right part are equal.
2. Amount to Gen. is actually paid from the total context size, because it reserves some room for the model to answer. For example, if your context is 2048 but you set Amount to Gen. = 512, then your effective context for history will be 1536, even if the model stops early and never writes that much text.

What will happen if you run out of context?

3. When the context has no room for the next line, koboldcpp will trim the history from the beginning. This means that some of your earlier turns will effectively disappear from the memory completely. The model would never see them from now on.
4. If you have anything in the Memory tab, then it will be preserved as the beginning of the history. Actually, Memory is just a way to split the history into "keep this always" and "delete if needed" parts, so whenever you run out of context only a few of the topmost lines are deleted (so there would be a gap between "memory" and "dialogue", as if a book was missing a page right after the introduction of the characters).

What can you do about it?

3. You can enable Use SmartContext when starting the server. It will split the history for you the next time you run out of context (and otherwise won't hurt if you'll never overflow). Actually, it will simply cut the history in two, deleting the first half (and preserving the Memory), so that each time you will continue from half-empty context window. This way you could more or less play normally, but experience full evaluations only occasionally. Just be warned that the model might forget many of older events.
4. Increase Context Size. Personally, I see no quality degradation for llama2 models if I run them with 8192 (no Custom RoPE Config!) instead of the native 4096 right from the beginning. But if you do, fine, play with 4096 and restart at 8192 when you run out of context. This way you will double your window for free, and the model will still write coherent text because it has a good history to mimic. If you hit 8192, increase to 12288; it will still work, as long as you continue playing and don't start a new game (where it would be noticeably worse). If you hit 12288, you can try going to 16384, but unfortunately you will have to regenerate a lot, since the degradation becomes obvious.

To protect yourself from an accidental context reevaluation, you can:
- Make sure Trim Sentences is unchecked in the Advanced tab.
- Use <|system|> messages, which you can embed at will (for example, by adding one to the end of the last output) and delete when they are no longer needed (this will re-evaluate everything below your edit, of course).

As for my own confusion: LostRuins, can you explain why we see a rough token budget instead of a precise token counter and context size?
And also, what's the point of adjusting Max Ctx. Tokens in Lite, if the server will ignore too large a value, and there is no practical point in setting it lower than what is currently available in koboldcpp? Or is there?
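The budget arithmetic from note 2 above can be written down as a one-liner (a sketch; the function name is mine, not Lite's internals):

```javascript
// Effective room left for memory + history, after reserving space for the reply.
// Assumption: contextSize and amountToGen are token counts, as in Lite's sliders.
function effectiveHistoryTokens(contextSize, amountToGen) {
  return contextSize - amountToGen;
}
```

For the example in note 2, effectiveHistoryTokens(2048, 512) gives 1536 tokens of usable history.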
3. You can enable
Use SmartContext
when starting the server. It will split the history for you the next time you run out of context (and otherwise won't hurt if you'll never overflow). Actually, it will simply cut the history in two, deleting the first half (and preserving the Memory), so that each time you will continue from half-empty context window. This way you could more or less play normally, but experience full evaluations only occasionally. Just be warned that the model might forget many of older events.
I'm interested in how this mechanism works now. Character descriptions are usually at the beginning of the context, and if the --smartcontext parameter cuts off the upper half of the context and moves the lower half there, it would seem that the model then knows nothing about the characters. (In reality, this is not quite true.) I would just like to understand how the mechanism works, and the new one as well.
When you have an already established good history, you can switch the model to a completely different one – and most likely it would continue in the same spirit as before, because it adjusts and tries to mimic, even if it didn't know the correct chat/instruction format that you've used.
- Everything you put in Memory will stay at the top. This is implicitly used when you load a character JSON file (just check the memory and you'll see it).
In the Kobold Lite client - maybe, but what about developers of third-party applications that use API access? Here is the structure I use:
var parameters = new
{
n = 1,
max_context_length = maxTokenCount,
max_length = replyTokenCount, // 300
rep_pen = 1.19,
temperature = 1.1,
top_p = 1.0, // disabled
top_k = 0,
top_a = 0,
typical = 1.0, // disabled
tfs = 0.95,
mirostat = 2, // type
mirostat_tau = 0.5,
mirostat_eta = 0.1,
rep_pen_range = maxTokenCount,
rep_pen_slope = 1.10,
sampler_order = new int[] { 6, 0, 1, 2, 3, 4, 5 }, // default [6,0,1,3,4,2,5]
prompt = myPrompt,
quiet = true,
stop_sequence = stop_seq
};
The application is passed the whole prompt as-is ( prompt = myPrompt ) and then works with it as it wants. I can double the context window and insert character descriptions after the first half, so that part is definitely handled by the model. But I'd still like to know exactly how --smartcontext works :)
I think it's in https://github.com/LostRuins/koboldcpp/blob/6a4d9c26e1eeb2119171b3ea21444d940c7a8a14/model_adapter.cpp#L393 and below:
…
const int SCCtxLenThreshold = nctx * 0.8; //how much context length must be reach to trigger smartcontext
const int SCInpLenThreshold = nctx * 0.6; //how big must the input array be to trigger smartcontext
const int SCPastLenThreshold = nctx * 0.5; //how wide of a gap between the fast forwarded past and the present to trigger smart context
const float SCTruncationRatio = 0.5; //ratio for how many tokens to fast forward
const int SCTokThreshold = 32 + (nctx*0.05); //how many tokens of similarity triggers smartcontext
…
//smart context mode, detect if we have a shifted context at max length
//requirement: previous context was at least nctx/2 longer than current,
//mode is on, and current context already maxed.
…
//determine longest common substring after removing start part
int shiftamt = embd_inp.size() * SCTruncationRatio;
smartcontext = std::vector<int>(embd_inp.begin() + shiftamt, embd_inp.end());
printf("\n[New Smart Context Triggered! Buffered Token Allowance: %d]",shiftamt);
…
//if max ctx length is exceeded, chop the prompt in half after the start part, and memorize it. The memorized part becomes LCS marker.
//when a future prompt comes in, find the LCS again. If LCS > a length and LCS starts with memorized LCS
//remove all tokens between start part and start of LCS in new prompt, thus avoiding shift
//if LCS not found or mismatched, regenerate. chop new prompt and repeat from step B
…
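Stripped to its core, the chop step quoted above does something like this (a JS paraphrase for illustration only; the real code operates on llama token arrays in C++):

```javascript
// SCTruncationRatio = 0.5 from the snippet above: fast-forward past half the input.
// Assumption: embd_inp is the tokenized prompt as a plain array of token ids.
const SCTruncationRatio = 0.5;

function smartContextChop(embd_inp) {
  const shiftamt = Math.floor(embd_inp.length * SCTruncationRatio);
  // The kept second half is memorized as the LCS marker, so a future prompt
  // that still contains it can be fast-forwarded instead of reprocessed.
  return embd_inp.slice(shiftamt);
}
```

So an 8-token prompt keeps its last 4 tokens, and those 4 become the marker searched for in the next prompt.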
3. When the context has no room for the next line, koboldcpp will trim the history from the beginning. This means that some of your earlier turns will effectively disappear from the memory completely. The model would never see them from now on.
4. If you have anything in the Memory tab, then it will be preserved as the beginning of the history. Actually, Memory is just a way to split the history into "keep this always" and "delete if needed" parts, so whenever you run out of context only a few of the topmost lines are deleted (so there would be a gap between "memory" and "dialogue", as if a book was missing a page right after the introduction of the characters).

3. You can enable Use SmartContext when starting the server. It will split the history for you the next time you run out of context (and otherwise won't hurt if you'll never overflow). Actually, it will simply cut the history in two, deleting the first half (and preserving the Memory), so that each time you will continue from half-empty context window. This way you could more or less play normally, but experience full evaluations only occasionally. Just be warned that the model might forget many of older events.

I'm interested in how this mechanism works now. Character descriptions are usually at the beginning of the context, and if the --smartcontext parameter cuts off the upper half of the context and moves the lower half there, it would seem that the model then knows nothing about the characters. (In reality, this is not quite true.) I would just like to understand how the mechanism works, and the new one as well.
Thanks for the super detailed explanation! @aleksusklim
If "Memory" and SmartContext work this way with the default KoboldAI Lite UI, then as a simple user I would very much like to know what I can do to use this feature in SillyTavern. I mean, without coding, can I do anything to tell koboldcpp to treat my system prompt and characters as "Memory", and indicate somehow where the actual chat log starts? Maybe with a modded template?
Is there any simple specific character that indicates the end of the "Memory" field for koboldcpp? Like, for example, the first |> or the first ], or anything like that? @LostRuins
So far I have always avoided using SmartContext, because I was very unsure about what it does, and I absolutely wanted to be sure that the start of my prompt (including the system task, world lore, and character descriptions) would remain intact in place, at the start of my prompt.
To prevent characters from showing Alzheimer's-like symptoms and forgetting their own personality, it is super important to preserve the system prompt and character descriptions, especially if the characters are made up by my own imagination and not well known to the language model.
2. The model is quite capable of continuing whatever dialogue you have so far. Even without any explicit description of your characters, the model can figure out their names, gender and species, approximate age, personality, and writing style. It is trained to continue no matter what, without asking "Hey, who is that guy, and how did he appear in this story?"
Yes, of course it can just go on and continue whatever is thrown at it (as an overly smart text auto-completion trick), but for an RPG game it is more important that the system task and the character descriptions are kept in place, so nothing acts out of character (even if past events/turns get forgotten).
In the next version, there will be a brand new feature for context shifting that should allow you to significantly reduce processing time.
I've just read https://github.com/mit-han-lab/streaming-llm, and this is indeed very promising! Much better than SmartContext and plays nicely with Memory too.
It's a game changer both for this Issue (theoretical context caching) and that Issue: https://github.com/LostRuins/koboldcpp/issues/492 (having too long history in browser).
ContextShifting is Now live in v1.48. Please try.
ContextShifting is Now live in v1.48. Please try.
At first impression, the result is much more efficient than SmartContext. But it doesn't solve the "first response" problem: the existing context has to be processed completely, which takes time. A context cache could solve that. Also, seeing "[Context Shifting: Erased 25 tokens at position 2]", I would like to be able to set the starting position, to preserve a custom "Memory". Maybe we should introduce a special tag "[KobMemEnd]" to define such a position? The user, having inserted it, would be sure that everything before this tag will not be erased.
it doesn't solve the "first response" problem
to save a custom "Memory". Maybe we should introduce a special tag "[KobMemEnd]"
Ultimately, this will not resolve the slow initial load problem even if we cache everything in Memory, because, generally speaking, the Memory is not as large as the main dialogue history.
Caching should support arbitrary lengths, rolled or not.
By the way, upstream main.exe has its own prompt cache. Maybe its format can be adopted as-is, by copy-pasting the relevant code? (I didn't look at it myself, but I've used those cache files when the --prompt-cache option was first introduced.)
The user, having inserted it, will be sure that everything set before this tag will not be erased.
Having a tag {{[MEMORY]}} (in line with {{[INPUT]}} and {{[OUTPUT]}}) would indeed help third-party clients easily support the memory feature without defining it explicitly.
ContextShifting is Now live in v1.48. Please try.
@LostRuins, I tried 1.48.1, and it looks like I don't understand something.
I've set Context Size in the GUI to 256 (the minimum available). I've set Max Ctx. Tokens in Lite to 256 as well. And I've set Amount to Gen. to 64.
I'm in Adventure Mode without an Adventure Prompt, and have put the MEM string into the Memory tab. I'm sending > test as actions and watching the console.

I see that:
- The console still shows Processing Prompt [BLAS] after an overflow. But I'm sure it said SmartContext: False, ContextShift: True during initialization.
- The memory is sent inside the "prompt": "MEM\n\n…" field, without specifying it as an explicit immutable part. How is it supposed to work?

Doesn't the server need either of:
As for truncation, I think the text processing should be something like:
By the way, for me the Trim Sentences checkbox in Lite still trims the incomplete model output even when unchecked! (The console shows more text than gets into the browser.) Or what should it do?
Since this post contains several semi-unrelated observations, I can open a new Issue for each of them if you want, providing screenshots or console outputs.
As for Trim Sentences, make sure you're in story mode to receive the raw output. Adventure mode will always format to sentences.
2. The memory is determined automatically by testing both the old and new contexts for matching tokens at the start, and fast forwarding that portion (considered as memory). Thereafter, there's a round of context shifting to remove discarded tokens from the old context, followed by a second fast forwarding to skip past the remainder of the matched similarity tokens. Please refer to this function: https://github.com/LostRuins/koboldcpp/blob/concedo/gpttype_adapter.cpp#L592
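The matching described above can be paraphrased like this (a simplified sketch over plain token arrays; the field names are mine, and the real logic in gpttype_adapter.cpp handles more edge cases):

```javascript
// Given the old (cached) token sequence and the new prompt, detect:
//  - the common prefix (treated as "memory" and fast-forwarded), and
//  - a span of k tokens right after it that was trimmed away, to be erased
//    via context shifting so the remaining match fast-forwards too.
function planContextShift(oldTokens, newTokens) {
  let prefix = 0;
  while (prefix < oldTokens.length && prefix < newTokens.length &&
         oldTokens[prefix] === newTokens[prefix]) prefix++;

  // Try interpreting the difference as "k tokens were removed after the memory".
  // Worst case (k = everything after the prefix) degenerates to reprocessing
  // the whole tail, so the loop always finds an answer.
  for (let k = 0; k <= oldTokens.length - prefix; k++) {
    const tail = oldTokens.slice(prefix + k);
    if (tail.every((tok, i) => newTokens[prefix + i] === tok)) {
      return { memoryLen: prefix, erased: k, at: prefix };
    }
  }
  return null; // not reached: an empty tail always matches
}
```

For oldTokens = [1,2,3,4,5,6,7] and newTokens = [1,2,3,6,7,8,9] this yields { memoryLen: 3, erased: 2, at: 3 }, roughly corresponding to a log line like "[Context Shifting: Erased 2 tokens at position 3]".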
That's a great idea, except it doesn't work for some reason: it's always [Context Shifting: Erased (num) tokens at position 2]. And furthermore: what happens if I want to change the memory? Add a new character, change something about an old one? The memory area should be specified explicitly, IMHO.
Token truncation is done at word and sentence boundaries.
I see. I tried to propose iterative attempts to cut: given a safe length, first try at newlines, then at sentence boundaries, and finally at words or characters.
For example, substring_to_boundary can be implemented like this:
function substring_to_boundary(input_string, maxlen)
{
if(input_string.length <= maxlen)
{
return input_string;
}
else
{
let cutoff = input_string.length - maxlen;
let trim = input_string.substring(cutoff);
let safe_limit = Math.min(300, Math.floor(maxlen/20)); // no more than 300 chars or 5% of max length
let match_lines = trim.match(/^.*?\n+\s*/);
if(match_lines && match_lines[0].length <= safe_limit)
{
trim = trim.substring(match_lines[0].length-1); // always do -1 to include leading token!
}
else
{
let match_sentences = trim.match(/^.*?([.!?*")}`\];]+\s*)/);
if(match_sentences && match_sentences[0].length <= safe_limit)
{
trim = trim.substring(match_sentences[0].length-1);
}
else
{
let match_words = trim.match(/^.*?[,\s]+\s*/);
if(match_words && match_words[0].length <= safe_limit)
{
trim = trim.substring(match_words[0].length-1);
} // if unable to trim safely, do not trim
}
}
return trim;
}
}
This version makes up to three regexp matches (only as needed). Each one matches everything from the start of the string up to (and including) the first "separator" found. The first time, the separator set contains only newlines; the second time, the set is the same as yours but without spaces; and finally, it is a space and a comma. Consecutive separators are matched together, along with any occasional spaces directly after them. (Your version always splits at the first separator found.)
Since you explicitly return the separator itself as part of the string, I substring at "matched length minus one", since all of those separators and spaces are one character long.
You set the "trim safely" threshold to only 20 characters, which is too low for full-line matching! I propose something as high as 300, but capped relative to the allowed maxlen, for example at 5%, so it would be min(300, maxlen/20).
Example results:
var str = 'Once upon a time... in a Wonderland...\n\nThere was a noble warrior. He fought the darkness... Never giving up and always protecting innocent people in the kingdom of '+'X'.repeat(800);
console.log(str.length); // 965
substring_to_boundary(str,930);
'\nThere was a noble warrior. He fought <…>'
substring_to_boundary(str,890);
' Never giving up and always protecting <…>'
substring_to_boundary(str,850);
' protecting innocent people in the kingdom <…>'
substring_to_boundary(str,800);
'XXXXXXXXXXX<…>'
I don't want to make a PR; just take my code fully or partially, if you agree with me of course.
By the way, what if such "text-level" trimming (yours or mine) actually tokenizes differently than the original full text? Wouldn't that cause trouble?
@Vladonai it's actually working, but I think the problem is my token estimator is off, and the memory itself is being partially truncated too, causing poor results. I'll reduce the tokenizer limit and that should fix the issue. If you wanna test again, connect to koboldcpp as a custom endpoint from https://lite.koboldai.net and see if it works now (you can use the same koboldcpp)
@aleksusklim I think one problem with that approach is that it's very inconsistent. The amount that gets removed will vary wildly, and this may lead to even more unpredictable results. It's something to keep in mind, but I'm not sure it's the best approach.
And yes, if the trimming ever causes different tokenizing results, then everything will have to reprocess from scratch. Trimming at string boundaries is pretty safe though, since llama never has a space in the middle of a token.
It also doesn't kick in if the estimated processing length is < 256 tokens. Try conducting your experiments with 512 context length instead.
All right, I've repeated with 512, and then with 4096 to be double-sure.
So far, this is what I see:
- After an overflow, the console prints [Context Shifting: Erased XXX tokens at position 5] (I have only MEM\n in Memory), and the ContextLimit values update.

So it can't backtrack from the end? This means the user cannot edit his last turns reliably, especially the model's previous reply after the player has already taken his next turn.
If needed, I can prepare exact steps to reproduce.
P.S. Why, when I've set 4096 in both the server and Lite, and Amount to Gen. to 512 (which should give 3584 for history), do I see BLAS for 2306 in the console after regeneration? Are you estimating tokens!?
As if Lite doesn't know the actual tokens, estimates them pessimistically, cuts the text, and sends it to the server?
In this case, it would indeed be better to have a dedicated field for Memory (or a textual marker), so that Lite could send an optimistic cut (larger than the context length) and koboldcpp could trim it properly token-wise.
(I tried naively increasing Max Ctx. Tokens in Lite to 8096: it sent 5920 for BLAS, which took its time to process, but then: Failed to predict! Check your context buffer sizes! But why BLAS it in the first place? At best it should shift the context as I just described; at worst, refuse to even try.)
@Vladonai it's actually working, but I think the problem is my token estimator is off, and the memory itself is being partially truncated too, causing poor results. I'll reduce the tokenizer limit and that should fix the issue. If you wanna test again, connect to koboldcpp as a custom endpoint from https://lite.koboldai.net and see if it works now (you can use the same koboldcpp)
I tried it, and now it works. But the question of modifying the history remains open. In the current implementation you can't add anything new to it, and you can't modify it either; that would lead to wrong results.
Lite should somehow highlight the parts that got cut (in editable mode), so the user would understand what the model has already forgotten.
It looks like this is not trivial, because it is koboldcpp that shifts the context, silently from Lite's point of view.
(Editing Memory would either give wrong results or should trigger re-evaluation, since there is no way to change something in the middle of the context. Well, truncating Memory from the end probably ought to work, though; it might be shifted in with the new context.)
In the next version I plan to do that: submit memory and prompt separately, so it'll be easier to match against.
For now, if you want to see the context, use --debugmode.
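Once memory travels separately, a request body might look something like this (a purely hypothetical shape; the memory field name is my assumption based on the plan above, not a confirmed API at the time of this thread):

```javascript
// Hypothetical payload with memory split out of the prompt, so the server
// could treat it as the immutable head during context shifting.
const payload = {
  memory: "MEM\n",       // preserved verbatim at the top of the context
  prompt: "> test\n",    // only the shiftable dialogue part
  max_context_length: 4096,
  max_length: 512,
};

// A client would then POST it to the usual generate endpoint, e.g.:
// fetch("http://localhost:5001/api/v1/generate", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(payload),
// });
```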
(I couldn't follow the whole conversation, but amen to this; it would give CPU users the ability to switch characters instantly rather than in 10 minutes.)
Where do you switch characters? Wherever it is, does it rewrite the top of the history?
So, something like Context = System prompt + Character1 + History + Last reply + New response, and you would switch Character1 to the Character2 of a hypothetical second context, which would re-evaluate only the last two responses instead of the whole history?
P.S. The ability to keep more than one context simultaneously, to flawlessly switch between them, is a different feature; for now you could probably try to abuse multiuser mode?
I'm sorry, I meant something simple, could we save the current cache state to here in this menu:
From the demo in llama.cpp, loading the prompt cache takes a few seconds and one can continue where they left off in the conversation. When running this on mobile, reprocessing takes time :)
@aleksusklim if character2 had a different memory, yes everything would be reprocessed.
@BarfingLemurs sorry, loading cached prompt from disk is not supported at this time.
if character2 had a different memory, yes everything would be reprocessed.
Even in multiuser mode? (I haven't used it yet, but I assumed that different "users" would have their own context in RAM. Do they not?)
It's all working. The combination of "memory" and "context shift" has worked out very well. Of course, models tend to ignore what comes at the beginning of the context, but I'm very happy that they are now guaranteed to receive that information.
I wonder how the new "authorsnote" field mechanism works (if it is already implemented). As far as I understand, this information is inserted at the end of the context. But at the very end should be the name of the character who is to respond. While everything is simple with the "memory" field, it's not clear with "authorsnote".
While testing 13B-Psyfighter2 from KoboldAI, I decided to also test ContextShift on a real task.
I extracted the text from a "visual novel" game and gave it to the model, asking it to choose the next action from a list of possibilities each time (I used the Instruction+Input+Response format, where the "input" denotes the list of actions).
I put the introduction of the game into Memory, ending it with something like ### SYSTEM: Some of your past actions here are skipped due to memory limit. to make the model aware of it explicitly.
Each time the model reaches one of the endings, I tell it that to restart and continue playing, it needs to summarize the game plot, recap its past actions, and explain its future plans for winning the game. (So that it could at least see this summary when the real context gets shifted away.)
It worked fairly well, with ContextShift rolling an 8k context perfectly! But I see two reasons why the player would want to edit his previous turn (ultimately leading to a full re-evaluation):
<|system|>Exit RP mode. Now …
with mythalion earlier) but now explain its last action instead. This is working as expected (I see that the model indeed takes its actions on purpose, or at least pretends to do so), but since this is just a side test, I want to remove it from the story to continue the game as usual. But the context has shifted!

@LostRuins, I see two methods by which you could attempt to address this. The first one is an explicit "save my context to this file" local command somehow directed to koboldcpp. This way the player would be able to save as many checkpoints as he wants.
The second is multiple context-checkpoints in memory. For example, there can be three contexts, ABC. From the start, only context A is used. Just before ContextShift kicks in, the context A is forked to slot B. The shifting continues on B until there is a fair amount of new tokens generated (either set by user in server settings, or somehow estimated like SmartContext thresholds). Then the context B forks to C, and from now on, each time the same threshold triggers – the oldest context is replaced by the copy of the newest one.
When there is a context mismatch and your code decides to regenerate everything, you check the other two contexts (from newest to oldest) to see whether one of them would still be useful. If so, the chosen one replaces the newest copy, evaluates everything that is new in the prompt (including the beginning of what had already been BLAS-processed, but in a different copy) and shifts it properly. It is better to re-evaluate, say, 1024 tokens than almost 8192.
The reason for having three contexts instead of two is that with just two you cannot have a smooth transition: the change will be abrupt, and the user could still break both of them by editing right after the switch occurred.
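The rotation I'm describing can be sketched in a few lines. This is a toy model: plain strings stand in for opaque KV-cache blobs, and all names (`CheckpointRing`, `forkThreshold`) are hypothetical, not anything existing in KoboldCpp:

```javascript
// Toy sketch of a three-slot context checkpoint rotation.
// Strings stand in for KV-cache blobs; names are illustrative only.
class CheckpointRing {
  constructor(forkThreshold) {
    this.slots = [null, null, null];    // oldest ... newest checkpoint
    this.tokensSinceFork = 0;
    this.forkThreshold = forkThreshold; // new tokens before the next fork
  }
  // Called after every generation with the current live context.
  update(liveContext, newTokens) {
    this.tokensSinceFork += newTokens;
    if (this.slots[2] === null) {
      this.slots[2] = liveContext;      // first fork: A -> B
    } else if (this.tokensSinceFork >= this.forkThreshold) {
      this.slots.shift();               // drop the oldest checkpoint
      this.slots.push(liveContext);     // newest copy takes its place
      this.tokensSinceFork = 0;
    }
  }
  // On a context mismatch, look for the newest checkpoint that is still
  // a prefix of the edited prompt, instead of re-evaluating everything.
  findUsable(editedPrompt) {
    for (let i = this.slots.length - 1; i >= 0; i--) {
      const c = this.slots[i];
      if (c !== null && editedPrompt.startsWith(c)) return c;
    }
    return null; // nothing matches: full re-evaluation
  }
}
```

The point of `findUsable` going newest-to-oldest is exactly the "check the other contexts from new to old" step: an edit close to the end still matches an older checkpoint even when it invalidates the newest one.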
Personally, I still think that the saving to file is much better than rotating contexts in memory, because:
You can include the JSON of the story in the context file too, so it could be loaded with the whole synchronized text! From the user's point of view, it could be understood as a "special copy of the history, which is very large but can be continued instantly when loaded with the same model" (meaning you can probe the file against the current model and koboldcpp version, and in case of mismatch grab only the JSON from it, clearly showing a warning to the user).
Yes, I know, the binary format is yet to be defined/implemented. But I think context shifting itself has changed something in the in-memory representation of the context in llama.cpp (for example, decoupling RoPE transformations from the data itself), right?
Isn't there a new way to swap/grab/change the context, so it can be "just" extracted as a blob?
Llama.cpp actually has explored some methods to snapshot the context and save to disk. However, they are not implemented in KoboldCpp for multiple reasons, mainly that it doesn't work well with existing shifting and fast forwarding methods, and also requires massive amounts of disk space - on the order of gigabytes, just to store one state. Writing and loading from disk is not as fast as you'd imagine, even on SSD it will take some time for such large amounts of data to be written and read.
I disagree with both points.
Firstly, the user has ALREADY "paid" a lot of disk space to run models: each 13B model takes 7-9 GB, and one model is never enough because each week we get new and better ones! Anybody who seriously decides to follow the LLM or diffusion world quickly understands that everything will take up disk space. Personally, I've started buying Samsung 2TB NVMe (or SATA SSD) drives for all my available machines, both PCs and laptops. (This is the most cost-effective choice, considering quality, price and size.)
Secondly, reading back a file that has just been written or read is almost as fast as copying memory, because it is still in the file system cache. (As long as the application reads in large chunks rather than byte by byte, of course.)
For example, on a rather weak notebook I can benchmark like this, on a 4.77 GB file:
7z h -bt synthia-7b-v1.2.Q5_K_M.gguf
The first run prints 12.305 (seconds); if I run it again, it takes 5.159.
This copies the file without compression:
7z a test.7z -mx=0 -bt synthia-7b-v1.2.Q5_K_M.gguf
Execution took 10.815. To confirm the "read after write" case, I run:
7z h -bt test.7z
Result: 5.066.
The only reason the FS cache won't help is a lack of free RAM, which is rather likely since the user already has an LLM loaded into memory. But it all comes down to how much RAM you have. On a machine with just 16 GB I can barely fit a 13B model. And if I had 32 GB, there would be 16 free gigabytes for the context cache, even when it is stored in files! How much context can you fit into that? I believe a whole two separate contexts.
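As a back-of-the-envelope check of that guess (assuming the usual 13B LLaMA-style dimensions of 40 layers and an embedding width of 5120; whether the KV cache is stored as fp16 or fp32 depends on the build):

```javascript
// Rough KV-cache size estimate for a 13B LLaMA-style model.
// 2 = keys + values; per layer, per position, per embedding element.
function kvCacheBytes(nLayers, nCtx, nEmbd, bytesPerElem) {
  return 2 * nLayers * nCtx * nEmbd * bytesPerElem;
}

const GiB = 1024 ** 3;
const fp16 = kvCacheBytes(40, 4096, 5120, 2) / GiB; // 3.125 GiB
const fp32 = kvCacheBytes(40, 4096, 5120, 4) / GiB; // 6.25 GiB
console.log(fp16, fp32);
```

At fp32 that is 6.25 GiB per full 4096-token context, i.e. roughly two contexts in 16 free gigabytes; at fp16 it would be about four.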
On my fast PC the same benchmark as above gives:
First hashing = 3.226
Second hashing = 1.996
Archiving: 2.902
Hashing it: 2.019
The irony is that when you have a lot of memory, you probably have a good processor too, and maybe a GPU, so full re-evaluation is not as slow as on a low-tier computer, for which context saving would be the most beneficial.
But still, I don't believe that context export and later re-import (after a system crash, for example) will take more time than a full BLAS pass, even when the user doesn't have enough free memory to implicitly keep the FS cache around.
I've read the pingback from https://github.com/SillyTavern/SillyTavern/issues/1316 (Basically they want WorldInfo to be injected "somewhere near to the end" of the history, but this is "an edition of the history above the last model reply" which triggers the full re-evaluation).
@LostRuins, can ContextShift be fixed so that it won't regenerate on large edits of the history?
I think there is a threshold that tries to determine "is this a continuation of the previous story, or a completely new one?", and this threshold is too pessimistic. It should allow as much as 50% of edits down the line (for example, with a 4096 context, when the story is already 8192 tokens long, check for 6144 matching tokens but not more). The longer the inherent context window, the better this would play.
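What I mean by a more optimistic threshold could look something like this. To be clear, this is a hypothetical heuristic over token arrays, not KoboldCpp's actual check: accept the cached context as "the same story" if, after trimming some amount from its start, at least half of the remaining cached tokens still match the new prompt:

```javascript
// Hypothetical "same story?" check for a cached context vs. a new prompt.
// Tries trimming 0..maxTrim tokens off the cache's start, and accepts the
// first trim where at least half of the remaining cached tokens match.
function isSameStory(cachedTokens, newTokens, maxTrim) {
  for (let trim = 0; trim <= maxTrim && trim < cachedTokens.length; trim++) {
    const rest = cachedTokens.slice(trim);
    let matched = 0;
    while (matched < rest.length && matched < newTokens.length &&
           rest[matched] === newTokens[matched]) {
      matched++;
    }
    if (matched >= rest.length / 2) return { trim, matched };
  }
  return null; // too different: fall back to full re-evaluation
}
```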
So long as it's able to trim out the start part and match enough text in the middle part, contextshift won't require reprocessing. The amount that's required to match is actually very small.
So long as it's able to trim out the start part and match enough text in the middle part, contextshift won't require reprocessing.
Okay, I decided to prove it via API.
I've made a piece of javascript that sends requests for generation right from the browser. To run it, navigate to http://localhost:5001/api
and open the developer console to paste the script there.
What this code does:
I checked this with koboldcpp-1.50.1
on the model sciphi-mistral-7b-32k.Q5_K_M.gguf
but I don't think the model matters here. I chose a 7B just to make predictions faster.
Main context length should be set to 4096 as a reasonable real-world number.
And you know what? Sometimes BLAS reprocesses everything! But most of the time it does not. It is really random; I ran it about 10 times to make sure it always fails sooner or later.
It keeps printing [Context Shifting: …]
along with Processing Prompt [BLAS] (175 / 175 tokens)
(often between 170 and 190), but then suddenly, boom: BLAS shows 3823 tokens
The experiment then continues, with everything shifting correctly and reliably, until it breaks again! (You may have to wait for 5-50 generations, but this is fast.)
Why does context shifting fail sometimes? Have I missed something in my code? How should I trim the text properly?
Actually, I was preparing deeper experiments, but even this simple loop fails with some probability! I don't see the point in experimenting further until I understand what's going on here.
My guess is that your rotate-history function is not behaving correctly. There may be instances where it is off by some amount, and tokens that should have been excluded remain, or vice versa.
Try simplifying your example; for instance, just start by appending to the string instead of rotating it first.
if you want to see how context shift works under the hood, you can check out the code here https://github.com/LostRuins/koboldcpp/blob/concedo/gpttype_adapter.cpp#L593
There may be instances where it is off by some amount, and tokens that should have been excluded remain or vice versa.
I'm sure I am just removing the first line until it fits (.* does not match line breaks in JS):

```javascript
// Drop whole lines from the start until the text fits the target size.
while (len > target_context) {
  context = context.replace(/^.*\n/, ''); // remove the first line
  len = await token_count(context);       // re-measure via the API
  console.log(len);
}
```
Since you said that \n is safe to split on, as it never tokenizes differently, I should be safe too, right?
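For reference, the `token_count` helper used in that loop is something like the following, assuming KoboldCpp's `/api/extra/tokencount` endpoint (the exact response shape may differ between versions, so check your build's API docs):

```javascript
// Count tokens by asking the running KoboldCpp server.
// Endpoint and response shape assumed from the /api/extra/* API;
// verify against your koboldcpp version.
async function token_count(text) {
  const res = await fetch('http://localhost:5001/api/extra/tokencount', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: text }),
  });
  const data = await res.json();
  return data.value; // number of tokens in `text`
}
```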
start by appending to the string instead of rotating it first.
I tried! Then koboldcpp said something like "It's failed, check your buffers!" and I assumed it refuses to process anything that tokenizes to more than the allowed amount.
Am I wrong?
Yes, you are doing something wrong.
You need to make sure that the --contextsize
parameter that you launch koboldcpp with is greater than or equal to the max_context_length
that is sent over the API request. This will show up as a warning in the console if not configured correctly.
If configured correctly, you can send prompts as long as you like. The truncation will happen automatically and nothing will ever overflow or fail to run.
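For example, a minimal correctly configured request could look like this. The field names follow the KoboldAI-style `/api/v1/generate` schema; double-check them against your koboldcpp version:

```javascript
// Minimal generate call. The server's --contextsize must be >= the
// max_context_length sent here, or a warning appears in the console.
async function generate(prompt) {
  const res = await fetch('http://localhost:5001/api/v1/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      prompt,                   // may be arbitrarily long: the server trims it
      max_context_length: 4096, // must not exceed the server's --contextsize
      max_length: 120,          // number of tokens to generate
    }),
  });
  const data = await res.json();
  return data.results[0].text; // generated continuation
}
```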
If configured correctly, you can send prompts as long as you like.
Um-m… okay. I tested it again, most simply and directly. And it worked! Just as you said: no matter how long the context, it trims it internally on its own. So, what was the problem? Maybe I made a mistake when testing?
I've tested it again under different conditions (BLAS device, length of Memory…) and managed to get the error Failed to predict! Check your context buffer sizes!
again.
I believe my context size is 4096 and the request specifies the same value too.
Well, assuming you are right and I am wrong, I ran my previous code but with:
var worldinfo_lines = 10;
var target_context = 5000;
This will trim the context to 5000 tokens while specifying 4096 as the server's context size. I should either get correct shifting, or full re-evaluation (as before), right?
But I get error about the buffer size! Here are full logs for your examination:
koboldcpp-1.50.1.exe --model sciphi-mistral-7b-32k.Q5_K_M.gguf --port 5001 --host 127.0.0.1 --launch --threads 8 --contextsize 4096 --blasbatchsize 256 --skiplauncher
Hi @aleksusklim yes indeed I can repro your issue. It is actually a bit of an interesting situation.
This is the state of the KV cache from your example. kv_cache.txt
Notice the lines with pos = -1, which represent empty (unused) cells. Although there are enough of them for the batch of size 237, the batch cannot fit into the context because the empty space is fragmented across two places. This happened because you manually removed a different number of tokens from the start of the prompt yourself, instead of allowing the fast-forwarder to do it for you.
But yes, this is kind of undesirable behavior which doesn't really have a very good solution. What I would recommend is to set your max_context_length
less than 4096 while keeping the --contextsize
at 4096. Perhaps setting max_context_length
to about 3700 should work.
What I might do to reduce such occurrences in the future is automatically allocate a bit of additional KV cache space, so it's more resilient against fragmentation.
yes indeed I can repro your issue
Oh, thank you! I was afraid that I had already become annoying to you ))
Although there are enough for the batch of size 237, the batch cannot fit
What is "batch" here, the amout of tokens it has to process from my prompt? Can it do that twice? Or for as much as it could, fitting each minibatch to the next continuous stride.
Isn't that the problem with llama.cpp upstream too?
This happened because you manually removed a different amount of tokens from the start of the prompt yourself
Are you sure this is the reason, and not that my code is also inserting a dummy line near to the end of the history, simulating "edition of the previous turn"?
instead of allowing the fast forwarder to do it for you.
If you are matching the context in cache against the new user prompt, then it should not matter whether the prompt was truncated (from the beginning) or not: even if it was truncated, but a true match is found, your code should behave just as if it had truncated it on its own. Why is it different?
I mean, suppose the history was
MEM 123 abc 456 def 789
and the new prompt is
MEM 456 def 000 xyz
then, the algo is:
1. MEM — the common prefix (but if the dedicated Memory field was set, use only that without guessing).
2. 456 def — the matching middle part to fast-forward to.
3. 123 abc — the span that gets shifted out.
4. 789 — the mismatched tail to discard.
5. 000 xyz — the only new part that needs evaluation.

Is this how it works? Or are there other tricky moments that make one of these steps impossible? To me, this algorithm looks very robust:
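My understanding of those steps, as a sketch over token arrays (this is only my reading of how fast-forward plus shift could work, not KoboldCpp's actual implementation; `planReuse` is a hypothetical name):

```javascript
// Plan how much of a cached context can be reused for an incoming prompt
// whose start was manually truncated. Tokens are compared by value.
function planReuse(cached, incoming) {
  // 1. Match the common prefix (the MEM part).
  let prefix = 0;
  while (prefix < cached.length && prefix < incoming.length &&
         cached[prefix] === incoming[prefix]) prefix++;
  // 2. Search the cache after the prefix for where the incoming text
  //    resumes (the "456 def" that survived the truncation).
  let resume = -1;
  for (let i = prefix; i < cached.length; i++) {
    if (cached[i] === incoming[prefix]) { resume = i; break; }
  }
  if (resume < 0) return null; // nothing to fast-forward: full re-evaluation
  // 3. The span between prefix and resume ("123 abc") gets shifted out.
  // 4. Count how much still matches after the resume point; the
  //    mismatched tail ("789") is discarded.
  let matched = 0;
  while (resume + matched < cached.length &&
         prefix + matched < incoming.length &&
         cached[resume + matched] === incoming[prefix + matched]) matched++;
  // 5. Only the genuinely new tail ("000 xyz") needs evaluation.
  return {
    shiftOut: [prefix, resume],              // cached span removed by the shift
    reused: prefix + matched,                // incoming tokens already cached
    toEvaluate: incoming.length - (prefix + matched),
  };
}
```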
In reality, I still don't understand what the model "feels" when its memory is shifted. Does it "see" the gap? Imagine a 16k context with 1k of memory and 3k of active history at the end: will the model "understand" that there were 12k of "something" it can no longer comprehend, or will it see just 4k as a direct concatenation? (Testing this directly is hard since we don't have a way to save/load contexts yet…)
Perhaps setting max_context_length to about 3700 should work.
What is it even doing? How is it used in the calculations, from the logic point of view? I thought it was only for trimming the text in Lite (which is inherently imprecise, since Lite cannot tokenize properly while trimming). The server ignores the value when it is larger than its own (no memory is allocated for it), right? But what happens when it is smaller?
I had asked you about it in this very conversation earlier:
Your client (browser) knows the context length too, and can adjust it. For example, your server may be running 4096 while your client is still on 2048 tokens. (Why? Isn't it better to always defer to the server value? What is the point of having a smaller context size in Lite?) You should always open Settings and move the Max Ctx. Tokens slider to the right, making sure the numbers above and below its right end are equal.
I mean, the context of "current" generation, that is super-fast for regeneration of latest action, and for minor editing of history.
I propose adding an option/parameter that points to a local binary file. If it exists, koboldcpp should read the context from there. While this option is active, koboldcpp should update the file with each new context after generation. (Read once at start, rewrite/append during runtime.)
I can name two main reasons, why this will be extremely useful:
Yes, I understand that any accidental move can easily destroy the cache (loading the wrong history, adding a space at the start of the text, etc.), which would ultimately lead to full regeneration. But if I play locally and alone, that would be only my own fault. Barring that, I could restart my system anytime and continue playing instantly later.
Also, for use-case about big story templates, you might give an additional option for this context to be read-only, as in https://github.com/ggerganov/llama.cpp/pull/1640