ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug when prompt stored in `--prompt-cache` is longer than the new one #1585

Closed aleksusklim closed 1 year ago

aleksusklim commented 1 year ago

Edit: @DannyDaemonic here. (Sorry for the edit.) Here's a much simpler example. First build the cache like this:

./main -m /path/to/llama/bin --prompt-cache Z.cache -p "Here's a funny joke. No, wait, here's a bunch of Zs: Z Z Z Z Z Z Z Z Z Z"

Then try this:

./main -m /path/to/llama/bin --prompt-cache Z.cache -p "Here's a funny joke."

The joke will start with a Z every time. Perhaps the logits are not being reevaluated for some reason. Changing one token, even the very last, seems to work around the bug. The fix is to recalculate the logits.


What happens when a prompt stored in --prompt-cache is "longer" than the current one?

I want to make a wrapper around main.exe, but this behavior looks strange and buggy. I don't know whether it is a bug or not, but it is very, very confusing. I may post it as a separate issue if it really is a bug that cannot be solved in this PR.

So, here are my steps: (on version ee9654138ab0ae5f138f4abddf56ca234ea3c352 and this model)

main.exe -t 6 -m WizardLM-7B-uncensored.ggml.q5_1.bin -c 512 --temp 0 --repeat_penalty 1.2 --prompt-cache cache1 -f in.txt >out.txt (note the cache file and zero temperature)

My prompt inside in.txt is this:

### Instruction:

Describe your last dream in a few words.

### Response:

My

The model outputs this text inside out.txt (with an extra space before the first line, but let's assume I stripped it manually):

### Instruction:

Describe your last dream in a few words.

### Response:

My last dream was about being chased by a group of people through a forest, but I eventually found safety inside a church.

It also creates the prompt-cache file cache1 with a size of around 13 Mb, and writes to stderr:

main: attempting to load saved session from 'cache1'
main: session file does not exist, will create

If I repeat the same command as-is – it recreates the same text and does not update the cache file, with stderr being:

main: attempting to load saved session from 'cache1'
main: loaded a session with prompt size of 26 tokens
main: session file has exact match for prompt!

Then I copy "cache1" file to cache2 file. I also put the resulting text back into in.txt, this time cutting it after the last comma, so it becomes:

### Instruction:

Describe your last dream in a few words.

### Response:

My last dream was about being chased by a group of people through a forest,

Then I run, pointing to cache2 (which is the same as cache1 for now): main.exe -t 6 -m WizardLM-7B-uncensored.ggml.q5_1.bin -c 512 --temp 0 --repeat_penalty 1.2 --prompt-cache cache2 -f in.txt >out.txt

It gives the exact same line, continuing with but I eventually found safety inside a church. and stopping as before. But this time, cache2 is updated to 21 Mb. So far so good! Its stderr said:

main: attempting to load saved session from 'cache2'
main: loaded a session with prompt size of 26 tokens
main: session file matches 26 / 42 tokens of prompt

Finally, I copy cache2 to cache3 and cut the prompt back to My, just as in my very first in.txt. I run, pointing to cache3, and get the following:

### Instruction:

Describe your last dream in a few words.

### Response:

My butterfly had wings of fire and flew through the night sky while I chased after it, trying to catch it so that we could be together forever.

File cache3 stays binary-identical to cache2, and the program outputs these lines to stderr, among others:

main: attempting to load saved session from 'cache3'
main: loaded a session with prompt size of 42 tokens
main: session file has exact match for prompt!

What just happened? To me, it looks like the program compared only the head of the prompt, discarding the cached tail; then it assumed the cache was valid (even though it is not) and continued the generation from that cache, ending up in a broken state.
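For illustration, here is a minimal plain-C++ sketch of that kind of prefix matching, with made-up variable names and token ids rather than llama.cpp's actual code. It only shows why a shorter prompt can report "exact match" while the cache still holds a longer, stale tail:

```cpp
#include <cstdio>
#include <vector>

// Count how many leading tokens of the new prompt match the cached session.
// Names and numbers here are illustrative; this is not the actual main.cpp code.
static size_t count_matching_prefix(const std::vector<int> & session_tokens,
                                    const std::vector<int> & prompt_tokens) {
    size_t n = 0;
    while (n < session_tokens.size() && n < prompt_tokens.size() &&
           session_tokens[n] == prompt_tokens[n]) {
        n++;
    }
    return n;
}

int main() {
    std::vector<int> cached = {1, 10, 11, 12, 13, 14};  // stands in for the 42-token session
    std::vector<int> prompt = {1, 10, 11, 12};          // stands in for the shorter "My" prompt

    size_t n_match = count_matching_prefix(cached, prompt);
    if (n_match == prompt.size()) {
        // Every prompt token was found in the cache, so the log says "exact match".
        // The extra cached tokens (and the logits left over from the old final
        // token) are what leak into the new generation.
        printf("session file has exact match for prompt (%zu / %zu cached tokens)\n",
               n_match, cached.size());
    } else {
        printf("session file matches %zu / %zu tokens of prompt\n",
               n_match, prompt.size());
    }
    return 0;
}
```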

My questions:

1. Is it possible to "go back" by any number of tokens in the cache? Can we cut the prompt at an arbitrary place and keep the cache valid? Or is it only valid "after the final token", so we can go forward but cannot start earlier?
2. If we cannot go back, is it possible to "restore the text" from the cache and output it explicitly? Then the program would show what is actually in the cache and what is causing its weird answers.
3. If yes, can we store "checkpoints" in the cache with the full intermediate state, so that in my example, after seeing "My" at the end, it would use the older state it already had; or, if I stop at "My last dream", it would continue with "was about being chased by a group of people through a forest," (since that state was also recorded) and not break there?
4. If yes, why can't we "discard all of the extra states" when seeking? For example, for "My last dream was dark": backtrack to "My" (which is stored), then evaluate just "last dream was dark" and generate from there, just as if I had put a copy of an older cache in place myself.
5. Can't it at least throw an error with a mismatching cache (or discard it and regenerate from scratch) when it detects that the prompt is different?
6. Can we specify a "do not update cache" command-line option? Or "write new cache to the specified file" (which will also accept NUL to skip it), so it will not destroy the previous state? Currently my only option is to copy the cache file each time before invoking the program, which delays startup because of OS filesystem cache eviction when the model barely fits in RAM already.

And a few minor things:

- Why not run the evaluation with --n-predict 0 if I only want to cache the prompt? I have to specify --n-predict 1 just to cache the prompt, but that prevents me from reusing the (probably reparsed?) output file, since it will contain an extra word at the end; I have to use the initial file instead.
- Why print an extra space at the beginning of stdout? If I later swap the files (feeding out.txt back in as in.txt), it will grow each time (two spaces, three spaces…) and most likely destroy the cache. I had to strip it manually, but this feels odd.

KerfuffleV2 commented 1 year ago

To me, it looks like the program compared only the head of the prompt, discarding the cached tail; then it assumed the cache was valid

Yes, I think that's how it works. As for it being "invalid", it depends on what you mean.

I didn't write the code or anything, but the way saving sessions appears to work is that the current state of the random number generator is saved with the session. This means that if your prompt is a prefix of the saved one, the cached state is used. However, you'll be starting generation from the RNG state as it was after all the tokens in the saved session had been generated.

In other words, if you loaded a session with:

Now is the time for all good men to come to the aid of their

and the LLM generated party as the next token, then you loaded that session with the prompt

Now is the time

you wouldn't necessarily get for as the next token, because the RNG state would be from after their was generated previously.
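A small standard-library sketch (not llama.cpp code; the seed and draw counts are arbitrary) of why continuing a shorter prefix from a saved RNG state gives different draws than continuing it fresh:

```cpp
#include <iostream>
#include <random>
#include <sstream>

int main() {
    std::mt19937 rng(123);
    rng();                       // draw used for the token after "Now is the time"
    unsigned fresh_next = rng(); // what a fresh run of the short prefix would draw next

    std::mt19937 rng2(123);      // same seed, but run through the whole longer prompt
    for (int i = 0; i < 10; ++i) rng2();

    std::stringstream session;
    session << rng2;             // a "session file" stores this *late* state

    std::mt19937 restored;
    session >> restored;         // load the saved state
    unsigned restored_next = restored();

    std::cout << "continuing the short prefix fresh:     " << fresh_next << "\n";
    std::cout << "continuing it from the saved RNG state: " << restored_next << "\n";
    // The two values differ, so the sampled tokens differ even for the same prefix.
}
```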

Is it possible to "go back" for any amounts of tokens in the cache? Can we cut the prompt in arbitrary place and keep the cache valid?

If your definition of "valid" is "exactly reproducible as if this had been the initial prompt with that specific seed" then the answer is no. You'd only have that guarantee if your prompt exactly matches the one that was saved.

My changes actually can help you here a bit. If you load the session with a prefix and specify a certain seed, you will get the same tokens every time (assuming the seed wasn't negative which implies just choosing a random one).

I.e. ./main --seed 123 --prompt-cache blah.bin will give you the same result each time. Note: that result won't be exactly the same as the initial generation, though. What I'm saying only applies to loading the cached session.

Can't it at least throw an error with a mismatching cache (or discard it and regenerate from scratch) when it detects that the prompt is different?

The current behavior is useful for anyone who doesn't require exactly reproducible results.

If you're writing a program that's interfacing with llama.cpp, one thing you could do is store an additional file with some metadata like the full initial prompt. Then your application can ensure that the prompts match and regenerate if necessary.
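A hypothetical sketch of that wrapper idea, assuming a sidecar file such as cache1.prompt.txt next to the cache and an in.txt prompt file; both names are illustrative, not anything llama.cpp itself uses:

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Wrapper helper: remember the exact prompt that produced a cache file in a
// sidecar, and only trust the cache when the new prompt starts with it.
static std::string read_file(const std::string & path) {
    std::ifstream in(path, std::ios::binary);
    std::stringstream ss;
    ss << in.rdbuf();
    return ss.str();
}

int main() {
    const std::string cache_meta = "cache1.prompt.txt";  // sidecar written by the wrapper
    const std::string new_prompt = read_file("in.txt");
    const std::string old_prompt = read_file(cache_meta);

    bool cache_usable = !old_prompt.empty() &&
                        new_prompt.compare(0, old_prompt.size(), old_prompt) == 0;

    if (cache_usable) {
        std::cout << "reuse cache1 and evaluate only the new tail\n";
    } else {
        std::cout << "delete or ignore cache1 and evaluate from scratch\n";
    }

    // After running main with the (possibly rebuilt) cache, update the sidecar.
    std::ofstream(cache_meta, std::ios::binary) << new_prompt;
    return 0;
}
```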

Can we specify "do not update cache" command-line option? Or "write new cache to the specified file" (which will also accept NUL to skip it) so it will not destroy the previous state?

This sounds like a very useful feature. I personally don't like the current behavior of overwriting the saved state without warning.

Why print an extra space at the beginning of stdout?

Related to the previous question: it's not really a space, it just gets rendered that way. Models have some special tokens that control their behavior. One is the BOS (beginning of sequence) token, which LLaMA models expect at the beginning of the prompt. Generating an EOS (end of sequence) token is also how models indicate that their response is complete.

llama.cpp ensures that token always exists as the first token because it's required to get valid output. I think there have been previous issues about changing the behavior of actually showing those tokens.
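A conceptual sketch only (not the llama.cpp tokenizer API) of what "ensuring that token exists as the first token" means. It assumes the usual original-LLaMA vocabulary id of 1 for BOS; treat that id as an assumption rather than something to hard-code:

```cpp
#include <cstdio>
#include <vector>

static const int TOKEN_BOS = 1;  // assumed BOS id for the original LLaMA vocab

// Prepend BOS if the token sequence doesn't already start with it.
static void ensure_bos(std::vector<int> & tokens) {
    if (tokens.empty() || tokens[0] != TOKEN_BOS) {
        tokens.insert(tokens.begin(), TOKEN_BOS);
    }
}

int main() {
    std::vector<int> prompt_tokens = {319, 1799, 29901};  // made-up ids for illustration
    ensure_bos(prompt_tokens);
    printf("first token id after ensure_bos: %d\n", prompt_tokens[0]);  // prints 1
    return 0;
}
```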

aleksusklim commented 1 year ago

Thank you @KerfuffleV2 for your answer!

the way saving sessions appears to work is that the current state of the random number generator is saved with the session.

Is only the RNG saved, and not the whole context? I mean, if the model outputs "D" after "A B C" and the session is saved at "C", then I restart it at "A" and I expect "B" but it gives "E" – does this mean that it remembers "B C" (which were stored in the session), or does it just randomly restart from "A"?

I didn't understand whether the older session is interfering behind the scenes. I concluded that it is, because at zero temperature the RNG should not randomize anything, should it?

You'd only have that guarantee if your prompt exactly matches the one that was saved.

My first idea was to just track the prompt completely in my own wrapper, but when I saw that llama.cpp can continue arbitrary tails from the current cache, I assumed it would reasonably handle any prompt: longer, shorter, different, since it tracks the length and prints debug messages about the cache state.

If it only supported prompts that exactly match the cache, I wouldn't have asked my question, because caching simply wouldn't work for any changed prompt. But currently it feels like the program lies about its features: it cannot process shorter prompts, and it fails silently.

The current behavior is useful for anyone who doesn't require exactly reproducible results.

I'm fine with not reproducible, but I'm not fine with broken! When I played with the prompts, the model sometimes gave me "How can I help you?", as if it had skipped some parts of my prompt when the session was present.

I think there have been previous issues about changing the behavior of actually showing those tokens.

I don't like that I cannot swap the input with the output and re-run as-is to continue generation. Currently I'm stripping the first byte, but I'm afraid that in a future version this will be fixed and my wrapper will fail arbitrarily (for example, by stripping initial spaces that really were part of the prompt).

DannyDaemonic commented 1 year ago

@aleksusklim The prompt cache stuff is new, so what's happening here isn't the intended action.

This isn't related to the RNG being restored. The RNG bug was just another oversight.

KerfuffleV2 commented 1 year ago

Probably should throw in a disclaimer: I'm just a random person who made a couple of small contributions to this project. I'm answering to the best of my knowledge (which is far from complete), but what I say isn't official at all.

Is only the RNG saved, and not the whole context? I mean, if the model outputs "D" after "A B C" and the session is saved at "C", then I restart it at "A" and I expect "B" but it gives "E"

The way most random number generators work is after you generate a number, the state of the RNG is permuted in an unpredictable way.

It's kind of like if you have a pot with a whole bunch of marbles, each marked with a number. You take a marble out, shake the pot, take a marble out, shake the pot, etc. The state of the pot at the end of generation is what gets saved to the session file.

So in your example, you're starting generation after A but you're using the state of the pot after C had been generated. Because of that, you probably won't get the same sequence of tokens after loading a session where your prompt is just a prefix of what was saved.

My pull makes it so that saved RNG state is ignored when loading the session if you specify --seed. That way, after loading a specific prefix of the saved prompt (or the whole thing) the results are deterministic: you'll get the same result every time you load the session with that seed and that prefix.

Note, however, that this doesn't necessarily apply if you make the prompt you specify longer or shorter.

When I played with the prompts, the model sometimes gave me "How can I help you?", as if it had skipped some parts of my prompt when the session was present.

The examples you provided didn't show that occurring. The first question would be: are you absolutely, 100% positive that it wasn't user error? Something like accidentally leaving in that space you mentioned could cause that kind of effect, as could accidentally specifying a blank prompt, accidentally overwriting your saved session with a blank prompt, etc.

If you're positive it's not something you did wrong, then please show an (ideally reproducible) example of it happening.

Currently I'm stripping the first byte, but I'm afraid that in a future version this will be fixed and my wrapper will fail arbitrarily (for example, by stripping initial spaces that really were part of the prompt).

It sounds like you may be better off using llama.cpp as a library, or making your own copy of the main example and tweaking it. That way, you'd have more control over the output and behavior.

aleksusklim commented 1 year ago

The examples you provided didn't show that occurring.

I showed it at temperature 0; it should not be affected by the RNG in any way. Or am I wrong?

KerfuffleV2 commented 1 year ago

The examples you provided didn't show that occurring.

I showed it at temperature 0; it should not be affected by the RNG in any way. Or am I wrong?

I'm not 100% sure about that; it may depend on other sampling settings.

I was talking about where you said sometimes it skipped your prompt. RNG still having an effect when temperature is 0 may be an issue, but I wouldn't expect that to really have anything to do with session restoration.

aleksusklim commented 1 year ago

I was talking about where you said sometimes it skipped your prompt.

For me, the clear evidence that it is not skipping anything is that it outputs expected strings at zero temp.

I did not save my previous results (because I was just testing around and not yet thinking about filing a bug). I can try to come up with steps to show that too, if it can be triggered at zero temp, which I'll test later on a more recent commit.

KerfuffleV2 commented 1 year ago

Thanks to @DannyDaemonic, here's a reproducible example of some really weird prompt restoration behavior: https://github.com/ggerganov/llama.cpp/pull/1550#issuecomment-1561030560

However, I don't think it has anything to do with RNG (or at least not restoring the RNG state) because my pull allows ignoring that state and the issue still occurs.

ejones commented 1 year ago

I put some thoughts in the other thread. As suggested, I believe applying the final logits (the sampling input) and the RNG state to a prefix of the input leads to the unexpected sampling results (main indeed does not reevaluate anything currently if it gets an exact match). If that's the case, I think forcing at least one evaluation, say of the final token of the prompt, would help here to ensure the logits are regenerated. As far as reproducibility goes, though, you still have to grapple with RNG states.
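One way to read that suggestion, sketched with plain vectors and illustrative counts rather than the actual main.cpp patch:

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Illustrative numbers only: a 42-token prompt fully covered by the cache.
    std::vector<int> embd_inp(42, 0);          // tokenized prompt
    size_t n_matching_session_tokens = 42;     // cache matched the whole prompt

    if (n_matching_session_tokens >= embd_inp.size()) {
        // Exact match: the logits in memory still describe the *old* final token
        // of the longer cached run. Re-evaluate at least the last prompt token
        // so sampling starts from logits that belong to this prompt.
        n_matching_session_tokens = embd_inp.size() - 1;
    }

    size_t n_past   = n_matching_session_tokens;   // cached state to reuse
    size_t n_reeval = embd_inp.size() - n_past;    // tokens to evaluate now

    printf("reuse %zu cached tokens, re-evaluate %zu token(s)\n", n_past, n_reeval);
    return 0;
}
```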


KerfuffleV2 commented 1 year ago

@ejones When saving the session, the amount of data for the memory k/v, etc., is based on the number of tokens that had been generated. When loading a cached prompt where the supplied prompt is only a prefix, would it be possible to truncate things like the memory k/v to the lower number of tokens in the supplied prompt?

If so, it seems like that might fix DannyDaemonic's example.
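Conceptually (again on plain vectors and illustrative counts, not the real k/v buffers), the truncation would look something like this:

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Keep only the part of the cached state that the new prompt shares
    // with the old one; everything after the shared prefix is stale.
    std::vector<int> session_tokens(132, 0);   // tokens covered by the session file
    size_t n_matching = 70;                    // prefix shared with the new prompt

    if (n_matching < session_tokens.size()) {
        session_tokens.resize(n_matching);     // drop the stale tail
    }
    size_t n_past = session_tokens.size();     // amount of cached state worth keeping

    printf("kept %zu tokens of cached state, discarded the rest\n", n_past);
    return 0;
}
```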

aleksusklim commented 1 year ago

All right, I failed to replicate the "broken roleplay" at zero temperature as long as my prompt was never an exact prefix of the previous one.

main.exe -t 8 --temp 0 -m WizardLM-7B-uncensored.ggml.q5_1.bin -f in.txt --seed 42 --prompt-cache test

I used llama-master-ac7876a-bin-win-avx2-x64. Every time, my input file ends with two line feeds.

in.txt:

Write AI's next reply in a fictional roleplay chat between User and AI. Please, write only one line of text.
### Instruction:
User: Hello!
AI: Greetings. How can I help you today?
User: I will give you a sentence in my next reply. You should repeat it back to me verbatim but in "quotes". Did you understand me?
### Response:

It outputs AI: Yes, I understand. Please give me the sentence. and prints:

main: attempting to load saved session from 'test'
main: session file does not exist, will create

Cache size: 48 Mb

Next one:

Write AI's next reply in a fictional roleplay chat between User and AI. Please, write only one line of text.
### Instruction:
User: Hello!
AI: Greetings. How can I help you today?
User: I will give you a sentence in my next reply. You should repeat it back to me verbatim but in "quotes". Did you understand me?
AI: Yes, I understand. Please give me the sentence.
User: Why did the chicken cross the road?
### Response:

outputs AI: "Why did the chicken cross the road?" prints

main: attempting to load saved session from 'test'
main: loaded a session with prompt size of 96 tokens
main: session file matches 91 / 122 tokens of prompt

Cache size: 61 Mb

Third one:

Write AI's next reply in a fictional roleplay chat between User and AI. Please, write only one line of text.
### Instruction:
User: Hello!
AI: Greetings. How can I help you today?
User: I will give you a sentence in my next reply. You should repeat it back to me verbatim but in "quotes". Did you understand me?
AI: Yes, I understand. Please give me the sentence.
User: Why did the chicken cross the road?
AI: "Why did the chicken cross the road?"
User: Yes, that is correct! Thank you. Now, tell me any joke, it will be useful for you next task later.
### Response:

outputs AI: "Why don't scientists trust atoms? Because they make up everything." prints

main: attempting to load saved session from 'test'
main: loaded a session with prompt size of 122 tokens
main: session file matches 117 / 165 tokens of prompt

Cache size: 82 Mb

Finally, repeating the second prompt:

Write AI's next reply in a fictional roleplay chat between User and AI. Please, write only one line of text.
### Instruction:
User: Hello!
AI: Greetings. How can I help you today?
User: I will give you a sentence in my next reply. You should repeat it back to me verbatim but in "quotes". Did you understand me?
AI: Yes, I understand. Please give me the sentence.
User: Why did the chicken cross the road?
### Response:

outputs AI: "Why did the chicken cross the road?" prints

main: attempting to load saved session from 'test'
main: loaded a session with prompt size of 165 tokens
main: session file matches 117 / 122 tokens of prompt

Cache size drops back to 61 Mb

Does this mean that we can provide --prompt-cache transparently and without any side effects, once both issues, the "fixed seed" one and the "remembered last token on an exact prefix" one, are fixed and merged?

This actually renders my wrapper useless, because I had assumed that partially restoring a session was technically impossible! That's why I started experimenting with cache swapping.

But now it is enough to always provide the same cache file and not care about it anymore, while greatly improving the execution speed. There is no need for explicit cache-building runs if it is able to rebuild itself on demand.

This is very good! Here are the timings of the third prompt with the cache; it took 20 seconds, including loading the mmap'ed 7B model into 8 Gb of RAM:

llama_print_timings:        load time = 11392.75 ms
llama_print_timings:      sample time =    13.67 ms /    19 runs   (    0.72 ms per token)
llama_print_timings: prompt eval time = 10356.39 ms /    48 tokens (  215.76 ms per token)
llama_print_timings:        eval time =  7174.74 ms /    18 runs   (  398.60 ms per token)
llama_print_timings:       total time = 19031.30 ms

But here are the timings of the same prompt (leading to the same answer) without specifying the prompt cache; it took 40 seconds (and it would obviously get worse as the history gets longer):

llama_print_timings:        load time = 35239.61 ms
llama_print_timings:      sample time =    14.94 ms /    19 runs   (    0.79 ms per token)
llama_print_timings: prompt eval time = 34369.36 ms /   165 tokens (  208.30 ms per token)
llama_print_timings:        eval time =  7032.10 ms /    18 runs   (  390.67 ms per token)
llama_print_timings:       total time = 42295.04 ms

This is perfect for conversations that fit into the model context (since otherwise there would always be a brand-new generation with a heavily trimmed prompt after each line). Fast model loading with mmap makes it unnecessary to keep the application running in an interactive state (even if context loading were implemented there somehow), because it loads in a few seconds, especially on machines with a lot of RAM.

(I'm closing this issue in the hope that the PR will be merged; we can continue the discussion here if it feels more appropriate than in the PR; I will reopen it later if I find another replicable inconsistency.)

DannyDaemonic commented 1 year ago

@KerfuffleV2 It seems the fix to this is just to recalculate the logits.

@aleksusklim Let's leave this open until a PR that fixes this is merged, so that others who notice the issue can find an open issue.

aleksusklim commented 1 year ago

Am I understanding correctly that the ZZZ bug is not fixed yet? Only the seed randomization was?

I tried the new commit 66874d4fbcc7866377246efbcee938e8cc9c7d76/llama-master-66874d4-bin-win-avx2-x64 and noticed this code from the PR:

if (params.prompt.empty() && n_matching_session_tokens == embd_inp.size()) {
    fprintf(stderr, "%s: using full prompt from session file\n", __func__);

So it should be very useful for --prompt-cache-all to "just continue the current generation" iteratively. Let's try it out!

Contents of in.txt: (two \n at the end; does main.exe always strip the last one?)

Write AI's next reply in a fictional roleplay chat between User and AI.
### Instruction:
User: Hello!
AI: Greetings. How can I help you today?
User: Create a story about a bird that got lost in the magical forest.
### Response:

First invocation: main.exe -t 8 -m WizardLM-7B-uncensored.ggml.q5_1.bin --prompt-cache test -n 32 --prompt-cache-all -f in.txt Note that no seed is specified, and generation is limited to only 32 tokens.

stderr:

main: attempting to load saved session from 'test'
main: session file does not exist, will create

stdout:

### Instruction:
User: Hello!
AI: Greetings. How can I help you today?
User: Create a story about a bird that got lost in the magical forest.
### Response:
AI: Once upon a time, there was a small bird named Blue who loved to explore the world around her. One day, while she was flying through the

Next invocation: main.exe -t 8 -m WizardLM-7B-uncensored.ggml.q5_1.bin --prompt-cache test -n 32 --prompt-cache-all Note that there is no input file.

stderr:

main: attempting to load saved session from 'test'
main: loaded a session with prompt size of 101 tokens
main: using full prompt from session file

stdout:

Write AI's next reply in a fictional roleplay chat between User and AI.
### Instruction:
User: Hello!
AI: Greetings. How can I help you today?
User: Create a story about a bird that got lost in the magical forest.
### Response:
AI: Once upon a time, there was a small bird named Blue who loved to explore the world around her. One day, while she was flying through the sky, she found herself drawn towards a strange and mysterious forest. As she got closer, she realized that this was no ordinary forest - it was mag

Perfect! But suppose that I'm not happy with the result and I want to regenerate my entire request again. I'll make an extra copy of the cache file now; it is 66 Mb.

main.exe -t 8 -m WizardLM-7B-uncensored.ggml.q5_1.bin --prompt-cache test -n 32 --prompt-cache-all -f in.txt

stderr:

main: attempting to load saved session from 'test'
main: loaded a session with prompt size of 132 tokens
main: session file has exact match for prompt!

stdout:

Write AI's next reply in a fictional roleplay chat between User and AI.
### Instruction:
User: Hello!
AI: Greetings. How can I help you today?
User: Create a story about a bird that got lost in the magical forest.
### Response:
 magical forest, where trees whispered secrets and animals danced around fires at night. The bird, a beautiful creature with shimmering feathers,

This is exactly "broken roleplay" that I saw earlier – seems that it was triggered by this exact use-case "regenerate after continuing iteratively" (which I did by updating the cache manually via extra runs; now it is much simpler!)

The cache dropped to 50 Mb. But let's restore the previous copy of it, this time changing forest. to forest! before the "### Response:" line.

stderr:

main: attempting to load saved session from 'test'
main: loaded a session with prompt size of 132 tokens
main: session file matches 63 / 70 tokens of prompt

stdout:

Write AI's next reply in a fictional roleplay chat between User and AI.
### Instruction:
User: Hello!
AI: Greetings. How can I help you today?
User: Create a story about a bird that got lost in the magical forest!
### Response:
AI: Once upon a time, there was a little bird named Blue who loved to explore the world around her. One day, while flying over the magical

This is indeed a different seed (the model is just kind of overfitted and generally has very high confidence levels: https://huggingface.co/ehartford/WizardLM-7B-Uncensored/discussions/10). If I re-run the same generation (restoring the previous cache), it gives slightly different output, skipping the AI: prefix:

### Instruction:
User: Hello!
AI: Greetings. How can I help you today?
User: Create a story about a bird that got lost in the magical forest!
### Response:
Once upon a time, there was a small bird named Blue who loved to explore the world around her. One day, while flying over the enchanted forest

So, everything seems to be working as intended, except for the case with an EXACT prefix of the cached prompt. @DannyDaemonic, will you make another PR with the "recalculating the logits" fix, then?

BTW, can we have a "do not print initial prompt" option (neither the user's prompt, nor the cached one when the user's is empty)? This way stdout would contain only the newly generated content (and without that pesky space at the beginning!). Initially I thought that --verbose-prompt, aka "print prompt before generation", was exactly for this, i.e. printing the prompt, but actually it is more like a "debug prompt". If you need another Issue for this (and for the space, and for the -n 0), I can create it too.

KerfuffleV2 commented 1 year ago

Am I understanding correctly that the ZZZ bug is not fixed yet?

Yes, that's correct. My changes only make it so you don't have to repeatedly include the prompt when it's already saved in the session, and fix it so that --seed can actually apply when loading a cached prompt.

I saw DD's post, but I'm not really sure exactly what is required to recalculate the logits. I think something like part of the actual model evaluation might be required?

If you need another Issue for this (and for the space, and for the -n 0), I can create it too.

It would probably be better to put those things in a separate issue (or issues).

As for the pesky space, I'd guess you're probably pretty safe just stripping leading whitespace as a temporary workaround. I can't really imagine a situation where someone would deliberately want to put spaces at the start of a prompt. If you're taking user input to generate your prompts, you could possibly also strip leading whitespace for consistent behavior.
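For example, a tiny sketch of that temporary workaround (plain C++, with an illustrative input string):

```cpp
#include <iostream>
#include <string>

// Strip leading whitespace from the model's output before feeding it back in
// as a prompt, so repeated swaps of out.txt into in.txt don't accumulate spaces.
static std::string strip_leading_ws(const std::string & s) {
    size_t i = s.find_first_not_of(" \t");
    return (i == std::string::npos) ? std::string() : s.substr(i);
}

int main() {
    std::string out = " ### Instruction:\n...";   // output with the extra leading space
    std::cout << strip_leading_ws(out) << "\n";   // safe to reuse as the next prompt
    return 0;
}
```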