LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

ContextShift sometimes degrades output #550

Closed: h3ndrik closed this issue 3 months ago

h3ndrik commented 10 months ago

I'm trying storywriting with KoboldCpp. At some point the story gets longer than the context, and KoboldCpp starts evicting tokens from the beginning via the (newer) ContextShift feature. Sometimes this degrades the output significantly: it gets into repetition loops, barely writes correct sentences, and forgets who is doing what. That happens after the KV cache has been shifted. The story had been fine for the first 4k tokens or so (depending on context size) before that.

Does this also happen to other people? I'm not sure whether it happens every time; I'm pretty sure it doesn't always. Other times it keeps generating high-quality output. I've changed too many settings simultaneously and tried different models, so I can't really pin it down and make a solid statement.

I'm not sure what I'm doing wrong or if this is a bug. I have also set RoPE scaling and I regularly edit the last paragraphs before (re)generating more output.

(I really like the speedup with ContextShift, so disabling it won't be an option.)

Environment and Context

Platform: Linux (Debian), CPU only
KoboldCpp: on branch concedo, commit 0ca814e

$ make clean && make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1
$ python3 koboldcpp.py --threads 2 --contextsize 8192 --port 5001 models/LLaMA2-13B-Psyfighter2.Q4_K_M.gguf

Steps to Reproduce

Not sure; I'd like to hear other people's experiences. Generate long output past the (initial) context window, then keep generating more and more text.
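
For a more methodical reproduction, something like the sketch below could work. It is only illustrative and not part of the original report: it assumes a koboldcpp instance on port 5001 exposing the KoboldAI-compatible /api/v1/generate endpoint, and simply keeps extending a story until ContextShift has to evict tokens.

```python
# Illustrative reproduction sketch (not from the issue): repeatedly ask a running
# koboldcpp instance to continue a story until generation has gone well past the
# configured context window, forcing ContextShift to evict tokens.
# Assumes the KoboldAI-compatible HTTP API on port 5001; field names follow that API.
import requests

API = "http://localhost:5001/api/v1/generate"

story = "Once upon a time, in a small village by the sea, "

for step in range(200):  # enough rounds to overflow an 8192-token context several times
    payload = {
        "prompt": story,
        "max_length": 256,            # new tokens to generate per request
        "max_context_length": 8192,   # matches --contextsize above
        "temperature": 0.7,
    }
    resp = requests.post(API, json=payload, timeout=600)
    resp.raise_for_status()
    new_text = resp.json()["results"][0]["text"]
    story += new_text
    print(f"--- step {step}: story is now {len(story)} characters ---")
    print(new_text)
```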

Z95070 commented 10 months ago

I usually run into this sort of thing in ST + KCPP when using functions like "Continue" twice on the same message, or when editing semi-recent messages to take an alternate plot path or something like that. It's happened far less since the recent KV cache changes, though, and it never seems to happen under normal use, only when I start getting ballsy with mass edits and weird third-party functions.

LostRuins commented 10 months ago

It may be worth keeping an eye on the number of tokens ContextShift has evicted. You should also compare against cases where ContextShift is disabled.

Are you running 1.51.1?

h3ndrik commented 10 months ago

Are you running 1.51.1?

Yes, sorry, I just wrote the commit ref; that is 1.51.1. I pull about once per week. Don't want to miss out on the nice stuff you implement ;-)

It may be worth keeping an eye on the number of tokens ContextShift has evicted. You should also compare against cases where ContextShift is disabled.

Sure, I need a more methodical approach to dig down further. Why do you say I should keep an eye on the numbers? What am I looking for? Does it matter if I let it evict 100 tokens or 400? And is there a difference between evicting 200 tokens once and evicting 50 tokens four times in a row? I'm sorry, I have a bit of a hard time understanding all of this. Machine learning isn't really my field of expertise, and this is rather advanced stuff once it gets down to the specific implementation.

I suppose I use the feature like most people would, to "stream" longer output and re-use the KV cache. I've read the StreamingLLM paper; I suppose "ContextShift" plays out a bit differently for that use-case? How does this feature compare to their findings? Judging by KoboldCpp's debug output, I'd say you don't keep 4 "attention sinks" around? It tells me it evicts starting at token 2.

Edit: I had a quick look at the code. It seems the implementation just shifts the sequence numbers in the cache and doesn't touch any of the cached values. Doesn't that practically boil down to window attention (without re-computation) from the StreamingLLM paper? ( https://github.com/mit-han-lab/streaming-llm/issues/33 )
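
For readers unfamiliar with the mechanism, here is a conceptual sketch, in Python rather than the actual C++ code, of what shifting sequence positions without recomputing values amounts to; the data layout is invented purely for illustration.

```python
# Conceptual sketch only (the real koboldcpp/llama.cpp code operates on tensors,
# not Python lists): context shifting evicts the oldest n_discard cache entries
# and renumbers the positions of the rest, reusing the cached keys/values instead
# of recomputing them. Functionally this is a sliding window over the cache.
def context_shift(kv_cache, n_discard):
    """kv_cache: list of (position, key, value) tuples in position order."""
    kept = kv_cache[n_discard:]                  # drop the oldest tokens
    return [(pos - n_discard, key, value)        # shift remaining positions down
            for (pos, key, value) in kept]
```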

bucketcat commented 9 months ago

This is the default tokenizer used in Llama.cpp being shite and broken. Something about the implementation affects things outside of just tokenization. Using the GPT-2 or NAI tokenizer through ST resolves this, but often breaks context shifting. I have brought this up many times privately with lostruins, but pinpointing the exact issue is a bit hard. Try using a different tokenizer and it should resolve the issues.

In my case, it starts removing the word "the" and shortening sentences a lot, becoming very blunt.

TL;DR: don't use Lite, and don't use the auto/API tokenizer.

aleksusklim commented 9 months ago

@LostRuins, could Lite (optionally) avoid trimming the text itself and always send the whole history to the backend? Wouldn't that resolve the issue with the tokenizer?

h3ndrik commented 9 months ago

In my case, it starts removing the word "the" and shortening sentences

That has also happened to me.

the default tokenizer used in Llama.cpp being shite and broken

What kind of magic does the tokenizer do? I thought tokenizing was a very straightforward operation? And how do I change the tokenizer in KoboldCpp?

could Lite (optionally) avoid trimming the text itself and always send the whole history to the backend? Wouldn't that resolve the issue with the tokenizer?

I don't think that works. The context window has a fixed size, you can't fit more than that into it.

aleksusklim commented 9 months ago

I don't think that works. The context window has a fixed size, you can't fit more than that into it.

The Koboldcpp backend should do the trimming (token-wise), not the Lite client (by words). More discussion: https://github.com/LostRuins/koboldcpp/issues/445#issuecomment-1829614232

If the problem persists, we would know that it's not Lite's fault, but either a bug in koboldcpp or in llama.cpp upstream library.

LostRuins commented 9 months ago

@aleksusklim there is nothing wrong with the tokenizer, as far as I can tell. You can view the tokens in context with --debugmode. It is not possible to use a different tokenizer from any frontend; prompts are sent as a string and only tokenized on the backend. You can only choose where to truncate the prompt before you send it.

h3ndrik commented 9 months ago

I'm sure this is not the frontend.

I found a similar bug report in llama.cpp: https://github.com/ggerganov/llama.cpp/issues/4097

kaetemi commented 7 months ago

Found this thread earlier while looking at the same issue in llama.cpp. If I do any long chat conversation with context shifting, it does indeed end up repeating endlessly. However, it doesn't seem to be entirely inherent to the context shifting itself (although maybe partially, since there's more garbage looping into the calculations already). If I re-run the same post-shift prompt tokens as-is in a blank slot, the output is already quite degraded as well.

To me it seems to be mostly a matter of excessive repeated use of tokens from the output going into a feedback loop. I'm not sure about increasing token penalties, since that affects other important tokens as well. I solved it in my application by tracking the last 18 sentences or lines written by the model. Then, whenever a new complete sentence is generated, I check that it has at least 50% fresh tokens compared to each of those 18 lines individually. (So if any of the last 18 sentences is more than 50% the same as the new one, the new line gets rejected.) This seems to work well so far in my testing, without affecting prompt-related tokens. Output after many context shifts stays very coherent with that filter in place. It even works well for long generation sequences with no user input.
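
A minimal sketch of that filter, assuming whitespace tokenization in place of the model's real tokenizer (the 18-sentence window and 50% threshold are taken from the comment; everything else is invented for illustration):

```python
# Sketch of the repetition filter described above (not kaetemi's actual code).
# A newly generated sentence is rejected if it shares more than 50% of its tokens
# with any one of the last 18 accepted sentences.
from collections import deque

RECENT_LIMIT = 18   # how many recent sentences to compare against
FRESH_RATIO = 0.5   # maximum allowed overlap with any single recent sentence

recent_sentences = deque(maxlen=RECENT_LIMIT)

def is_fresh(sentence: str) -> bool:
    new_tokens = sentence.split()
    if not new_tokens:
        return False
    for old in recent_sentences:
        old_tokens = set(old.split())
        overlap = sum(1 for tok in new_tokens if tok in old_tokens)
        if overlap / len(new_tokens) > FRESH_RATIO:
            return False        # too similar to a recent sentence: reject
    return True

def accept(sentence: str) -> bool:
    """Call on each complete sentence; returns False if it should be re-sampled."""
    if is_fresh(sentence):
        recent_sentences.append(sentence)
        return True
    return False
```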

MoonRide303 commented 6 months ago

Not all windowing methods are able to prevent degradation of perplexity, but the results from the attention sinks (StreamingLLM) method look pretty nice - it looks worth implementing:

[image: perplexity results for different windowing methods, from the StreamingLLM paper]

I've found open issue for implementing this in llama.cpp, here: https://github.com/ggerganov/llama.cpp/issues/3440.

h3ndrik commented 6 months ago

I've found open issue for implementing this in llama.cpp, here: ggerganov#3440.

llama.cpp has implemented this now. I suppose it's just leaving the first n tokens (4) in place when shifting all the context.

MoonRide303 commented 6 months ago

I've found open issue for implementing this in llama.cpp, here: ggerganov#3440.

llama.cpp has implemented this now. I suppose it's just leaving the first n tokens (4) in place when shifting all the context.

I am not sure if it would work exactly as in the paper (it would require digging into the original and transformers implementations, then comparing them with how it's done in llama.cpp, then running something like a 1M-token test), but yeah, using 4 or 8 initial tokens is part of this method. To work as described in the paper, setting up the attention sink(s) (n_discard) is also required - but I don't see this option available as a launch parameter for main or server.
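
For comparison with the plain window shift sketched earlier, an attention-sink style shift pins the first few tokens and evicts from just after them. The sketch below is illustrative only; n_keep and n_discard are named after the parameters used in llama.cpp's example programs, not taken from an actual implementation.

```python
# Conceptual sketch of an attention-sink style shift (illustrative, not the
# llama.cpp implementation): the first n_keep tokens stay pinned as "sinks",
# the n_discard oldest non-sink tokens are evicted, and everything after them
# is shifted down so that cache positions stay contiguous.
def shift_with_sinks(kv_cache, n_keep=4, n_discard=256):
    """kv_cache: list of (position, key, value) tuples in position order."""
    sinks = kv_cache[:n_keep]                         # always-kept initial tokens
    rest = kv_cache[n_keep + n_discard:]              # drop the oldest non-sink tokens
    shifted = [(pos - n_discard, key, value) for (pos, key, value) in rest]
    return sinks + shifted
```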

aleksusklim commented 6 months ago

Isn't the dedicated Memory field already doing that work?

Just put four newlines there and you're safe. If you have actual memory content (which you already should, because otherwise how do you think the model would behave regarding its system prompt?), everything will work as-is in koboldcpp.

h3ndrik commented 6 months ago

I am not sure if it would work exactly as in the paper

Sure, I've had disagreements about that paper before. I think just keeping the four tokens will solve this particular problem, though. The rest of StreamingLLM is probably never going to get implemented; they closed the issue in llama.cpp last week, and there doesn't seem to be much interest in this method in general.

Just put four newlines there and you're safe.

Thanks. That's a good idea that somehow skipped my mind. However, it'd probably be nice to have this as a setting that is enabled by default when nothing is pinned to the first tokens, since I think this is kind of a non-obvious workaround. And messing with the first tokens probably never works well?!

aleksusklim commented 6 months ago

I think this is kind of a non-obvious workaround.

This is much more straightforward than pinning random empty tokens at the beginning of the context! Most of the time your context looks like this:

  1. System prompt ("You are a helpful assistant…")
  2. First question ("How do I…")
  3. First answer ("To do this you need to…")
  4. Follow-up questions ("But I said that I want…")
  5. etc.

Or, for roleplay, you get:

  1. System prompt ("Enter RP mode…")
  2. Character cards ("John is a demon-fighter…")
  3. Scene and story settings ("Act as John in first-person from now on…")
  4. Your turn ("You found yourself in the middle of…")
  5. Model's turn ("I glance around and see…")

Do you really think just "keeping 4 tokens from the start" would not cause degradation of quality? The model will no longer see its system prompt and would have to infer the needed behavior in a few-shot / in-context manner!

I believe for question-answering you should pin either just the system prompt, or system + a few on-topic question-answer pairs. (In case of empty system prompt – then yes, you'll need at least 4 of "something" there…) And for the roleplay mode – definitely system + all character cards + story settings.

Personally, I did exactly this when ContextShift came out, and it worked well (except for accidental re-evaluations). Then after Yi-34b the context window was so large that there was no need for shifting in regular usage (but for a few specific cases I thought it could be useful, for example if the model should somehow play the same visual novel over and over - to challenge it to win the game).

Now I use only miqu/mixtral, and the context at its max of 64k is more than enough for any possible application for me! (It could probably be extended further to 128k right away by lifting that artificial limit.)

MoonRide303 commented 6 months ago

To verify that context shifting works properly, it would be best to add a proper test case for it - like a 1M+ token conversation (StreamingLLM with attention sinks can do that - tested by MIT on conversations of 4M+ tokens). If implemented right, models should be able to keep predicting tokens reasonably (low and steady perplexity, memory usage, and compute).

It should work even with small contexts like 4k, a bit like how we talk as humans (we don't remember every single word, but rather focus our attention on the most relevant information and continuously discard the rest as conversations get longer).

aleksusklim commented 6 months ago

The last time I tried to make an external test case it failed because of regenerations: https://github.com/LostRuins/koboldcpp/issues/445#issuecomment-1830377096 and up.

Maybe it's worth checking it again, but as I said – I don't see reasons to use context shifting anymore because now we have models with very long contexts.

MoonRide303 commented 6 months ago

Those are not mutually exclusive things. IMO we should have both large contexts AND the ability to have infinite conversations - like we sometimes have with other humans, where a single turn of the conversation could be a pretty long letter (like 10-20 pages of text).

Currently it feels really bad when you have a long and interesting conversation and then the model collapses (predicts crap tokens, gets stuck and repeats, etc.).

Please also mind that a lot of people use consumer-grade cards with just 16GB of VRAM, or even less. That means all they can usually run are quantized 7B to 13B models, with much smaller contexts (8k to 32k for finetunes of Mistral).

aleksusklim commented 6 months ago

you have a long and interesting conversation and then the model collapses

You mean because of broken shifting, or without shifting at all? For old models (llama2) you either hit the limit and regenerate, or set your context high (16k) and get really bad output right away, even before you'd fill it.

with just 16GB of VRAM

I do not use GPU offloading anymore: https://github.com/LostRuins/koboldcpp/issues/737#issuecomment-2000135901 Between "the model will be fast but dumb" (small "b" of parameters or small "k" of context length) and "the model will be smart but slow" I choose the latter.

Given speeds like that (1-2 tokens per second for 64k with the largest Mixtral, slowing further as you fill the context), it is unfeasible to test context shifting for real. Thus we have to prove it works at least for smaller models and small contexts (for which it was easier to erroneously trigger a full re-evaluation). Maybe it's worth another try; I will run my experiment again to see how v1.62.1 performs currently…

Without the ability to save and load contexts at will, it would be hard to pinpoint any bugs found, though, because you might not reproduce them again in the next run.

h3ndrik commented 5 months ago

Do you really think just "keeping 4 tokens from the start" would not cause degradation of quality?

Yes, I think that's one of the main findings of the Streaming-LLM paper.

Sure. The context gets shifted out and dropped and becomes unavailable to the model. But it would be nice if the model continued to generate legible text.

I don't see reasons to use context shifting anymore because now we have models with very long contexts.

I'd like to have a chatbot and keep talking to it, and to do storywriting (novels). Even 16k (or 32k) is finite, and I hit that at some point; ContextShift is super useful for that. Also, I don't have an infinite amount of RAM for a super large KV cache. I wouldn't want to return to the times when, each time I hit the context limit, I'd need to wait several minutes for each subsequent reply.

h3ndrik commented 3 months ago

I'm going to close this now. I'm not sure if it's solved, but I haven't encountered this bug for quite some time. Either it's gotten better or my usage pattern has changed. Anyway, thanks for the great software.