SimplyCorbett opened this issue 3 months ago
It's possible that the structure of the story makes it hard to match the correct tokens to be removed for context shifting.
Does this happen on all stories, or only this specific one? Try a different story?
Does your story have many looping or duplicated sequences that could cause an unwanted match?
Lastly, you can try disabling context shifting from the GUI or with --noshift. Can you see if that fixes the issue?
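For background, context shifting tries to re-use the existing KV cache when the context fills up: it keeps the shared prefix, drops a chunk of old tokens, and lines the rest of the new prompt up against what is already cached instead of reprocessing everything. Very roughly, the matching step looks something like this (a simplified sketch for illustration, not the actual koboldcpp code):

```python
# Simplified sketch of the context-shift idea (illustration only, not the
# real koboldcpp implementation).

def plan_context_shift(cached, incoming):
    """Decide how much of the cached context can be kept or re-used."""
    # 1. Keep the longest shared prefix (memory / system prompt usually lives here).
    prefix = 0
    while (prefix < min(len(cached), len(incoming))
           and cached[prefix] == incoming[prefix]):
        prefix += 1

    # 2. Find where the rest of the incoming prompt lines up with the cache
    #    again, so that span can be re-used instead of re-evaluated.
    for start in range(prefix, len(cached)):
        reused = 0
        while (start + reused < len(cached)
               and prefix + reused < len(incoming)
               and cached[start + reused] == incoming[prefix + reused]):
            reused += 1
        if reused:
            # cached[prefix:start] gets dropped, cached[start:start+reused] is kept.
            return prefix, start - prefix, reused

    # No overlap found: everything after the prefix has to be reprocessed.
    return prefix, len(cached) - prefix, 0
```

If the story contains many near-identical lines, that kind of search can lock onto the wrong occurrence, and from then on the cache no longer matches what the frontend actually sent.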
Yes, using --noshift fixes the problem. The model is completely coherent. No issues at all. No repetition. Clear statements.
Meanwhile, without --noshift, this is the second reply:
“A figure stands in the doorway, a figure stands in the doorway, a figure stands in the doorway, a figure stands in the doorway, a figure stands in the doorway”… about 50 lines of this.
So like I said, how's the format of the story? For points 1 and 2 - did you try a different story, and did it have many avenues for repetition?
I started using an Nvidia rig instead of the Mac and all of my issues with context shifting vanished. Though I did start a new story. I'm 250 replies into the story, have had 0 issues, and have not been forced to regenerate a reply once.
For comparison the Mac would have broken down by now and forced me to regenerate replies.
I just use sillytavern. Context template is default. Presets are Alpaca. Tokenizer is API.
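For reference, the Alpaca preset wraps every turn in the same instruction/response headers, roughly like this (illustrative only, not the exact SillyTavern template strings):

```python
# Rough shape of one Alpaca-formatted turn (illustrative only; the exact
# SillyTavern template strings may differ slightly).
ALPACA_TURN = (
    "### Instruction:\n"
    "{user_message}\n\n"
    "### Response:\n"
    "{char_reply}\n\n"
)
```

So every reply in the story is preceded by the same header sequences, if that matters for the matching.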
Nevermind, the 64k context finally filled up after 450 replies. Koboldcpp is spitting out repeating answers that require regenerating the response. Context shifting still seems bugged.
I'm currently using chat completion but have noticed the same thing on text completion.
So I tried loading up the chat in llamacpp, and while the answers were better I still had issues with the chat making replies to things that happened 5-6 replies prior. Stuff like that. Incoherent mess.
I had changed nothing in the chat (in terms of prompts, etc.). Just replying like normal on the latest sillytavern. Once it hit the full (or near-full) context length, things just went pear-shaped. It worked completely fine up until then.
Then I tried lowering the context length from 64k to 40k; same issue.
I’m not sure how to debug this. It could be a problem with sillytavern for all I know. I’m not sure where to start.
When you say "repeating answers", could it be that your rep pen is just too low? What sampler settings are you using?
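For example, the sampler fields in the request your frontend sends look roughly like this (field names from the KoboldAI /api/v1/generate endpoint; the values below are placeholders, not recommendations). rep_pen and rep_pen_range are the ones to check first:

```python
# Illustrative generate request to the koboldcpp API; values are placeholders.
import requests

payload = {
    "prompt": "...",            # the story text SillyTavern sends
    "max_context_length": 65536,
    "max_length": 300,
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 40,
    "rep_pen": 1.1,             # repetition penalty; too low can cause loops
    "rep_pen_range": 1024,      # how far back the penalty applies
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(r.json())
```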
I guess? But it only breaks down when the context is almost or completely full.
I’ve tried various settings. If you need the exact samplers I will provide them when able.
The LLM responses stop making sense. They repeat things from 3-5 replies prior or outright hallucinate entirely out of context.
There may be four or five words changed in the reply and the rest is copy/paste from before.
But I can replicate this on llamacpp as well. Strangely, EXL2 software does not have the problem.
Yes, if you could copy and paste the output of the terminal containing a generation request and response, it would be helpful.
TBH I primarily use AI for RP/ERP. I'm not really that comfortable with sharing the text output of the story because of this.
Anyway, I can replicate this issue on windows, linux and mac. I can replicate it with different hardware as well. In all cases I am putting all layers onto the GPU, none of it is shared with the CPU / system RAM.
I'm not sure if you RP on koboldcpp/sillytavern, but I'd assume if you did you would be able to replicate this after 200-600 replies. I have experienced this issue on every single long story I make with long context sizes.
Since I can replicate this behavior on llamacpp as well, I'm going to assume it's not specifically a koboldcpp problem.
I have been extensively testing EXL2 formats and experience none of these issues on it.
Then I would suspect it to be a model issue instead. Perhaps this model simply does not support such long context sizes.
That was my assumption until I loaded up nemo-base-2407 (and a fine-tune of it), which supports 128k context, and ran into this issue with 20k, 40k, and 64k context.
The story works flawlessly until you hit the context limit and/or reload the character card.
I can confirm this happens to me too (model: Llama3-70B-Euryale, Win10), but I'm not sure whether it's caused by Tavern or Kobold.
Context shifting seems to work better with regular chat than with RP in Tavern, which often changes the context dramatically. So I'd rather turn the shift off.
Hi!
I have an M2 Max Studio. I'm using Big Tiger (which doesn't support flash attention) with 24576 context. I am on 1.73.1.
The roleplay story has 1700 replies.
When generating the initial response the output is mostly coherent and makes sense. Subsequent responses are okay for 1-3 replies and then the model starts making replies to replies from 2-7 responses ago, acting like the last reply never happened.
Regenerating a reply completely breaks it.
If I load a new story in sillytavern and then go back to the old story (forcing the model to parse all the context again), the reply is mostly coherent and normal.
Then the process repeats.
Any ideas?