SillyTavern / SillyTavern

LLM Frontend for Power Users.
https://sillytavern.app
GNU Affero General Public License v3.0

[BUG] Koboldcpp 1.72 broken on 1.12.4 windows #2644

Closed Cypaloma closed 1 month ago

Cypaloma commented 3 months ago

Environment

🪟 Windows

System

AtlasOS Windows 10 VM

Version

1.12.4

Desktop Information

Koboldcpp with Mistral Nemo

Describe the problem

I posted yesterday about a problem on Reddit: "I'm having a really weird error: my SillyTavern will only display the PREVIOUSLY generated message when you send for a generation. Koboldcpp runs great, is returning the correct generation in the command line, and also works on its web GUI. Streaming is on and has been working, but now that this error has popped up I am not getting streaming, and the previously generated message pops in as the response after the new one is finished generating."

After a good chunk of hours troubleshooting, including wiping and completely reinstalling, it seems my environment does not work with the new build of Koboldcpp (1.72); it works again after reverting to 1.71. I know it's potentially a problem on their side, but their terminal shows the correct output and their web app works correctly, so I'm not actually sure. I'm not very familiar with making bug reports, so please let me know what I have missed and I will try my best to retrieve it for you. Also worth noting: after freshly starting the stack, it works properly for one message.

(The system is Arch on an Intel CPU with an RX 480, running an AtlasOS Windows 10 VM with a dedicated 3060 passed through, but I don't know if that's the culprit.)

Additional info

No response

Cohee1207 commented 3 months ago

@LostRuins is this something that you're aware of? Could it be something related to context shifting?

Cypaloma commented 3 months ago

I was using context shifting, if that's helpful.

LostRuins commented 3 months ago

Context shifting should have no impact on this.

@Cohee1207 I have just downloaded ST 1.12.4 and do not observe this issue. Are you able to repro it?

@MableMunroe could you provide more info on how to reproduce this issue? I have tried both with and without streaming, and they seem to be running fine both in ST and Lite. If possible, a screen capture demonstrating it would be helpful. Is the incorrect reply being streamed in token by token in a way that doesn't match the console output? You can see the tokens as they are generated with --debugmode.

LostRuins commented 3 months ago

If it works with Lite but not ST that is very unusual though. The backend is essentially the same, only difference is the way the REST API is used.

Also, if it triggers from the second message onwards, it's unlikely to be related to context shifting, as that usually only happens when the context gets filled up.

Cohee1207 commented 3 months ago

@MableMunroe do you use DRY sampler? Can we have a complete JSON of your API requests?

@LostRuins I got some reports that it might be connected to DRY sampler.

"Is anyone else having problems with DRY? It always stops streaming after the first or second message, and whenever I swipe, it displays a copy of the previous message. Kobold does generate a different output, which will get displayed on the next swipe. Some parts of the message also get deleted at the end."

"I had the same problem. Weirdly, changing the order of the sequence breakers somewhat fixed the streaming, from it never streaming to it only failing to stream occasionally."

LostRuins commented 3 months ago

I tried enabling DRY sampler on ST and although I don't get any 'previous' messages, I do seem to be getting verbatim repeats at times, in some select cases. I can't trigger it consistently, but I did get it to happen twice.

Running in debug mode, I notice that the logit values for specific tokens are somehow at 100%, with only 1 token candidate. That basically makes the output deterministic, resulting in the same response when attempted twice. Note that I'm using default sampling values other than enabling DRY.
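To illustrate why a single candidate at 100% makes the output deterministic, here is a minimal sketch of weighted token sampling. It is not KoboldCpp's sampler; the function name `sample_token` and the candidate lists are made up for illustration.

```python
import random


def sample_token(candidates, rng):
    """Pick one token from (token, probability) pairs by weighted random choice."""
    tokens = [token for token, _ in candidates]
    weights = [probability for _, probability in candidates]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Illustrative candidate lists, not taken from a real run.
single_candidate = [(" you", 1.0)]
several_candidates = [(" about", 0.45), (" how", 0.33), (" what", 0.15), (" if", 0.07)]

rng = random.Random(0)

# With only one candidate, every draw returns the same token.
single_draws = {sample_token(single_candidate, rng) for _ in range(200)}

# With several candidates, repeated draws produce different tokens.
varied_draws = {sample_token(several_candidates, rng) for _ in range(200)}
```

If every step of a generation has only one candidate, two attempts with the same prompt produce the same text regardless of the random seed.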

@pi6am may want to take a look at the DRY implementation. Also, are all stateful variables like dry_sequence_breakers and dry_repeat_count and dry_max_token_repeat correctly reset and fast forwarded for each generation?

Besides that, I am not sure. It almost seems to be an artifact of DRY sampling itself. Observe how the sentence structure nearly repeats

image image

But not once have I observed output that does not match what's shown in the KCPP console. So I suppose this is an unrelated thing and not actually the same issue.

LostRuins commented 3 months ago

@pi6am also possibly related https://github.com/LostRuins/koboldcpp/issues/1043

I think DRY should be considered an 'experimental' sampler and not enabled by default.

pi6am commented 3 months ago

"I tried enabling DRY sampler on ST and although I don't get any 'previous' messages, I do seem to be getting verbatim repeats at times, in some select cases. I can't trigger it consistently, but I did get it to happen twice."

I believe that Gemma 2 is prone to repetition with these settings. I tried the exact same model, card, settings, and first message, and got similar "almost repeated" output. But in this case, DRY is doing what it's supposed to be doing. Here's an example from near the end of the output:

DRY penalties [( you 0.58)]
Generating (44 / 50 tokens) [( you 100.00%)]
DRY penalties [(' 1.01)]
Generating (45 / 50 tokens) [(' 100.00%)]
DRY penalties [(ve 0.58) (re 1.78)]
Generating (46 / 50 tokens) [(re 100.00%)]
DRY penalties [( awake 1.01) ( safe 0.58) ( asking 3.11)]
Generating (47 / 50 tokens) [( asking 99.82%) ( talking 0.16%) ( wondering 0.02%)]
DRY penalties [( about 5.44)]
Generating (48 / 50 tokens) [( about 45.02%) ( how 32.68%) ( what 14.66%) ( if 3.91%)]
DRY penalties [( my 9.52)]
Generating (49 / 50 tokens) [( the 14.58%) ( how 45.28%) ( * 37.73%) ( MY 0.73%)]
Generating (50 / 50 tokens) [( mood 95.68%) ( state 2.31%) ( forest 1.07%) ( way 0.49%)]

The penalty keeps increasing, until eventually the penalty on ' my' is sufficient to cause the model to pick ' the' instead of continuing the repetition.
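The growth in the log above is roughly geometric: each penalty is about 1.75 times the previous one (0.58, 1.01, 1.78, 3.11, 5.44, 9.52). A minimal sketch of an exponential penalty of that shape follows. The function name, parameter names, and default values are assumptions for illustration and are not copied from the KoboldCpp source; only the growth factor of 1.75 is inferred from the log.

```python
def dry_penalty(match_length, multiplier=0.8, base=1.75, allowed_length=2):
    """Penalty for a token that would extend a repeated sequence.

    match_length is how many tokens of the repeated sequence have already
    been matched. All default values here are illustrative assumptions.
    """
    if match_length < allowed_length:
        # Short repeats are allowed without any penalty.
        return 0.0
    return multiplier * base ** (match_length - allowed_length)


# The penalty is multiplied by `base` each time the repetition grows by one token.
penalties = [dry_penalty(n) for n in range(1, 9)]
ratios = [later / earlier for earlier, later in zip(penalties[1:], penalties[2:])]
```

Because the penalty is subtracted from the logit of the repeating token, a long enough repetition will eventually push another candidate ahead, which is what the log shows at token 49.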

When I turned off DRY (leaving everything else the same) I got a perfect repeat. So while there may be bugs with DRY, I don't think it's responsible for an increase in repetition.

Going back to the original bug reports:

"Sillytavern will only display the PREVIOUSLY generated message when you send for a generation. Koboldcpp runs great, is returning the correct generation in command line, and also works on its web GUI. Streaming is on and has been working, but now that this error has popped up I am not getting streaming and the previously generated message pops in as the response after the new one is finished generating."

"It always stops streaming after the first or second message and whenever i swipe, it displays a copy of the previous message. Kobold does generate a different output which will get displayed on the next swipe. Some parts of the message also gets deleted at the end."

The reports are consistent about several things when the bug manifests, which all point to the bug being related to streaming:

  1. Streaming stops working.
  2. The koboldcpp console log shows that a new message was generated, but ST displays a copy of the previous message.
  3. This behavior persists on subsequent rerolls, with ST displaying the previously generated message.

In the same build that introduced DRY, I also fixed a bug that sometimes caused the last token of a response to not be streamed. There was also a change related to streaming of incomplete Unicode characters made around the same time. It's possible that one of these changes could cause streaming to occasionally break in SillyTavern. Since I've never seen this bug in KoboldAI Lite, I wonder if there's a difference in how SillyTavern and KoboldAI Lite handle SSE streaming that could account for the issue.
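For context on where SSE clients can differ, here is a minimal sketch of an incremental SSE parser that tolerates both events and multi-byte UTF-8 characters being split across network chunks. It is not taken from SillyTavern, KoboldAI Lite, or KoboldCpp; the class name `SSEParser` and the `{"token": ...}` payload shape are hypothetical.

```python
import codecs


class SSEParser:
    """Hypothetical incremental parser for a text/event-stream body."""

    def __init__(self) -> None:
        # The incremental decoder holds back the bytes of a partial
        # UTF-8 character until the rest of that character arrives.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""

    def feed(self, chunk: bytes) -> list[str]:
        """Consume one network chunk; return the data payloads of completed events."""
        self._buffer += self._decoder.decode(chunk)
        self._buffer = self._buffer.replace("\r\n", "\n")
        payloads = []
        # An event is complete only once a blank line has been received.
        while "\n\n" in self._buffer:
            raw_event, self._buffer = self._buffer.split("\n\n", 1)
            data_lines = [
                line[5:].removeprefix(" ")
                for line in raw_event.split("\n")
                if line.startswith("data:")
            ]
            if data_lines:
                payloads.append("\n".join(data_lines))
        return payloads


parser = SSEParser()
message = 'data: {"token": "h\u00e9"}\n\n'.encode("utf-8")
cut = message.index(b"\xc3") + 1      # split inside the two-byte character
first = parser.feed(message[:cut])    # no complete event yet
second = parser.feed(message[cut:])   # the rest of the event arrives
```

A client that decodes each chunk independently, or that assumes one chunk equals one event, would behave differently from this sketch when the server changes how it flushes partial characters or the final token.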

Cypaloma commented 3 months ago

Hi! I'm sorry, this has been a crazy week for me. I will try to reinstall 1.72 and send you configs ASAP!

I was using DRY sampling. I was using it with Mistral Nemo finetunes, possibly with ChatML if I remember correctly, and I'm positive that it was not generating the same output twice but rather different outputs every time, with the displayed output staggered by one, showing the previously generated output.

Here's a verbose description of the behavior that was consistent throughout a few hours of troubleshooting:

I sent the first call for generation, watching the streaming come in as generated in ST, with the Koboldcpp CLI in split screen. For the first call, the text streams in ST, and I watch the number of tokens generated go up in the CLI. After it's done, the Kobold CLI returns the generation in full. The generation matches the text that is present in the SillyTavern output.

The second message would fail consistently across restarts, reinstalls of both ST and koboldcpp, and with different GGUF models. It would not show a single token of streamed generation, even though the Kobold CLI shows tokens being generated. When the generation is completed, the koboldcpp CLI shows the correct, newly generated text.

The Sillytavern response though is whatever the previous generation was verbatim. Any further calls showed a proper generation in the kobold CLI with the correct input for the prompt (the incorrect, previous response as the last bit of the prompt) and a proper response, while sillytavern gets whatever the PREVIOUS generation was.

Hopefully that makes any sense whatsoever or is helpful, and I'm sorry I haven't been able to get back to that system yet! Thanks for the work on this project!

Cypaloma commented 3 months ago

I should also note that, though I could have missed it, there did not seem to be any obvious configuration errors, and there were no warnings or errors in either the SillyTavern CLI or the koboldcpp CLI. It's also potentially notable that it works in koboldcpp 1.71 with the exact same config, DRY sampling included.

LostRuins commented 3 months ago

You say 'error out'. What error do you get? Is it an error on the ST console?

LostRuins commented 2 months ago

Can you try KoboldCpp 1.73, which has been just released, and see if you still encounter any streaming issues?

a1270 commented 2 months ago

Kobold 1.73 CU12 on Win10, connecting to a Linux server running SillyTavern staging (https://github.com/SillyTavern/SillyTavern/commit/44e62156b7cfd5ed2d23673f341ae90cc142fa45), and the same problems from the 1.72 series persist for me. The first message gets generated fine, and all subsequent messages using ST no longer stream and will just repeat.

I have been hesitant to chime in because i haven't been able to get any useful information to help. But here goes with what i have found:

Using the KoboldAI Lite web ui works every time and i have noticed that after using that whatever seems to be getting stuck clears. Unfortunately it only clears after the second attempt and after that it starts repeating again.

Hopefully this will help explain what i am doing:

  1. Generate in ST. After the first reply, streaming breaks and repeating happens. The message repeats even when switching chars.
  2. Generate in the KoboldAI Lite web UI, where it works fine.
  3. Go back to ST, where the first generation is blank. I used the paper airplane button and a new user input.
  4. Regenerate using the swipe arrow, and a new response is generated. Streaming still doesn't work.
  5. At this point the message will repeat until steps 2-4 are redone.

EDIT: I did some more testing, including a new ST install (using the same commit) into which I copied just my preset, context/instruct, and a simple char. This preset seems to break streaming and cause a repeat: Rocinate.json

And I think it's the Sequence Breakers I pulled from the recommended defaults of some Nemo model. They had worked fine until 1.72, so I just defaulted to them for any custom preset.

["\n", ":", "\"", "'","*", "USER:", "ASSISTANT:", "2137:", "Dottore:", "Narrator:", "Zhur:", "Zhur", "<|im_start|>", "<|im_end|>", "<", "|", ">", "im", "end", "_", "start", "system", "USER", "ASSISTANT", "2137", "Dottore", "Narrator", "im_end", "im_start", "user", "assistant", "im_sep", "sep", "<|im_sep|>", "<|im_start|>user", "<|im_start|>assistant", "<|end|>", "_", "[INST]", "[/INST]", "[", "]", "INST"]

When I revert to the default of ["\n", ":", "\"", "*"] it works fine.
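To compare the size of the two sequence breaker lists, they can be parsed as JSON exactly as posted. This is only a small inspection helper; the function name `summarize` is made up, and removing duplicates is not suggested as a fix for the bug.

```python
import json

# Both lists are copied verbatim from the comment above as raw JSON text.
CUSTOM_BREAKERS = r'''["\n", ":", "\"", "'","*", "USER:", "ASSISTANT:", "2137:",
"Dottore:", "Narrator:", "Zhur:", "Zhur", "<|im_start|>", "<|im_end|>", "<", "|",
">", "im", "end", "_", "start", "system", "USER", "ASSISTANT", "2137", "Dottore",
"Narrator", "im_end", "im_start", "user", "assistant", "im_sep", "sep",
"<|im_sep|>", "<|im_start|>user", "<|im_start|>assistant", "<|end|>", "_",
"[INST]", "[/INST]", "[", "]", "INST"]'''
DEFAULT_BREAKERS = r'''["\n", ":", "\"", "*"]'''


def summarize(raw_json):
    """Return (total entries, unique entries, entries listed more than once)."""
    breakers = json.loads(raw_json)
    unique = list(dict.fromkeys(breakers))  # keeps first-seen order
    duplicates = [b for b in unique if breakers.count(b) > 1]
    return len(breakers), len(unique), duplicates


custom_summary = summarize(CUSTOM_BREAKERS)
default_summary = summarize(DEFAULT_BREAKERS)
print("custom:", custom_summary)
print("default:", default_summary)
```

The custom list is roughly ten times longer than the default one, and many of its entries are multi-character strings.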

LostRuins commented 2 months ago

Thank you for the config. I can repro the bug, and will be investigating further.

LostRuins commented 2 months ago

I think I have isolated the cause of this issue. It's another race condition, made worse by how long DRY takes to start generation when there are many sequence breakers. I'll do a patch release to try to fix this.
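The thread does not show the actual KoboldCpp code or the patch, so the following is only a generic, hypothetical sketch of how a slow initialization phase can expose a stale-result race of the kind the symptoms suggest. Every name here (`Backend`, `start_generation`, `read_result`, `run_once`) is invented for illustration.

```python
import threading


class Backend:
    """Hypothetical backend that keeps shared state between generations."""

    def __init__(self):
        self.text = "previous reply"
        self.finished = threading.Event()
        self.finished.set()  # the previous generation is complete

    def _worker(self, new_text, init_gate, reset_in_worker):
        init_gate.wait()  # stands in for a slow init phase
        if reset_in_worker:
            # Racy variant: the state is reset only after the slow init.
            self.finished.clear()
            self.text = ""
        self.text = new_text
        self.finished.set()

    def start_generation(self, new_text, init_gate, reset_early):
        if reset_early:
            # Safer variant: reset the shared state before returning to the caller.
            self.finished.clear()
            self.text = ""
        worker = threading.Thread(
            target=self._worker, args=(new_text, init_gate, not reset_early)
        )
        worker.start()
        return worker


def read_result(backend):
    """Reader that trusts the 'finished' flag, as a streaming endpoint might."""
    backend.finished.wait()
    return backend.text


def run_once(reset_early):
    backend = Backend()
    init_gate = threading.Event()
    worker = backend.start_generation("new reply", init_gate, reset_early)
    # The slow init phase ends 50 ms after the request was accepted.
    threading.Timer(0.05, init_gate.set).start()
    result = read_result(backend)
    worker.join()
    return result


racy_result = run_once(reset_early=False)
safe_result = run_once(reset_early=True)
print(racy_result, "|", safe_result)
```

In the racy variant the reader sees the leftover "finished" flag and returns the previous text without waiting, and a longer init phase makes that window wider. Resetting the state before the request is acknowledged closes the window in this sketch.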

LostRuins commented 2 months ago

Alright, I've just released KCPP 1.73.1, which should solve this issue. Can you please update and try again?

a1270 commented 2 months ago

I have done some testing with 1.73.1 CU12 and it seems to be working fine using the preset i provided. Streaming hasn't stopped working and the repeating hasn't happened at all. Not extensive testing but i didn't get it to trip up in the 30 messages and 4-5 char changes i did.

Thank you for fixing this.

LostRuins commented 2 months ago

Thanks. You can also try enabling DRY with lots of long sequence breakers like the above rociante config. That makes the init phase take much longer, which used to trigger this bug. But it should now work correctly in all cases.