LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Additional prompt processing during chat. #813

Open inspir3dArt opened 2 months ago

inspir3dArt commented 2 months ago

Hi,

since I updated koboldcpp today (Termux/Android), many more tokens are processed before the LLM starts to write a reply.

Previously, after processing the provided character card before the first response, koboldcpp only processed the tokens of my last message before replying. Now it processes between 300 and 500 tokens before every reply throughout the chat.

I usually write one or two sentences in my chat messages, which used to be processed really fast, and the LLM writes at a comfortable reading speed. But now I have to wait so long for a reply to start that it's no fun anymore.

Is there a way to fix that?

inspir3dArt commented 2 months ago

Looks like it's caused by using the (new?) Author's Note feature. I put a short instruction in there telling the LLM to reply within a range of words. I was really happy to find that feature, because for the first time this actually worked (I tried it via the system prompt section in JSON files before, but that never did the job consistently). Unfortunately it causes a lot of additional prompt processing. It would be really nice if that could be fixed or solved differently, like context shifting, which usually works really quickly.

LostRuins commented 2 months ago

If you use author's note then there will always be some reprocessing required. To reduce the amount, you can change the author note depth to strong.
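For context, the reprocessing cost roughly corresponds to every token after the first point where the new prompt differs from the previously cached one, so the closer to the end the note is injected, the shorter the tail that has to be re-evaluated. A minimal sketch of that idea (illustrative Python only, not koboldcpp's actual cache logic; the function name is made up):

```python
def tokens_to_reprocess(cached_tokens, new_tokens):
    """Count how many tokens must be re-evaluated for the new prompt.

    Only the longest shared prefix can be reused from the cache;
    everything after the first differing token is processed again.
    """
    common = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        common += 1
    return len(new_tokens) - common

# An author's note injected a few hundred tokens from the end moves every
# turn, so the shared prefix ends at the injection point and the whole
# tail (note plus the most recent chat) is counted here each time.
```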

inspir3dArt commented 2 months ago

Hi, thank you for your reply. I have tried the different options now.

The closest thing to a solution seems to be the "strong" Author's Note depth (putting it at the end of my message, like the "Stop sequences").

From what I can see in the terminal output, the note gets removed after the LLM reply is finished, which causes the previous LLM reply to be processed again. Isn't there a way to cut it out without needing to reprocess, or wouldn't it solve the problem to not remove it at all?

Edit: Or are there other options to motivate an LLM to write replies within a defined word or token range?

LostRuins commented 2 months ago

In that case, you can try adding the stuff to Memory instead of the author's note; Memory stays at a static position.
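Roughly speaking, Memory is prepended at the very top of the context, so it never moves between turns and the cached prefix stays reusable, whereas an author's note is re-inserted a fixed distance from the end and shifts every turn. A minimal sketch of the difference (illustrative Python only, not the real prompt builder; the names and the character-based depth are made up):

```python
def build_prompt_with_memory(memory, chat_history, new_message):
    # Memory sits at a fixed position at the very start of the context,
    # so the prefix shared with the previous turn only ever grows and
    # its cached state can be reused unchanged.
    return memory + "".join(chat_history) + new_message

def build_prompt_with_note(note, chat_history, new_message, depth=300):
    # The note is re-inserted a fixed distance from the end on every turn
    # (depth measured in characters here just for illustration), so its
    # position shifts as the chat grows and everything behind it has to
    # be reprocessed each time.
    text = "".join(chat_history) + new_message
    return text[:-depth] + note + text[-depth:]
```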