LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Regenerating context/chat every 3-4 replies with SillyTavern & API #756

Closed · SimplyCorbett closed this issue 6 months ago

SimplyCorbett commented 6 months ago

I'm using the API in SillyTavern. I have a reverse proxy set up through nginx for the outside world.

With this setup, every 2-3 replies require reprocessing what appears to be the entire story, which can take a while.

This wasn't happening when using koboldcpp with SillyTavern over LAN.

I also enabled multiuser mode, so I'm going to assume the issue lies there.

Full command:

./koboldcpp.py --model /Volumes/fastarray/models/mixtral/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --launch --noblas --gpulayers 100 --contextsize 6144 --threads 11 --usemlock --quiet --password passwordhere --port porthere --host 127.0.0.1 --multiuser 3
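For context, a reverse proxy in front of koboldcpp for this kind of setup might look like the sketch below. This is an assumption about the configuration, not the poster's actual file; the server name and upstream port are placeholders. Disabling `proxy_buffering` matters for streamed generations, since nginx otherwise buffers the response:

```nginx
server {
    listen 443 ssl;
    server_name example.com;  # placeholder

    location / {
        # koboldcpp bound to localhost; port is a placeholder
        proxy_pass http://127.0.0.1:5001;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        # avoid buffering so streamed tokens reach the client promptly
        proxy_buffering off;
        # long generations can exceed the default 60s read timeout
        proxy_read_timeout 300s;
    }
}
```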

LostRuins commented 6 months ago

I don't think a proxy will affect regeneration behavior. That's entirely based on what's in the context. Did you resolve this issue?
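The reuse behavior described above can be illustrated with a toy prefix check. This is a hypothetical sketch, not koboldcpp's actual cache logic: the backend can skip reprocessing only the tokens that match the previously processed context from the start, so if the frontend trims or rewrites early messages (e.g. to fit the context window), nearly the whole prompt must be reprocessed:

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Count how many leading tokens match the previously processed context."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Context processed on the previous request (toy tokens).
cached = ["<s>", "User:", "Hello", "Bot:", "Hi", "User:", "Bye"]

# Appending a reply keeps the whole prefix reusable.
appended = cached + ["Bot:"]
print(common_prefix_len(cached, appended))  # 7: only the new token needs work

# Trimming an early message changes the prompt near the start,
# so almost everything must be reprocessed.
trimmed = ["<s>", "User:", "Bye"]
print(common_prefix_len(cached, trimmed))   # 2: most of the cache is useless
```

This is consistent with the reply above: the proxy only moves bytes, while reuse is decided entirely by whether the new prompt's leading tokens match what was processed before.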