LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Koboldcpp 1.54 regression #602

Closed. Mintberry1 closed this issue 7 months ago.

Mintberry1 commented 10 months ago

HW: AMD 7800X3D, RTX 4090, 64 GB RAM, Windows 11

I am using TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF + SillyTavern (latest version), with 26 layers offloaded and 15 CPU threads. Total VRAM used: 23132.75 MiB (model: 20347.44 MiB, context: 2785.32 MiB).

(I had slow performance with 1.52 on Mixtral, before the Mixtral fixes were added.) Startup and speed improved with 1.53 and were very good. Since 1.54 I have problems: RAM usage is higher (62 GB right from startup), which makes the system laggy and less performant. Today I started a chat with SillyTavern, and after some messages the system froze at intervals (the mouse stuttered); I closed koboldcpp and the mouse kept freezing. I restarted and reverted to 1.53: character cards and existing chats load faster, performance is better, and RAM usage is back down (40 GB instead of 62). It seems something in 1.54 is causing a RAM overflow or something like it.

LostRuins commented 10 months ago

Yeah, the CUDA virtual memory pool is kind of finicky. There was a recent change to the way pooled memory is handled for cuBLAS. I will see what I can do.

Did you run it with lowvram? Can you try that?
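
For context, a minimal sketch of the two allocation strategies being contrasted here: plain `cudaMalloc`/`cudaFree` versus the stream-ordered pooled allocator, where freed memory stays cached in the pool up to a release threshold. This is only an illustration using the CUDA runtime API, not koboldcpp's actual pool code (which lives in the ggml CUDA backend); the sizes and threshold are made up.

```cpp
// Sketch only: contrasts plain vs pooled CUDA allocation. Build with nvcc.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    const size_t bytes = 256ull << 20;  // 256 MiB, arbitrary example size

    // Plain allocation: synchronous, memory goes back to the driver on free.
    void *a = nullptr;
    cudaMalloc(&a, bytes);
    cudaFree(a);

    // Pooled (stream-ordered) allocation: frees are cached in the device's
    // default memory pool and reused. Faster to reallocate, but a high
    // release threshold can keep a lot of memory reserved on the device.
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);
    uint64_t threshold = 64ull << 20;  // return cached memory above 64 MiB
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    void *b = nullptr;
    cudaMallocAsync(&b, bytes, stream);
    cudaFreeAsync(b, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    printf("done\n");
    return 0;
}
```

The trade-off is the one the thread is circling around: pooling speeds up repeated allocations during inference, but can hold on to memory the rest of the system needs.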

LostRuins commented 10 months ago

Meanwhile, can I get your launch parameters?

Mintberry1 commented 10 months ago

Sorry, I was very busy the last few days. No, I didn't use lowvram; I've attached my settings files. I can test the lowvram option, but yesterday I used 1.53 for hours with the same character card and SillyTavern setup, and the difference is night and day. Interestingly, even after I closed 1.54 as I wrote before, the mouse cursor was still hanging, which suggests an OS-level buffer overflow or something similar occurred. Overall performance was worse. With 1.53 I get an okay 6-10 T/s on a relatively short prompt with full context reprocessing; 1.52 took much longer. Considering that Mixtral Instruct, with its 32k context, is now the go-to model for many roleplayers, I hope performance will improve and that testing on this model will be prioritized. (Whatever component of koboldcpp changed in 1.54 is unfortunately going in the wrong direction.)

Thank you though, LostRuins, for your amazing software. I wish for you to be blessed as much as you have blessed us with your amazing program! Though invested in gaming, I never thought before that I could converse with my PC locally. (Server farms like ChatGPT aren't a "miracle", considering their "limitless" computing power.) It's almost like a loyal dog you've had as a pet for years suddenly starting to speak with you! (^_^)

MixtralQ4Kobold.zip

LostRuins commented 10 months ago

Can you try the latest release and see if the issue is solved? I switched the CUDA malloc pool back.

Mintberry1 commented 10 months ago

Sorry, I was very busy with work. I am testing now: first prompt processing of an existing SillyTavern chat with 13047 tokens of context.

- v1.55.1: 281.5 s first load (4 min 41 s), next reply at 9.48 T/s
- v1.54: 1932.9 s first load (32 min 12 s), next reply at 9.87 T/s
- v1.53: 1359.1 s first load (22 min 39 s); forgot to test the next reply, but I think around 9 T/s again

Big improvement with 1.55.1, after a regression of roughly 50% from 1.53 to 1.54. In my testing I also hit "Sometimes, editing the latest 200 tokens of text -> resamples all 3000 tokens of context. #614" (https://github.com/LostRuins/koboldcpp/issues/614): when I regenerated SillyTavern responses, sometimes after 3-4 tries it started reading all the context again (v1.53).
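
For context, the behavior behind #614 comes down to prompt-cache reuse: the new prompt is compared against the cached context, and only tokens after the first mismatch need to be re-evaluated. A minimal, hypothetical sketch of that check (not koboldcpp's actual implementation; `llama_token` is a stand-in for the real type):

```cpp
// Sketch of longest-common-prefix prompt reuse. If the frontend changes
// tokens near the start of the prompt (trimmed messages, timestamps, edits),
// the shared prefix shrinks and the whole context gets reprocessed.
#include <cstdio>
#include <vector>

using llama_token = int;  // stand-in for the real token type

// Number of leading tokens shared by the cached context and the new prompt.
size_t common_prefix(const std::vector<llama_token> &cached,
                     const std::vector<llama_token> &fresh) {
    size_t n = 0;
    while (n < cached.size() && n < fresh.size() && cached[n] == fresh[n]) {
        ++n;
    }
    return n;
}

int main() {
    std::vector<llama_token> cached = {1, 42, 7, 7, 99, 5, 3};
    std::vector<llama_token> fresh  = {1, 42, 7, 7, 99, 8, 4};  // edit near the end

    size_t keep = common_prefix(cached, fresh);
    printf("reusing %zu tokens, re-evaluating %zu\n", keep, fresh.size() - keep);
    return 0;
}
```

An edit near the end should only cost the edited tail; reprocessing all 3000 tokens suggests the prefix match is failing earlier than it should.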

notxkid commented 8 months ago

Still happening on version 1.59.1

LostRuins commented 8 months ago

If you're using multi-GPU, try toggling between the Row and Layer split modes.
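
For context, that toggle maps to the split mode in the underlying llama.cpp model parameters. A minimal sketch, assuming recent llama.cpp names (the enum and loader functions have been renamed across versions, and `model.gguf` is a placeholder path); full backend initialization is omitted:

```cpp
// Sketch of the Row vs Layer multi-GPU split setting in llama.cpp.
#include "llama.h"
#include <cstdio>

int main() {
    llama_model_params mp = llama_model_default_params();

    // Layer split: whole transformer layers are assigned to each GPU.
    mp.split_mode = LLAMA_SPLIT_MODE_LAYER;
    // Row split: each weight matrix is sliced across GPUs instead; more
    // inter-GPU traffic, but sometimes better load balance on matched cards.
    // mp.split_mode = LLAMA_SPLIT_MODE_ROW;

    llama_model *model = llama_load_model_from_file("model.gguf", mp);
    if (!model) { fprintf(stderr, "load failed\n"); return 1; }
    llama_free_model(model);
    return 0;
}
```

Which mode wins depends on the GPUs and the interconnect, which is why trying both is the usual advice.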