LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

(Fixed in version 1.68) High RAM usage while loading Llama3 #817

FrostyMisa commented 2 months ago

Koboldcpp 1.64, Hardware: Steam Deck

I was using Koboldcpp on the Steam Deck LCD before with Vulkan, and it worked fast and great. After Llama3 came out I downloaded the latest Koboldcpp version, and even though I use a smaller Llama3 model, it looks like it takes about twice as much system RAM as, for example, a Mistral model.

I will provide both logs in case you can find something. I can load a Mistral 7B Q6 model without problems, but I have problems running Llama3 8B Q3_K_M. At its spike it takes around 15 GB.

So is it loading Llama3 models differently, or is Vulkan just not optimized for Llama3 yet? Or is it something else?

https://wormhole.app/KvKBP#HPyt1TRWVKTwYT_uvQVKkw
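As an aside on how a backend can end up holding far more than the model's file size in host RAM: a loader that `read()`s the whole file keeps a full in-process copy, while an mmap-based loader only pages data in as it is touched. This is a minimal, self-contained sketch (not koboldcpp's actual loader code) showing the difference via peak resident set size:

```python
import mmap
import os
import resource
import tempfile

def peak_rss_kib():
    """Peak resident set size (KiB on Linux; macOS reports bytes)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Create a 64 MiB dummy "model file" in small chunks so the write
# itself never holds the whole payload in memory at once.
chunk = b"\0" * (1024 * 1024)
with tempfile.NamedTemporaryFile(delete=False) as f:
    for _ in range(64):
        f.write(chunk)
    path = f.name

base = peak_rss_kib()

with open(path, "rb") as f:
    # mmap maps the file without copying it: pages only enter RSS
    # when touched, so peak memory barely moves here.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    grown_mmap = peak_rss_kib() - base
    mm.close()

    # read() materialises a full second copy in process memory,
    # growing the footprint by roughly the file's size.
    f.seek(0)
    data = f.read()
    grown_read = peak_rss_kib() - base

os.remove(path)
print(f"mmap growth: {grown_mmap} KiB, read growth: {grown_read} KiB")
```

A backend that stages the entire model through a host-side buffer before uploading it behaves like the `read()` path, which is consistent with the temporary spikes reported here.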

LostRuins commented 2 months ago

Actually, according to your logs above, the Mistral model is using more memory than the Llama 3 model. 15 GB seems a bit much for a Q3 8B.

FrostyMisa commented 2 months ago

According to the log, yes, but you can see in the system monitor photos here that Llama3 (the smaller one) spikes into the swap file. OpenBLAS doesn't have this problem, and there the RAM usage is normal for the model size. So it must be something with Vulkan.

FrostyMisa commented 2 months ago

I found someone writing about Vulkan on Reddit, so I tried version 1.61.2, and there it works as expected, without the spikes.

Someone in the thread mentioned this: "We know 1.61 is the last version Vulkan works correctly on; it's because of a regression in Vulkan upstream that Occam didn't have time to submit his fixes for yet, since it's tied to MoE support. It will eventually be fixed; for now it's better to keep using 1.61 until you notice that we support MoE for Vulkan."

But it's definitely something introduced between that version and 1.63/1.64. Those are the two I tested, and both have problems loading the Llama3 model.

Here is a photo of RAM usage in 1.61. As you can see, RAM usage is normal, with no spikes like in the photos I provided before for the latest version.

henk717 commented 1 month ago

That Reddit comment was by me. 1.65 will have the incoherency issues I was referencing fixed, but we only discovered the Llama3 memory error recently, so that will remain until Occam finds a solution for it separately.

FrostyMisa commented 1 month ago

I can confirm that even 1.65 has this RAM spike problem with a Llama3 model under Vulkan. So I will wait for a fix and report back when it works again. Thanks, guys, for your hard work making Koboldcpp great!

MadLightTheDoggo commented 1 month ago

Yeah, I'm having exactly the same problem with any version above 1.61.2. In fact, on that one I can easily launch Mixtral 8x7B Q4_K_M with 8192 context within my 32 GB of memory, but on anything higher it fills all the memory and begins to spill out onto the disk. Because of that I can't tell exactly how much more memory it eats, but if the disk usage is any indicator, I would say at least 10 GB more. I thought that maybe it was fixed in recent builds, but no.
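For a sense of scale on why 8192 context normally fits: a rough back-of-the-envelope estimate (not koboldcpp's actual accounting) of the fp16 KV cache for a Mixtral-style model with grouped-query attention is:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Rough KV cache size: K and V tensors per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Illustrative Mixtral 8x7B shape assumptions:
# 32 layers, 8 KV heads, head dimension 128, 8192-token context.
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.1f} GiB")  # → 1.0 GiB
```

Under these assumptions the cache is only about 1 GiB on top of the roughly 26 GB Q4_K_M weights, so the many extra gigabytes reported here point at the loader duplicating weight data rather than at context overhead.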

FrostyMisa commented 2 weeks ago

I want to report that today's version, 1.68, fixed the problem with the high RAM spikes! It now works lightning fast and RAM usage is normal, without any spikes. Steam Deck, Vulkan.

Thank you very much all of you behind KoboldCPP🙏