LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with a KoboldAI UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Why is prompt processing so much slower with only a few layers offloaded vs. all of them? #737

Open krzysiekpodk opened 4 months ago

krzysiekpodk commented 4 months ago

Like in the title - it's almost 20 times faster. I was wondering if there would be a way to move layers from GPU to CPU and from CPU to GPU to process a very long prompt (i.e. over 32k), even if it would mean reloading the model a few times.

askmyteapot commented 4 months ago

A very simplified answer, for a machine with a 32GB-RAM CPU and a 24GB-VRAM GPU:

Few layers loaded

- CPU - 12GB (RAM at ~50GB/s)
- GPU - 14GB (VRAM at ~300GB/s)
- Minimal data over the PCIe bus (bus at ~16GB/s)
- Average memory speed is about 175GB/s

All layers loaded on GPU

- CPU - 2GB
- GPU - 24GB
- Lots of data over the PCIe bus
- Average memory speed is about 16GB/s
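Rough arithmetic behind those two averages (purely illustrative numbers, not measurements):

```python
# Effective bandwidth, weighted by how much of the model sits in each pool.
# Illustrative figures only: RAM ~50GB/s, VRAM ~300GB/s, PCIe ~16GB/s.

def weighted_bandwidth(sizes_gb, speeds_gbps):
    """Average speed weighted by how many GB live in each memory pool."""
    return sum(s * v for s, v in zip(sizes_gb, speeds_gbps)) / sum(sizes_gb)

# Partial offload: 12GB in system RAM, 14GB in VRAM, minimal PCIe traffic.
print(weighted_bandwidth([12, 14], [50, 300]))  # ~185 GB/s, ballpark of ~175

# Overflowed "all layers" case: data keeps crossing the ~16GB/s PCIe bus,
# so the bus, not RAM or VRAM, sets the effective ceiling.
print(weighted_bandwidth([26], [16]))           # 16 GB/s
```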

With the all-layers-loaded example (I'm assuming Windows, and that CUDA Sys Mem Fallback is not disabled, which is the new default), the model overflows into system memory over the PCIe bus, and that's why it's about 10x slower.

The fastest speed will be achieved through maximising the number of layers offloaded to the GPU without overflowing. I would recommend going into the NVIDIA control panel and disabling CUDA System Memory Fallback globally. This way it will just crash koboldcpp rather than slow down if you try to load more layers (and context) than can fit in the GPU.
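As a rough rule of thumb for picking that layer count, something like this back-of-envelope sketch works (a hypothetical helper with hypothetical numbers, not what koboldcpp actually computes; plug in your own model and card):

```python
# Hypothetical rule-of-thumb for choosing --gpulayers so nothing spills over
# the PCIe bus. All sizes are assumptions; substitute the ones for your model.

def max_gpu_layers(vram_gb, model_gb, n_layers, kv_cache_gb, overhead_gb=1.5):
    """Largest layer count that fits in VRAM alongside the KV cache and
    compute buffers, assuming all layers are roughly the same size."""
    per_layer_gb = model_gb / n_layers
    budget_gb = vram_gb - kv_cache_gb - overhead_gb
    return max(0, min(n_layers, int(budget_gb / per_layer_gb)))

# Example: 24GB card, 26GB quantized model with 32 layers, ~4GB of KV cache.
print(max_gpu_layers(vram_gb=24, model_gb=26, n_layers=32, kv_cache_gb=4))  # 22
```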

Hope that helps.

krzysiekpodk commented 4 months ago

Thanks for the reply!!! Actually, even for a Q4 Mixtral model (like Beyonder with 4 experts) that fully fits into a single GPU, prompt processing is so much slower if I don't offload all layers - but the entire KV cache is in the GPU anyway.

Today I also noticed that enabling MMQ gave a 50% boost, which also doesn't make any sense to me.

The use case I'm trying to look at is using GPU+CPU inference for very long prompts (i.e. more than 100k tokens).

exllamav2 is blazing fast with Q4 cache and I can fit 90k tokens in dual 3090s - prompt processing is at 500-1000 t/s, but it makes everything so hot that I can't stand being in the same room.

With just llama.cpp, prompt processing is dead slow if I even offload a single layer.

Kobold does something different, as it's at least a few times faster than text generation, but it's still only around 20 / 50 / 80 / 130 t/s max on prompt processing.

I understand that the more layers I offload, the faster PP is - I just don't get it, since the cache is fully on the GPU...

So I was thinking: what is the actual technical limitation here? If processing a 90k prompt takes 50 minutes, it would be worth exploring some alternative way to process the prompt - very big batches, moving layers to the GPU and back to the CPU, and so on (something like the sketch below)?

edit: full offload in Kobold will also give around 500 tokens per second for PP
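Conceptually, for the layer-shuffling idea above, I'm imagining something like this (a toy NumPy sketch only, not how llama.cpp or koboldcpp actually structure their compute):

```python
# Toy sketch of "stream layers through the GPU" prompt processing: push the
# whole prompt through one layer at a time, so each layer's weights cross the
# PCIe bus once per prompt instead of once per token. NumPy stands in for the
# real kernels; this is an illustration of the idea, nothing more.
import numpy as np

def stream_layers_over_prompt(layer_weights, prompt_embeddings):
    """layer_weights: list of (d, d) host arrays ("CPU RAM").
    prompt_embeddings: (n_tokens, d) activations for the entire prompt."""
    hidden = prompt_embeddings
    for w in layer_weights:
        w_on_gpu = w.copy()          # stand-in for a host-to-device transfer
        hidden = hidden @ w_on_gpu   # one big batched matmul per layer
        del w_on_gpu                 # free the "VRAM" before the next layer
    return hidden

# Toy usage: 4 "layers" of a 64-dim model over a 1000-token prompt.
layers = [np.random.randn(64, 64).astype(np.float32) for _ in range(4)]
prompt = np.random.randn(1000, 64).astype(np.float32)
print(stream_layers_over_prompt(layers, prompt).shape)  # (1000, 64)
```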

askmyteapot commented 4 months ago

From my experience, Mixtral 8x7B quantized to Q3_K_S (3.5BPW) at 8k context will fit into 1x 24GB GPU (with MMQ on) and have less than 1GB of free VRAM. As for multi-GPU, I'm afraid I don't have any experience offloading with koboldcpp.

MMQ will slightly reduce the amount of VRAM used compared to ordinary CUDA (but is a little slower on newer GPUs).

My guess is that you will need to tweak the number of layers on each GPU. Also, koboldcpp splits the KV cache equally across the layers; that behavior can be disabled with the lowvram toggle in the GUI.

Otherwise, I would suggest jumping onto the Discord group and asking the question there. There are others there who frequently use multi-GPU setups.

LostRuins commented 4 months ago

The KV cache is not fully on the GPU. For CUDA it has recently been refactored to use per-layer KV, so the amount of KV offloaded is proportional to the number of layers offloaded.
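To put rough numbers on that, here is a sketch using the standard per-layer KV-cache size formula and Mixtral 8x7B's published config (32 layers, 8 KV heads, head dim 128) as an example; these are estimates, not koboldcpp's exact accounting:

```python
# Per-layer KV cache sizing: offloading k of n layers puts roughly k/n of the
# KV cache in VRAM. Assumes an fp16 cache; estimates only.

def kv_bytes_per_layer(n_ctx, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * n_ctx * n_kv_heads * head_dim * bytes_per_elem  # K and V

n_layers, n_ctx, n_gpu_layers = 32, 32768, 20
per_layer = kv_bytes_per_layer(n_ctx)

print(f"KV per layer: {per_layer / 2**20:.0f} MiB")                     # ~128 MiB
print(f"KV total:     {n_layers * per_layer / 2**30:.1f} GiB")          # ~4.0 GiB
print(f"KV on GPU:    {n_gpu_layers * per_layer / 2**30:.1f} GiB "
      f"with {n_gpu_layers}/{n_layers} layers offloaded")               # ~2.5 GiB
```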

krzysiekpodk commented 4 months ago

@LostRuins I think I found the associated PR, and it looks like before that change llama.cpp prompt processing was super slow even if one layer was on the CPU, because the whole cache was on the CPU?

The code is quite complex there, but maybe I don't need to make any changes there, and I would do the following:

This would allow for very big context on CPU or CPU + GPU.

I recently got 8-channel RAM - it's worse than I anticipated, but it would be good enough for running Mixtral with very long prompts.

This would also allow me to keep the server at home, as the CPU generates almost no heat or noise; for prompt processing with a power limit it will spin the fans up to 60% for some time, but they quickly cool down again.

what do you think?

LostRuins commented 4 months ago

Hmm, I'm not sure I'd be able to do that correctly. You're welcome to attempt a PR if you like. But my advice would be for people to just remove a few layers from the GPU, and that should make it good to go.

aleksusklim commented 4 months ago

For me, with Mixtral 8x7B, using CuBLAS with 0 (zero) offloaded layers and no lowvram option gives the best speed! I have 128 GB of RAM and 12 GB of VRAM.

Even offloading a single layer of Mixtral decreases performance. Of course I didn't try offloading "all" here, because it takes up to 80 GB of RAM at 64k context, so it would never fit into my VRAM and would overflow into shared memory right away.

But even with zero layers offloaded, a lot of Shared GPU Memory is used! Which looks strange, since Dedicated Memory is almost empty. My shared memory never fills up when using other neural networks (image generation) unless dedicated memory is already almost full.

Still, CuBLAS with 0 layers is very good with Mixtral for me! The BLAS stage is fast. I haven't benchmarked the most recent koboldcpp version, though.

strikaco commented 3 months ago

> For me, with Mixtral 8x7B, using CuBLAS with 0 (zero) offloaded layers and no lowvram option gives the best speed! I have 128 GB of RAM and 12 GB of VRAM.
>
> Even offloading a single layer of Mixtral decreases performance. Of course I didn't try offloading "all" here, because it takes up to 80 GB of RAM at 64k context, so it would never fit into my VRAM and would overflow into shared memory right away.
>
> But even with zero layers offloaded, a lot of Shared GPU Memory is used! Which looks strange, since Dedicated Memory is almost empty. My shared memory never fills up when using other neural networks (image generation) unless dedicated memory is already almost full.

Same experience, but not limited to Mixtral: I'm alternating between Wizard-Vicuna-7B and Mixtral. The best performance I'm seeing for both is exactly as you reported, with CuBLAS enabled and no gpulayers offloaded. Despite low VRAM (a 4GB Quadro K2200 on nvidia-550) I'm not setting the lowvram flag either. This is for versions as recent as 1.61.1 and 1.61.2.

Offering my $0.02 to back up yours, since all the conventional advice directs everyone to offload as many layers as possible ("even just one!"), which only ever seems to make performance worse in my experience. I also get fewer OOM errors by omitting gpulayers.

LostRuins commented 3 months ago

A lot of performance issues come down to your setup, the GPU card you have, and the available VRAM. It is possible for a fast CPU to outperform a very lousy card, so it is best to find out by trial and error. I notice that to be especially the case with MoE models.