LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Generation speed on high-performance graphics cards #455

Open Vladonai opened 1 year ago

Vladonai commented 1 year ago

I noticed that the model layers loaded into the graphics card's memory are now processed by the graphics card itself rather than the CPU, and that speeds up generation considerably. My graphics card is weak, though, with only 8GB of video memory, so it's hard for me to draw firm conclusions. Running the lightest quantization of Llama-2-70B, I get between 800 and 1000 milliseconds per token, which is right at the edge of comfortable. I'm interested in results at this model size from owners of powerful graphics cards with 24GB of video memory (RTX 3090, 4090). They could take such a model with 6-bit quantization and load almost all of its layers into video memory. How many tokens per second do they get when generating in that case?
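As a rough sanity check on the VRAM reasoning above, here is a back-of-the-envelope sketch. The file size, layer count, and reserved VRAM below are illustrative assumptions rather than measured values; it only shows the kind of arithmetic behind "how many layers fit on an 8GB vs. a 24GB card":

```python
# Back-of-the-envelope: how many layers of a quantized model might fit in a
# given VRAM budget. All figures here are illustrative assumptions.

def layers_that_fit(model_file_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many layers could be offloaded to the GPU.

    Assumes the weights are split roughly evenly across layers and reserves
    some VRAM for the KV cache, scratch buffers, and the desktop.
    """
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, int(usable_gb / per_layer_gb))

# Llama-2-70B has 80 layers; a small ~2-bit quant is very roughly ~29 GB on disk.
print(layers_that_fit(model_file_gb=29, n_layers=80, vram_gb=8))   # 8 GB card
print(layers_that_fit(model_file_gb=29, n_layers=80, vram_gb=24))  # 24 GB card
```

The estimated count corresponds to the layer-offload setting (--gpulayers in KoboldCpp); the real limit also depends on context size and the backend's buffers, so treat it as a starting point, not a guarantee.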

I believe it is the 70B models that have reached an acceptable quality level for local use, so these results matter for understanding how much power a modern home computer needs to run local neural networks today. The minimum acceptable setup is, of course, mine :), and I'd like to know what the upper end looks like.

LostRuins commented 1 year ago

For myself, I generally still stick to 13B models, as they have the most enjoyable speed-to-power ratio. 70B is still too slow for me haha.

Vladonai commented 1 year ago

Here's what I found on my question: https://www.reddit.com/r/LocalLLaMA/comments/16z3goq/llm_chatrp_comparisontest_dolphinmistral/

With this setup:

ASUS ProArt Z790 workstation with NVIDIA GeForce RTX 3090 (24 GB VRAM), Intel Core i9-13900K CPU @ 3.0-5.8 GHz (24 cores, 8 performance + 16 efficient, 32 threads), and 128 GB RAM (Kingston Fury Beast DDR5-6000 MHz @ 4800 MHz):

I get these speeds with KoboldCpp:

13B @ Q8_0 (40 layers + cache on GPU): Processing: 1ms/T, Generation: 39ms/T, Total: 17.2T/s

34B @ Q4_K_M (48/48 layers on GPU): Processing: 9ms/T, Generation: 96ms/T, Total: 3.7T/s

70B @ Q4_0 (40/80 layers on GPU): Processing: 21ms/T, Generation: 594ms/T, Total: 1.2T/s

180B @ Q2_K (20/80 layers on GPU): Processing: 60ms/T, Generation: 174ms/T, Total: 1.9T/s
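For reference, the ms/T figures convert to tokens per second as 1000 / (ms/T), and the overall "Total" rate is lower than the raw generation rate because prompt-processing time counts toward the wall-clock total. A small sketch of that conversion (the prompt and output token counts are assumptions picked for illustration, not values from the benchmark):

```python
# Convert the per-token timings above into throughput. The prompt and output
# token counts below are assumptions chosen for illustration, not measured.

def throughput(processing_ms_per_tok: float, generation_ms_per_tok: float,
               prompt_tokens: int, generated_tokens: int) -> tuple[float, float]:
    """Return (generation-only T/s, overall T/s including prompt processing)."""
    gen_tps = 1000.0 / generation_ms_per_tok
    total_s = (prompt_tokens * processing_ms_per_tok
               + generated_tokens * generation_ms_per_tok) / 1000.0
    return gen_tps, generated_tokens / total_s

# 13B @ Q8_0 entry above: 1 ms/T processing, 39 ms/T generation.
# With an assumed ~3000-token prompt and ~160 generated tokens, the overall
# rate drops from ~25.6 T/s (generation only) to roughly the reported ~17 T/s.
print(throughput(1, 39, prompt_tokens=3000, generated_tokens=160))
```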

I've now added a second 3090 to my setup and am still in the process of benchmarking, but I can get 4.1T/s with 70B @ Q4_0 now.

I'm also experimenting with ExLlama which has given me between 10 and 20 T/s with GPTQ models, but the quality seems lower.

The results for the 70B Q4_0 vs. the 180B Q2_K model are surprising.

VL4DST3R commented 1 year ago

> I'm also experimenting with ExLlama which has given me between 10 and 20 T/s with GPTQ models, but the quality seems lower.

You reckon the quality decrease is due to ExLlama itself or the way GPTQ models work/are quantised?

Vladonai commented 1 year ago

> > I'm also experimenting with ExLlama which has given me between 10 and 20 T/s with GPTQ models, but the quality seems lower.
>
> You reckon the quality decrease is due to ExLlama itself or the way GPTQ models work/are quantised?

I second that question. Of course this is a subjective assessment, but the increase in speed has to come at the expense of something. In any case, from a practical point of view, using ExLlama is unrealistic for me: it requires too much video memory. Koboldcpp is our everything! :)

VL4DST3R commented 1 year ago

I actually came here looking through the issues for info on ExLlama: whether it's something koboldcpp would ever support or benefit from, or whether it has ever been brought up at all. Maybe @LostRuins could chime in?