LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

running a 70b model #858

Open mercurial-moon opened 1 month ago

mercurial-moon commented 1 month ago

Hi, are there any special settings for running large models (70B+ parameters) on a PC that is low on memory and VRAM?

PC memory - 32GB
VRAM - 12GB
Model quantization - 5-bit k-quants (K_M suffix)
Model parameters - 70B

I tried it with the regular (non-CUDA) KoboldCpp build, and it showed close to 99% memory usage and high disk usage. The model file is saved on an SSD. After generating a few tokens (10-20) it just froze.

I'm sure the output would be slow, maybe < 0.5 tokens/sec, but I'm just wondering if there is a way to get it to work by tweaking some settings in KoboldCpp.

LostRuins commented 1 month ago

You will struggle to load such a big model in 32GB of RAM. Ideally, you'd want at least 64GB to do a partial offload for it, to avoid hitting swap.

First, try switching to a 70B Q3_K_S quant.
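As a rough back-of-envelope check of why the quant level matters here (the bits-per-weight figures below are approximations of typical GGUF file sizes, not exact numbers):

```python
# Rough sizing sketch for a 70B model at different k-quant levels.
# Bits-per-weight values are assumed approximations of typical GGUF files;
# real files vary by a few GB.

PARAMS = 70e9  # 70B parameters

BITS_PER_WEIGHT = {
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_S": 3.5,
}

for quant, bpw in BITS_PER_WEIGHT.items():
    weights_gb = PARAMS * bpw / 8 / 1e9
    # add a rough margin for KV cache, scratch buffers and the OS;
    # the real overhead depends on context size and backend
    total_gb = weights_gb + 4
    print(f"{quant}: ~{weights_gb:.0f} GB weights, ~{total_gb:.0f} GB total")
```

On those assumptions, a Q5_K_M 70B needs roughly 50 GB for the weights alone, while Q3_K_S is closer to 31 GB, which is at least in the same ballpark as 32GB RAM plus 12GB VRAM.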

Then you can try disabling mmap, and offload as many layers to the GPU as you can before it goes OOM.
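For picking a starting layer count to offload, here is a minimal sketch, assuming a Llama-2-style 70B with 80 transformer layers and an evenly split weight size; the 2 GB VRAM headroom is a guess, so start lower and raise it until just before it OOMs:

```python
# Rough estimate of how many layers fit on a 12GB GPU.
# Assumptions: ~31 GB of Q3_K_S weights, 80 layers (Llama-2 70B),
# and ~2 GB of VRAM kept free for KV cache / compute buffers.

MODEL_GB = 31
N_LAYERS = 80
VRAM_GB = 12
VRAM_RESERVE_GB = 2

per_layer_gb = MODEL_GB / N_LAYERS
layers_on_gpu = int((VRAM_GB - VRAM_RESERVE_GB) / per_layer_gb)
print(f"~{per_layer_gb:.2f} GB per layer, try --gpulayers {min(layers_on_gpu, N_LAYERS)}")
```

This is only a starting point; the right number also depends on your context size and how much VRAM the desktop itself is already using.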