[Open] mercurial-moon opened this issue 1 month ago
Hi, are there any special settings for running large models (>70B parameters) on a PC low on memory and VRAM?

- PC memory: 32GB
- VRAM: 12GB
- Model quantization: 5-bit k-quant (Q5_K_M)
- Model parameters: 70B

I tried it with the regular KoboldCpp build (not the CUDA one), and it showed close to 99% memory usage and heavy HDD usage. The model file is saved on an SSD. After generating 10-20 tokens it just froze.

I'm sure the output would be slow, maybe under 0.5 tokens/sec, but I'm wondering if there is a way to get it to work by tweaking some settings in KoboldCpp.
You will struggle to load such a big model in 32GB of RAM. Ideally, you'd want at least 64GB to do a partial offload for it and avoid hitting swap.

First, try switching to a 70B Q3_K_S quant. Then try disabling mmap, and offload as many layers to the GPU as you can before it goes OOM.