LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with a KoboldAI UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Can't get it to load models into vram #197

Closed · baphilia closed this issue 1 year ago

baphilia commented 1 year ago

Expected Behavior

I'm trying to load part or all of the model into VRAM. I don't really understand the difference between all the BLAS options. I just need to use some GPU when loading a model, but only part of it. I want to load models that will require both my VRAM and my system RAM.

Current Behavior

It all loads into system RAM. If the model takes up more system RAM than I have available, it just keeps using more and more until the PC hard freezes.

Environment and Context

OS Name: Microsoft Windows 10 Home
OS Version: 10.0.19045 N/A Build 19045
CPU: Intel i7-6700
RAM: 32 GB
GPU: GeForce RTX 3060 with 12 GB VRAM (latest drivers)
Python 3.11.3
No make or g++ installed

Failure Information (for bugs)

Steps to Reproduce

Here are examples of what I've tried (I've also tried all of these with WizardLM-30B-Uncensored.ggmlv3.q5_1.bin):

koboldcpp.exe --gpulayers 30 --model ./pygmalion-6bv3-ggml-ggjt/pygmalion-6b-v3-ggml-ggjt-q4_0.bin
koboldcpp.exe --useclblast 0 0 --model ./pygmalion-6bv3-ggml-ggjt/pygmalion-6b-v3-ggml-ggjt-q4_0.bin
koboldcpp.exe --useclblast 0 0 --gpulayers 30 --model ./pygmalion-6bv3-ggml-ggjt/pygmalion-6b-v3-ggml-ggjt-q4_0.bin
koboldcpp.exe --gpulayers 10000 --model ./pygmalion-6bv3-ggml-ggjt/pygmalion-6b-v3-ggml-ggjt-q4_0.bin
koboldcpp.exe --gpulayers 10000 --model ./pygmalion-6bv3-ggml-ggjt/pygmalion-6b-v3-ggml-ggjt-q4_0.bin --smartcontext --useclblast 0 0
koboldcpp.exe --gpulayers 10000 --model ./pygmalion-6bv3-ggml-ggjt/pygmalion-6b-v3-ggml-ggjt-q4_0.bin --smartcontext --noblas

Failure Logs

This is an example of the output. It loads everything into system RAM:

Z:\KoboldAI2\models>koboldcpp.exe --useclblast 0 0 --gpulayers 30 --model ./pygmalion-6bv3-ggml-ggjt/pygmalion-6b-v3-ggml-ggjt-q4_0.bin
Welcome to KoboldCpp - Version 1.25
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.dll
==========
Loading model: Z:\KoboldAI2\models\pygmalion-6bv3-ggml-ggjt\pygmalion-6b-v3-ggml-ggjt-q4_0.bin
[Threads: 3, BlasThreads: 3, SmartContext: False]

---
Identified as GPT-J model: (ver 102)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
gptj_v2_model_load: loading model from 'Z:\KoboldAI2\models\pygmalion-6bv3-ggml-ggjt\pygmalion-6b-v3-ggml-ggjt-q4_0.bin' - please wait ...
gptj_v2_model_load: n_vocab = 50400
gptj_v2_model_load: n_ctx   = 2048
gptj_v2_model_load: n_embd  = 4096
gptj_v2_model_load: n_head  = 16
gptj_v2_model_load: n_layer = 28
gptj_v2_model_load: n_rot   = 64
gptj_v2_model_load: ftype   = 2
gptj_v2_model_load: qntvr   = 0
gptj_v2_model_load: ggml ctx size = 4505.52 MB

Initializing LEGACY CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: NVIDIA CUDA Device: NVIDIA GeForce RTX 3060
gptj_v2_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_v2_model_load: ................................... done
gptj_v2_model_load: model size =  3609.38 MB / num tensors = 285
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
LostRuins commented 1 year ago

This is how it works currently. A copy of the weights is also stored in RAM, because not all operations are performed in VRAM. Koboldcpp is primarily a CPU inference tool and runs from regular RAM.
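For a rough sense of scale, reading the numbers off the log above (an illustration only): the q4_0 weights are 3609.38 MB and the KV memory is 896.00 MB, which together account for the ~4505 MB ggml ctx size, so this 6B model needs roughly 4.5 GB of system RAM either way, and any layers offloaded with --gpulayers occupy VRAM in addition to that RAM copy.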

baphilia commented 1 year ago

This is how it works currently. A copy of the weights is also stored in RAM, because not all operations are performed in VRAM. Koboldcpp is primarily a CPU inference tool and runs from regular RAM.

Oh, I see. I've heard people talking about partially offloading some layers to the GPU. Any idea how they're doing this (possibly not with koboldcpp)? Did I just misunderstand, and even in those cases is everything stored in system RAM, even though some of it is processed on the GPU?

LostRuins commented 1 year ago

At the moment, yes, it's stored in both. Some experiments with CUDA for a full offload have been explored, but none have been mainlined as far as I know.

mirek190 commented 1 year ago

llama.cpp has not been keeping a copy of the offloaded layers in RAM for some time now.

NoMansPC commented 1 year ago

It doesn't work for me either. Such a shame, because with my GPU I think the speed could finally be acceptable. Oobabooga doesn't work properly for this purpose either. Guess we need to wait until there are other solutions and the code has improved.

tensiondriven commented 1 year ago

Please forgive my ignorance; would it be possible to add documentation describing what the parameters for --useclblast do? I too am interested in leveraging my GPU, and I compiled on Linux with CLBlast support, but it isn't clear to me how to make use of it.

Also, a possible enhancement for --gpulayers might be an "all" or "max" value; I am running a 3090 with 24 GB of VRAM on a 13B model, and I would like to use as much VRAM as needed.

Additionally, it isn't clear to me from looking at a model file (.bin / ggml) how many layers actually exist, that is, how many are available to load into GPU vs CPU. Perhaps it's listed in the CLI output?

LostRuins commented 1 year ago

This has been implemented in the latest version. Storing full layers in VRAM instead of RAM is now possible.

@tensiondriven sorry for the late reply - right now there's not much documentation besides the --help flag when launching. --useclblast attempts to use your GPU for prompt processing; combined with --gpulayers, it allows you to offload parts of the model to VRAM.

--useclblast requires 2 parameters, which select the platform and device to use. Currently these can be found either with clinfo or via trial and error; just launching it will print a list of all the available platforms and devices on your system.
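For example, a hypothetical invocation based on the setup earlier in this thread (the right indices and layer count depend on your machine, so treat this as a sketch):

koboldcpp.exe --useclblast 0 0 --gpulayers 28 --model ./pygmalion-6bv3-ggml-ggjt/pygmalion-6b-v3-ggml-ggjt-q4_0.bin

Here 0 0 selects the first platform and first device (run clinfo, or launch once and read the printed list, to see what you have), and 28 matches the n_layer = 28 reported in the load log above, i.e. every layer of this model; use a smaller number if you run out of VRAM.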

bbecausereasonss commented 1 year ago

Sorry but can you explain this for me like i'm 5?

LostRuins commented 1 year ago

Launching with the --useclblast flag and --gpulayers makes your PC use less RAM and more graphics memory.
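A minimal example, assuming a 12 GB card like the one earlier in this thread (the layer count is only a starting point to tune):

koboldcpp.exe --useclblast 0 0 --gpulayers 14 --model ./pygmalion-6bv3-ggml-ggjt/pygmalion-6b-v3-ggml-ggjt-q4_0.bin

Start with a modest --gpulayers value, watch graphics memory usage while the model loads, and raise it until your VRAM is nearly full.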

baphilia commented 1 year ago

This has been implemented in the latest version. Storing full layers in VRAM instead of RAM is now possible.

Thank you so much. Just tested it and it works perfectly.