Closed: baphilia closed this 1 year ago
This is how it works currently. A copy of the weights is also stored in RAM because not all operations are performed in VRAM. Koboldcpp is primarily a CPU inference tool and works with regular RAM.
Oh, I see. I've heard people talking about partially offloading some layers to the GPU. Any idea how they're doing this (possibly not with koboldcpp)? Or did I just misunderstand, and even in those cases they're storing it all in system RAM, even though some of it is processed on the GPU?
At the moment, yes, it's stored in both. Some experiments with a full CUDA offload have been explored, but none have been mainlined as far as I know.
llama.cpp has not been keeping a copy of offloaded layers in RAM for some time now.
It doesn't work for me either. Such a shame, because with my GPU I think the speed could finally be acceptable. Oobabooga doesn't work properly for this purpose either. I guess we need to wait until there are other solutions or these codebases have improved.
Please forgive my ignorance; would it be possible to add documentation describing what the parameters for --useclblast do? I too am interested in leveraging my GPU, and compiled on Linux with CLBlast support, but it isn't clear to me how to use it.
Also, a possible enhancement for --gpu-layers might be an all or max parameter; I am running a 3090 with 24GB of VRAM on a 13B model, and I would like to use as much VRAM as needed.
Additionally, it isn't clear to me from looking at a model file (.bin / ggml) how many layers actually exist, that is, how many are available to load onto the GPU vs the CPU. Perhaps it could be listed in the CLI output?
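The loader output typically reports the layer count once the model loads. If you want to check beforehand, here's a minimal sketch that reads n_layer out of an old-style llama.cpp ggjt header; it assumes the classic LLaMA .bin layout (a uint32 magic, a uint32 version, then seven uint32 hparams), so treat it as illustrative rather than general:

```python
import struct
import sys

GGJT_MAGIC = 0x67676A74  # magic of 'ggjt' files as written by llama.cpp

with open(sys.argv[1], "rb") as f:
    magic, version = struct.unpack("<II", f.read(8))
    if magic != GGJT_MAGIC:
        sys.exit("not a ggjt file; this sketch only handles that layout")
    # hparams order: n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype
    n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype = struct.unpack(
        "<7I", f.read(28)
    )
    print(f"n_layer = {n_layer} (ggjt version {version})")
```

A 13B LLaMA model, for instance, should report n_layer = 40, which is roughly the number of layers available to offload.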
This has been implemented in the latest version. Storing full layers in VRAM instead of RAM is now possible.
@tensiondriven sorry for the late reply - right now there's not much documentation besides the --help flag shown when launching. --useclblast attempts to use your GPU for prompt processing; combined with --gpulayers, it allows you to offload parts of the model to VRAM.
--useclblast requires 2 parameters, which select the platform and device to use. Currently these can be found either with clinfo or via trial and error; just launching it will show you a list of all the available platforms and devices on your system.
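As an illustration, a launch might look something like this (the platform/device indices, layer count, and model filename are placeholders; substitute whatever clinfo or the startup listing reports for your system):

```
python koboldcpp.py --useclblast 0 0 --gpulayers 20 model.ggmlv3.q5_1.bin
```

Here 0 0 picks the first platform and first device, and --gpulayers 20 offloads 20 layers to VRAM; a higher layer count moves more of the model onto the GPU at the cost of more VRAM.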
Sorry, but can you explain this for me like I'm 5?
Launching with the --useclblast and --gpulayers flags makes your PC use less RAM and more graphics memory.
Thank you so much. Just tested it and it works perfectly.
Expected Behavior
I'm trying to load part or all of the model into VRAM. I don't really understand the difference between all the BLAS and CLBlast options. I just need to use some GPU memory when loading a model, but only part of it. I want to load models that will need both my VRAM and my system RAM.
Current Behavior
It all loads into system RAM. If the model takes up more system RAM than I have available, it just keeps using more and more until the PC hard-freezes.
Environment and Context
OS Name: Microsoft Windows 10 Home
OS Version: 10.0.19045 N/A Build 19045
CPU: Intel i7-6700
RAM: 32GB
GPU: GeForce RTX 3060, 12GB VRAM (latest drivers)
Python: 3.11.3
No make or g++
Failure Information (for bugs)
Steps to Reproduce
Here are examples of what I've tried (I've also tried all of these with WizardLM-30B-Uncensored.ggmlv3.q5_1.bin):
Failure Logs
This is an example output; it loads everything into system RAM: