LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Create a field of "% of gpu layers to offload" #264

Closed: gab-luz closed this issue 1 year ago

gab-luz commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do.

I've just loaded koboldcpp. A field specifying the percentage of GPU layers to offload would let us indicate how much of the model we want offloaded without having to know the exact number of layers.

Current Behavior

It only allows me to specify the number of layers, which is not known in advance and seems to change from model to model.

Please provide a detailed written description of what llama.cpp did, instead.

I didn't use llama.cpp directly for this.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

$ lscpu
AMD Ryzen 5 PRO 5000 with Radeon Graphics

$ uname -a
linux

$ python3 --version
3.10.10

$ make --version
4.3

$ g++ --version
11.3.0

JHawkley commented 1 year ago

Why not specify the amount of VRAM to use for offloading and have the program calculate the number of layers that fits that budget? I get segfaults on my 16 GB AMD card if I use more than 12 GB, so I'm often having to tweak the --gpulayers value to stay within that budget, and the right value depends on the type and size of the model. It's trial and error to find the magic number you need to properly offload to the GPU within a specific memory budget.
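For illustration, here is a minimal sketch of that idea, not anything koboldcpp actually does: it guesses a --gpulayers value from a VRAM budget by assuming the weights in the GGUF file are split roughly evenly across layers and reserving a flat margin for everything else. The function name, the overhead figure, and the requirement to already know the model's layer count are all assumptions.

```python
import os

def estimate_gpu_layers(model_path: str, total_layers: int,
                        vram_budget_gb: float, overhead_gb: float = 1.5) -> int:
    """Hypothetical heuristic: guess a --gpulayers value for a VRAM budget."""
    model_bytes = os.path.getsize(model_path)            # GGUF weights on disk roughly match weights in memory
    per_layer_gb = (model_bytes / total_layers) / 2**30  # pretend all layers are the same size
    usable_gb = max(vram_budget_gb - overhead_gb, 0.0)   # keep headroom for KV cache / scratch buffers
    return min(total_layers, int(usable_gb // per_layer_gb))

# e.g. a ~7.9 GB 13B Q4 file with 41 layers and a 12 GB budget:
# print(estimate_gpu_layers("model.Q4_K_M.gguf", 41, 12.0))
```

Even as a starting point this only narrows the trial-and-error range; the real ceiling still has to be confirmed by loading the model.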

gab-luz commented 1 year ago

Yes, that would be a good solution. Is that solution available already? If not, please open an issue. I'll be following up.

LostRuins commented 1 year ago

It is difficult to get an accurate estimate, especially because different models have different layer sizes and different overheads as well. Additionally, prompt processing requires extra VRAM to be set aside for temporary buffers during GEMM. That's why upstream also requires users to manually specify the number of layers to offload. So if you have a better solution, it's worth following up at the llama.cpp repo first.
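To make the point concrete, here is some illustrative arithmetic with made-up numbers (none of these figures come from a real model): the VRAM actually consumed is more than "offloaded layers times layer size", which is why a simple percentage or budget field is hard to get right.

```python
# Hypothetical example values only.
layers_offloaded = 35
per_layer_gb     = 0.22   # varies with model size and quantization
kv_cache_gb      = 1.6    # grows with context length
scratch_gb       = 0.8    # temporary buffers used during prompt processing (GEMM)

total_gb = layers_offloaded * per_layer_gb + kv_cache_gb + scratch_gb
print(f"~{total_gb:.1f} GB VRAM needed")   # ~10.1 GB, not just the 7.7 GB of weights
```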

gab-luz commented 1 year ago

What if we have a config set for each model? On my machine, I've created a txt file with appropriate settings for each model so I don't have to think about how many layers to give it. If you don't think this is viable, I think you can safely close this issue.
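A minimal sketch of that "config per model" idea: a plain text file mapping model filenames to a known-good --gpulayers value. The file name and format here are made up for illustration; koboldcpp does not read such a file itself.

```python
# gpulayers.txt (hypothetical format):
#   mythomax-l2-13b.Q4_K_M.gguf = 35
#   llama-2-7b.Q5_K_M.gguf = 43

def load_layer_settings(path: str = "gpulayers.txt") -> dict[str, int]:
    """Read 'model filename = gpulayers' pairs from a plain text file."""
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, value = line.split("=", 1)
            settings[name.strip()] = int(value.strip())
    return settings
```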

LostRuins commented 1 year ago

Yeap, I think having settings for each model would be the best approach. Since Koboldcpp supports launcher args, you can include a .bat file with the recommended params when distributing model files.
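As a rough sketch of what such a wrapper could look like, building on the hypothetical gpulayers.txt format above: it looks up the recommended layer count for a model and launches koboldcpp with it, which is essentially what a distributed .bat file would hard-code. Flag names can vary between versions, so check `python koboldcpp.py --help` for what your build actually accepts.

```python
import os
import subprocess
import sys

def launch(model_path: str) -> None:
    """Launch koboldcpp with the per-model --gpulayers value from gpulayers.txt."""
    settings = load_layer_settings()                    # from the earlier sketch
    layers = settings.get(os.path.basename(model_path), 0)
    cmd = [sys.executable, "koboldcpp.py",
           "--model", model_path,
           "--gpulayers", str(layers)]
    subprocess.run(cmd, check=True)

# A distributed .bat file would amount to the same one-liner, e.g.:
#   python koboldcpp.py --model mythomax-l2-13b.Q4_K_M.gguf --gpulayers 35
```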