ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Avoid loading GPU layers into RAM before moving them to VRAM. This should allow the use of --no-mmap with models that do not fit in RAM but fit in RAM+VRAM. #9059

Open ThomasBaruzier opened 1 month ago

ThomasBaruzier commented 1 month ago


Feature Description

Hello,

I am currently working on running Llama 405B and DeepSeek Coder V2 on my setup, which includes 128GB of RAM and 24GB of VRAM.

To run these large models effectively, I need to avoid having weights paged in from disk during inference, as that severely impacts performance. This is why I am using the --no-mmap option.

The problem is that llama.cpp initially loads the entire model and its layers into RAM before offloading some layers into VRAM.

Given this, the largest models I can run without dipping into painfully slow token-per-minute territory are limited by my RAM capacity.

It would be highly beneficial if the --no-mmap option could be applied only to the layers that remain in RAM, or if the necessary layers could be directly loaded into VRAM.

With these modifications, we could load larger models and higher quantization levels with minimal speed loss, and avoid relying on disk caching when a model fits within the combined RAM + VRAM but not in RAM alone.

Here are the current speeds I achieve with Llama 3.1 405B Instruct, offloading the maximum number of layers for each:

| Quant | Size (MB) | Speed (tok/s) | --no-mmap |
|---------|-----------|---------------|-----------|
| IQ2_S   | 121,544   | 0.42          | Enabled   |
| IQ2_M   | 132,116   | 0.38          | Enabled   |
| IQ3_XXS | 150,407   | Crash         | Enabled   |
| IQ3_XXS | 150,407   | 0.02          | Disabled  |

Motivation

Being able to load larger models, or higher-quality quants, without relying on disk reads would greatly improve their speed.

Possible Implementation

Apply --no-mmap only to the layers that remain in RAM, and load the GPU-bound layers directly into VRAM, for example by streaming them through a small staging buffer as sketched below.
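
A minimal sketch of the direct-to-VRAM path for the GPU-bound layers, assuming a CUDA backend and plain file I/O; the function below is hypothetical and ignores llama.cpp's actual loader and backend abstraction:

```c
// Hypothetical sketch: stream a tensor from disk straight to VRAM through a
// small pinned staging buffer, so host RAM usage for this tensor stays
// bounded by CHUNK_SIZE instead of the full tensor size.
// Not llama.cpp's actual loader code.
#include <cuda_runtime.h>
#include <stdio.h>

#define CHUNK_SIZE (64u * 1024u * 1024u) /* 64 MiB staging buffer */

static int stream_tensor_to_vram(FILE *f, long file_offset, size_t n_bytes, void *dev_ptr) {
    void *staging = NULL;
    if (cudaHostAlloc(&staging, CHUNK_SIZE, cudaHostAllocDefault) != cudaSuccess) return -1;
    if (fseek(f, file_offset, SEEK_SET) != 0) { cudaFreeHost(staging); return -1; }

    size_t done = 0;
    while (done < n_bytes) {
        size_t chunk = (n_bytes - done) < CHUNK_SIZE ? (n_bytes - done) : CHUNK_SIZE;
        if (fread(staging, 1, chunk, f) != chunk) { cudaFreeHost(staging); return -1; }
        // Copy this chunk to its place in the device buffer; only CHUNK_SIZE
        // bytes of host RAM are ever resident for this tensor.
        if (cudaMemcpy((char *)dev_ptr + done, staging, chunk,
                       cudaMemcpyHostToDevice) != cudaSuccess) { cudaFreeHost(staging); return -1; }
        done += chunk;
    }
    cudaFreeHost(staging);
    return 0;
}
```

Layers that stay on the CPU would still be read into regular (non-mmapped) host buffers as --no-mmap does today; peak host memory for the offloaded layers would then be bounded by the staging buffer rather than their full size.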

slaren commented 1 month ago

> The problem is that llama.cpp initially loads the entire model and its layers into RAM before offloading some layers into VRAM.

This is not the case.

ThomasBaruzier commented 1 month ago

> > The problem is that llama.cpp initially loads the entire model and its layers into RAM before offloading some layers into VRAM.
>
> This is not the case.

Alright, I'm going to try again. I'll come back and edit my reply when I find something.

fairydreaming commented 4 weeks ago

@ThomasBaruzier What is the context size that you use? The larger the context size, the more memory the KV buffer uses; I'm not sure whether you take that into account in your memory usage calculations.
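
For a rough sense of scale, here is a back-of-the-envelope estimate of the KV buffer, assuming Llama 3.1 405B's published shape (126 layers, 8 KV heads, head dimension 128) and an F16 cache; these numbers are assumptions for illustration, not measurements:

```c
// Back-of-the-envelope KV cache size (assumed shape: 126 layers, 8 KV heads,
// head dim 128, F16 cache = 2 bytes/element). Not taken from llama.cpp code.
#include <stdio.h>

int main(void) {
    const long n_layer = 126, n_kv_head = 8, head_dim = 128;
    const long n_ctx = 1024, bytes_per_elem = 2; /* F16 */
    /* K and V each hold n_ctx * n_kv_head * head_dim elements per layer. */
    long kv_bytes = 2 * n_layer * n_ctx * n_kv_head * head_dim * bytes_per_elem;
    printf("KV cache ~ %.0f MiB\n", kv_bytes / (1024.0 * 1024.0));
    return 0;
}
```

Under these assumptions the KV buffer at a 1024-token context is on the order of half a GiB, small compared to the 120-150 GB of model weights in the table above.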

ThomasBaruzier commented 4 weeks ago

> What is the context size that you use?

Very small: --ctx-size of 1024, F16 KV cache, no flash attention (FA).

I don't know when I will have the time to retry, since every model load takes 10 minutes from a hard drive, but I'll get back to it soon enough.
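
For reference, the invocation is roughly of this form (the path, prompt, and -ngl value are placeholders; the F16 cache and disabled flash attention are simply the defaults):

```
./llama-cli -m /path/to/Llama-3.1-405B-Instruct-IQ2_S.gguf \
    --ctx-size 1024 -ngl 30 --no-mmap -p "Hello"
```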

WilliamTambellini commented 2 weeks ago

With mmap (default):

```
$ /usr/bin/time -v bin/llama-cli -m /llms/qwen2-1.5Bf16/qwen2-1_5b-instruct-fp16.gguf -ngl 99 -f /llms/test2.txt
...
llm_load_print_meta: model size       = 2.88 GiB (16.00 BPW)
...
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   445.12 MiB
llm_load_tensors:      CUDA0 buffer size =  2944.68 MiB
...
Maximum resident set size (kbytes): 4218296
```

Without mmap:

```
$ /usr/bin/time -v bin/llama-cli -m /llms/qwen2-1.5Bf16/qwen2-1_5b-instruct-fp16.gguf -ngl 99 -f /llms/test2.txt --no-mmap
...
Maximum resident set size (kbytes): 1657868
```