ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Avoid loading GPU layers into RAM before moving them to VRAM. This should allow the use of --no-mmap with models that do not fit in RAM but fit in RAM+VRAM. #9059

Open ThomasBaruzier opened 1 month ago

ThomasBaruzier commented 1 month ago


Feature Description

Hello,

I am currently working on running Llama 405B and DeepSeek Coder V2 on my setup, which includes 128GB of RAM and 24GB of VRAM.

To run these large models effectively, I need to avoid having weights paged in from disk during inference, as that severely impacts performance. This is why I am using the --no-mmap option.

The problem is that llama.cpp initially loads the entire model and its layers into RAM before offloading some layers into VRAM.

Given this, the largest models I can run without dipping into painfully slow token-per-minute territory are limited by my RAM capacity.

It would be highly beneficial if the --no-mmap option could be applied only to the layers that remain in RAM, or if the necessary layers could be directly loaded into VRAM.

With these modifications, we could load larger models and higher quantization levels with minimal speed loss, and avoid relying on disk caching when a model fits within the combined RAM + VRAM but not in RAM alone.

Here are the current speeds I achieve with Llama 3.1 405B Instruct, offloading the maximum number of layers for each:

| Quant | Size (MB) | Speed (tok/s) | --no-mmap |
|---------|-----------|---------------|-----------|
| IQ2_S   | 121,544   | 0.42          | Enabled   |
| IQ2_M   | 132,116   | 0.38          | Enabled   |
| IQ3_XXS | 150,407   | Crash         | Enabled   |
| IQ3_XXS | 150,407   | 0.02          | Disabled  |

Motivation

Being able to load larger models, or higher-quality quants, without relying on disk reads would greatly improve their speed.

Possible Implementation

Apply --no-mmap only to the layers that remain in RAM, and load the GPU-bound layers directly into VRAM, for example by streaming them through a small staging buffer as sketched below.
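
A minimal sketch of the direct-to-VRAM path for the GPU-bound layers, assuming a CUDA backend and plain file I/O; the function below is hypothetical and ignores llama.cpp's actual loader and backend abstraction:

```c
// Hypothetical sketch: stream a tensor from disk straight to VRAM through a
// small pinned staging buffer, so host RAM usage for this tensor stays
// bounded by CHUNK_SIZE instead of the full tensor size.
// Not llama.cpp's actual loader code.
#include <cuda_runtime.h>
#include <stdio.h>

#define CHUNK_SIZE (64u * 1024u * 1024u) /* 64 MiB staging buffer */

static int stream_tensor_to_vram(FILE *f, long file_offset, size_t n_bytes, void *dev_ptr) {
    void *staging = NULL;
    if (cudaHostAlloc(&staging, CHUNK_SIZE, cudaHostAllocDefault) != cudaSuccess) return -1;
    if (fseek(f, file_offset, SEEK_SET) != 0) { cudaFreeHost(staging); return -1; }

    size_t done = 0;
    while (done < n_bytes) {
        size_t chunk = (n_bytes - done) < CHUNK_SIZE ? (n_bytes - done) : CHUNK_SIZE;
        if (fread(staging, 1, chunk, f) != chunk) { cudaFreeHost(staging); return -1; }
        // Copy this chunk to its place in the device buffer; only CHUNK_SIZE
        // bytes of host RAM are ever resident for this tensor.
        if (cudaMemcpy((char *)dev_ptr + done, staging, chunk,
                       cudaMemcpyHostToDevice) != cudaSuccess) { cudaFreeHost(staging); return -1; }
        done += chunk;
    }
    cudaFreeHost(staging);
    return 0;
}
```

Layers that stay on the CPU would still be read into regular (non-mmapped) host buffers as --no-mmap does today; peak host memory for the offloaded layers would then be bounded by the staging buffer rather than their full size.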

slaren commented 1 month ago

> The problem is that llama.cpp initially loads the entire model and its layers into RAM before offloading some layers into VRAM.

This is not the case.

ThomasBaruzier commented 1 month ago

> > The problem is that llama.cpp initially loads the entire model and its layers into RAM before offloading some layers into VRAM.
>
> This is not the case.

Alright, I'm going to try again. I'll come back and edit my reply when I find something.

fairydreaming commented 4 weeks ago

@ThomasBaruzier What is the context size that you use? The larger the context size, the more memory the KV buffer uses; I'm not sure whether you take that into account in your memory usage calculations.
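
For a rough sense of scale, here is a back-of-the-envelope estimate of the KV buffer, assuming Llama 3.1 405B's published shape (126 layers, 8 KV heads, head dimension 128) and an F16 cache; these numbers are assumptions for illustration, not measurements:

```c
// Back-of-the-envelope KV cache size (assumed shape: 126 layers, 8 KV heads,
// head dim 128, F16 cache = 2 bytes/element). Not taken from llama.cpp code.
#include <stdio.h>

int main(void) {
    const long n_layer = 126, n_kv_head = 8, head_dim = 128;
    const long n_ctx = 1024, bytes_per_elem = 2; /* F16 */
    /* K and V each hold n_ctx * n_kv_head * head_dim elements per layer. */
    long kv_bytes = 2 * n_layer * n_ctx * n_kv_head * head_dim * bytes_per_elem;
    printf("KV cache ~ %.0f MiB\n", kv_bytes / (1024.0 * 1024.0));
    return 0;
}
```

Under these assumptions the KV buffer at a 1024-token context is on the order of half a GiB, small compared to the 120-150 GB of model weights in the table above.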

ThomasBaruzier commented 4 weeks ago

> What is the context size that you use?

Very small: --ctx-size of 1024, F16 KV cache, no flash attention (FA).

I don't know when I will have the time to retry, since every model load takes 10 minutes from a hard drive, but I'll get back to it soon enough.
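
For reference, the invocation is roughly of this form (the path, prompt, and -ngl value are placeholders; the F16 cache and disabled flash attention are simply the defaults):

```
./llama-cli -m /path/to/Llama-3.1-405B-Instruct-IQ2_S.gguf \
    --ctx-size 1024 -ngl 30 --no-mmap -p "Hello"
```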

WilliamTambellini commented 2 weeks ago

With mmap (default):

```
$ /usr/bin/time -v bin/llama-cli -m /llms/qwen2-1.5Bf16/qwen2-1_5b-instruct-fp16.gguf -ngl 99 -f /llms/test2.txt
...
llm_load_print_meta: model size       = 2.88 GiB (16.00 BPW)
...
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   445.12 MiB
llm_load_tensors:      CUDA0 buffer size =  2944.68 MiB
...
Maximum resident set size (kbytes): 4218296
```

Without mmap:

```
$ /usr/bin/time -v bin/llama-cli -m /llms/qwen2-1.5Bf16/qwen2-1_5b-instruct-fp16.gguf -ngl 99 -f /llms/test2.txt --no-mmap
...
Maximum resident set size (kbytes): 1657868
```