Open ThomasBaruzier opened 1 month ago
> The problem is that llama.cpp initially loads the entire model and its layers into RAM before offloading some layers into VRAM.

This is not the case.
Alright, I'm going to try again. I'll come back and edit my reply when I find something.
@ThomasBaruzier What context size are you using? The larger the context size, the more memory the KV buffer takes, and I'm not sure whether you account for that in your memory usage calculations.
> What is the context size that you use?

Very small: `--ctx-size` of 1024, F16 cache, no FA.
I don't know when I will have the time to retry since every model load takes 10min using a hard drive, but I'll get back to it soon enough.
with mmap (default):
/usr/bin/time -v bin/llama-cli -m /llms/qwen2-1.5Bf16/qwen2-1_5b-instruct-fp16.gguf -ngl 99 -f /llms/test2.txt
...
...
llm_load_print_meta: model size = 2.88 GiB (16.00 BPW)
...
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU buffer size = 445.12 MiB
llm_load_tensors: CUDA0 buffer size = 2944.68 MiB
...
Maximum resident set size (kbytes): 4218296
without mmap:
$ /usr/bin/time -v bin/llama-cli -m /llms/qwen2-1.5Bf16/qwen2-1_5b-instruct-fp16.gguf -ngl 99 -f /llms/test2.txt --no-mmap
...
Maximum resident set size (kbytes): 1657868
Feature Description
Hello,
I am currently working on running Llama 405B and DeepSeek Coder V2 on my setup, which includes 128GB of RAM and 24GB of VRAM.
To run these large models effectively, I need to avoid disk caching, as it severely impacts performance. This is why I am using the `--no-mmap` option. The problem is that llama.cpp initially loads the entire model and its layers into RAM before offloading some layers into VRAM.
Given this, the largest models I can run without dipping into painfully slow token-per-minute territory are limited by my RAM capacity.
It would be highly beneficial if the `--no-mmap` option could be applied only to the layers that remain in RAM, or if the necessary layers could be loaded directly into VRAM. With these modifications, we could load larger models at higher quantization levels with minimal speed loss, and avoid relying on disk caching when a model fits within the combined RAM + VRAM but not in RAM alone.
Here are the current speeds I achieve with Llama 3.1 405B Instruct, offloading the maximum number of layers for each:
Motivation
It would be very useful to be able to load larger models at higher quants without relying on disk caching, which would greatly improve the speed of these models.
Possible Implementation