Closed: olaplateaumed closed this issue 1 month ago.
I have a similar problem on a Linux laptop. I am running an LLM whose GGUF is 4.5 GB in size, and the program uses 17 GB of memory to run an inference. If I had less than 17 GB of memory, it would have crashed.
Try adding the argument -c 1024 to limit the context window size.
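For example, a sketch of such an invocation (the model path and prompt here are placeholders; adjust them to your setup, -c / --ctx-size is the flag that limits the context):
./llama-cli -m model.gguf -p "Hello" -n 128 -c 1024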
By adding the option "-c 1024", the memory usage drops from 17 GB to 220 MB with the same model and same settings. Is that normal? The model has a default context size of 128K.
Yes, that is expected. The model's default context size of 128K describes its capability: it can handle contexts of up to 128K tokens, not that it requires that much. llama.cpp allocates memory (notably the KV cache) for the whole requested context up front, so you need to limit it to a smaller value to fit the device's memory.
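As a rough illustration of where the memory goes, here is a back-of-the-envelope sketch only: it assumes a Phi-3-mini-like model with 32 layers, a 3072-wide KV dimension, an fp16 cache, and no grouped-query attention; the real numbers depend on the specific GGUF.
# KV cache ≈ 2 (K and V) * n_layers * n_ctx * kv_dim * 2 bytes (fp16)
echo $(( 2 * 32 * 131072 * 3072 * 2 / 1024 / 1024 ))  # full 128K context: 49152 MiB (~48 GiB)
echo $(( 2 * 32 * 1024 * 3072 * 2 / 1024 / 1024 ))    # with -c 1024: 384 MiB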
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
I tried to run this command:
./llama-cli -m phi3:latest.gguf -p "I believe the meaning of life is" -n 128
and it fails to load the model with the following error:
llama_init_from_gpt_params: error: failed to create context with model 'phi3:latest.gguf'
I usually run ollama on this same machine with no issues, and I just thought I'd try out llama.cpp using a lightweight model like Phi3, but it looks like llama.cpp is failing to allocate memory. Note: this same command works for Llama models, e.g.
llama3:8b.gguf
works fine. Could it be a Phi3 issue? Do I need some extra configuration?
Laptop specs: Apple MacBook Pro with M1 Pro, Mem: 16 GB, OS: macOS Sonoma 14.6
Name and Version
./llama-cli --version
version: 3590 (4b9afbbe) built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.6.0
What operating system are you seeing the problem on?
Mac
Relevant log output