Closed: olaplateaumed closed this issue 1 month ago.
I have a similar problem on a Linux laptop. I am running an LLM whose GGUF is 4.5 GB in size, and the program uses 17 GB of memory to run an inference. If I had less than 17 GB of memory, it would have crashed.
Try adding the argument -c 1024 to limit the context window size.
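For example, a sketch of such an invocation (the model path and prompt here are placeholders; adjust them to your setup, -c / --ctx-size is the flag that limits the context):
./llama-cli -m model.gguf -p "Hello" -n 128 -c 1024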
By adding the option "-c 1024", the memory usage drops from 17 GB to 220 MB with the same model and same settings. Is that normal? The model has a default context size of 128K.
Yes, that is expected. The model's default context size of 128K describes its capability: it can handle contexts of up to 128K tokens, not that it requires that much. llama.cpp allocates memory (notably the KV cache) for the whole requested context up front, so you need to limit it to a smaller value to fit the device's memory.
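As a rough illustration of where the memory goes, here is a back-of-the-envelope sketch only: it assumes a Phi-3-mini-like model with 32 layers, a 3072-wide KV dimension, an fp16 cache, and no grouped-query attention; the real numbers depend on the specific GGUF.
# KV cache ≈ 2 (K and V) * n_layers * n_ctx * kv_dim * 2 bytes (fp16)
echo $(( 2 * 32 * 131072 * 3072 * 2 / 1024 / 1024 ))  # full 128K context: 49152 MiB (~48 GiB)
echo $(( 2 * 32 * 1024 * 3072 * 2 / 1024 / 1024 ))    # with -c 1024: 384 MiB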
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
I tried to run this command:
./llama-cli -m phi3:latest.gguf -p "I believe the meaning of life is" -n 128
and it fails to load the model with the following error:
llama_init_from_gpt_params: error: failed to create context with model 'phi3:latest.gguf'
I usually run ollama on this same machine with no issues, and I just thought I'd try out llama.cpp using a lightweight model like Phi3, but it looks like llama.cpp is failing to allocate memory. Note: this same command works for Llama models, e.g.
llama3:8b.gguf
works fine. Could it be a Phi3 issue? Do I need some extra configuration?
Laptop specs: Apple MacBook Pro with M1 Pro, Mem: 16 GB, OS: macOS Sonoma 14.6
Name and Version
./llama-cli --version
version: 3590 (4b9afbbe) built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin23.6.0
What operating system are you seeing the problem on?
Mac
Relevant log output