abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Qwen2-7B-Instruct model: numpy.core._exceptions._ArrayMemoryError #1542

Open khoinpd0411 opened 3 weeks ago

khoinpd0411 commented 3 weeks ago

I cannot run the quantized Qwen2-7B-Instruct model locally. The system keeps raising a MemoryError, which seems strange: the same problem does not occur with other models such as Mistral-7B-Instruct. I have also tried a lower-bit quantized version, but that does not help either. My local machine has 16 GB of CPU RAM.

I have also updated the package to the latest version with pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

llama_model_loader: loaded meta data with 21 key-value pairs and 339 tensors from /home/user/llama.cpp/models/Qwen2/qwen2-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = qwen2-7b
llama_model_loader: - kv 2: qwen2.block_count u32 = 28
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 17
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t", ...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q5_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
llm_load_vocab: special tokens cache size = 421
llm_load_vocab: token to piece cache size = 0.9352 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.62 B
llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name = qwen2-7b
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_tensors: ggml ctx size = 0.16 MiB
llm_load_tensors: CPU buffer size = 5186.92 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1792.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 304.00 MiB
llama_new_context_with_model: graph nodes = 875
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

File "/home/user/anaconda3/envs/llm-app/lib/python3.8/site-packages/llama_cpp/llama.py", line 406, in init self.scores: npt.NDArray[np.single] = np.ndarray( numpy.core._exceptions._ArrayMemoryError: Unable to allocate 18.6 GiB for an array with shape (32768, 152064) and data type float32

abetlen commented 2 weeks ago

@khoinpd0411 sorry about that. Currently the Llama class keeps all past logits in memory, which can take up a lot of space at larger context sizes. I do plan to fix this, but it requires a larger change to the Llama class internals. For now I would recommend reducing your context size from 32k; that should work.
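
Something along these lines should do it. This is just a sketch: the model path is the one from your log, and n_ctx=8192 is an example value, so pick whatever fits your RAM:

from llama_cpp import Llama

# Load with a reduced context window instead of the model's full 32k training context.
# n_ctx=8192 is only an example; 8192 * 152064 * 4 bytes is roughly 4.6 GiB for the logits buffer.
llm = Llama(
    model_path="/home/user/llama.cpp/models/Qwen2/qwen2-7b-instruct-q5_k_m.gguf",
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])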

khoinpd0411 commented 2 weeks ago

Thank you so much for your response! Reducing the context size does let the model load. I also noticed that, even though the default context size of Qwen2-7B and Mistral-7B is 32k for both, Qwen2-7B's vocabulary is about 4 times larger than its counterpart's, which leads to the memory problem that does not occur with Mistral-7B.

abetlen commented 2 weeks ago

@khoinpd0411 the size of that array is also proportional to the vocab size: Qwen2 has a vocab of ~150k, while Mistral only has a 32k vocab.
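
A quick comparison at the same 32k context (taking Mistral's vocab size as 32000 for illustration):

# Per-model size of a float32 (n_ctx, n_vocab) logits buffer at n_ctx = 32768.
n_ctx = 32768
for name, n_vocab in [("Qwen2-7B", 152064), ("Mistral-7B", 32000)]:
    print(f"{name}: {n_ctx * n_vocab * 4 / 2**30:.1f} GiB")
# Qwen2-7B: 18.6 GiB, Mistral-7B: 3.9 GiB -- which is roughly why Mistral still fits in 16 GB at full context.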