ggerganov / llama.cpp

LLM inference in C/C++
MIT License

GGML_ASSERT: llama.cpp:14101: (qs.n_attention_wv == 0 || qs.n_attention_wv == (int)model.hparams.n_layer) && "n_attention_wv is unexpected" #6838

Closed: darcy1990 closed this issue 3 months ago

darcy1990 commented 5 months ago

Running quantize after convert, the following problem occurs:

➜ llama ./llama.cpp/quantize ./chinese-llama-2-7b-hf/ggml-model-f16.gguf ./chinese-llama-2-7b-hf/ggml-model-q4_0.gguf 2
main: build = 2695 (bca40e98)
main: built with Apple clang version 12.0.0 (clang-1200.0.32.29) for x86_64-apple-darwin19.4.0
main: quantizing './chinese-llama-2-7b-hf/ggml-model-f16.gguf' to './chinese-llama-2-7b-hf/ggml-model-q4_0.gguf' as Q4_0
llama_model_loader: loaded meta data with 21 key-value pairs and 217 tensors from ./chinese-llama-2-7b-hf/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture                    str  = llama
llama_model_loader: - kv   1: general.name                            str  = LLaMA v2
llama_model_loader: - kv   2: llama.vocab_size                        u32  = 55296
llama_model_loader: - kv   3: llama.context_length                    u32  = 4096
llama_model_loader: - kv   4: llama.embedding_length                  u32  = 4096
llama_model_loader: - kv   5: llama.block_count                       u32  = 32
llama_model_loader: - kv   6: llama.feed_forward_length               u32  = 11008
llama_model_loader: - kv   7: llama.rope.dimension_count              u32  = 128
llama_model_loader: - kv   8: llama.attention.head_count              u32  = 32
llama_model_loader: - kv   9: llama.attention.head_count_kv           u32  = 32
llama_model_loader: - kv  10: llama.attention.layer_norm_rms_epsilon  f32  = 0.000010
llama_model_loader: - kv  11: general.file_type                       u32  = 1
llama_model_loader: - kv  12: tokenizer.ggml.model                    str  = llama
llama_model_loader: - kv  13: tokenizer.ggml.tokens                   arr[str,55296] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv  14: tokenizer.ggml.scores                   arr[f32,55296] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15: tokenizer.ggml.token_type               arr[i32,55296] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16: tokenizer.ggml.bos_token_id             u32  = 1
llama_model_loader: - kv  17: tokenizer.ggml.eos_token_id             u32  = 2
llama_model_loader: - kv  18: tokenizer.ggml.padding_token_id         u32  = 0
llama_model_loader: - kv  19: tokenizer.ggml.add_bos_token            bool = true
llama_model_loader: - kv  20: tokenizer.ggml.add_eos_token            bool = false
llama_model_loader: - type  f32: 48 tensors
llama_model_loader: - type  f16: 169 tensors
GGML_ASSERT: llama.cpp:14101: (qs.n_attention_wv == 0 || qs.n_attention_wv == (int)model.hparams.n_layer) && "n_attention_wv is unexpected"
[1] 19747 abort ./llama.cpp/quantize ./chinese-llama-2-7b-hf/ggml-model-f16.gguf
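For context, the failing GGML_ASSERT checks that the quantizer saw either zero attention V weight tensors or exactly one per transformer layer (llama.block_count = 32 in the metadata above). A minimal, simplified sketch of that invariant, not the actual llama.cpp code, with hypothetical tensor names:

```cpp
// Simplified illustration of the consistency check behind the assert:
// while collecting tensors for quantization, the number of attn_v.weight
// tensors must be 0 or exactly equal to the number of layers.
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const int n_layer = 32;  // corresponds to llama.block_count in the GGUF metadata

    // Hypothetical tensor list: one attn_v.weight (plus other weights) per block.
    std::vector<std::string> tensor_names;
    for (int i = 0; i < n_layer; ++i) {
        tensor_names.push_back("blk." + std::to_string(i) + ".attn_v.weight");
        tensor_names.push_back("blk." + std::to_string(i) + ".ffn_down.weight");
    }

    int n_attention_wv = 0;
    for (const auto & name : tensor_names) {
        if (name.find("attn_v.weight") != std::string::npos) {
            ++n_attention_wv;
        }
    }

    // The invariant the failing GGML_ASSERT enforces: a GGUF produced by an
    // incomplete or mismatched conversion can yield a count that is neither
    // 0 nor n_layer, which aborts quantization.
    if (n_attention_wv != 0 && n_attention_wv != n_layer) {
        fprintf(stderr, "n_attention_wv is unexpected: %d (n_layer = %d)\n",
                n_attention_wv, n_layer);
        return 1;
    }

    printf("OK: %d attn_v.weight tensors for %d layers\n", n_attention_wv, n_layer);
    return 0;
}
```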

OS: macOS 14.4.1 (23E224) on a MacBook Pro with an Intel Core CPU

Using the latest llama.cpp.

Model: the 7B model from https://github.com/ymcui/Chinese-LLaMA-Alpaca-2?tab=readme-ov-file
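A rough way to check whether the converted GGUF is self-consistent is to count the attn_v.weight tensors and compare against llama.block_count. The sketch below uses the gguf C API from ggml (gguf_init_from_file, gguf_get_n_tensors, gguf_get_tensor_name, gguf_find_key, gguf_get_val_u32, gguf_free); exact signatures are an assumption and may differ between ggml versions, so treat this as a diagnostic idea rather than the project's recommended tooling.

```cpp
// Rough diagnostic sketch (assumes the gguf C API declared in ggml.h):
// count "attn_v.weight" tensors in a GGUF file and compare the count
// against the llama.block_count metadata key.
#include "ggml.h"

#include <cstdint>
#include <cstdio>
#include <cstring>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s ggml-model-f16.gguf\n", argv[0]);
        return 1;
    }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    // Read the declared layer count from the metadata, if present.
    const int key = gguf_find_key(ctx, "llama.block_count");
    const uint32_t n_layer = key >= 0 ? gguf_get_val_u32(ctx, key) : 0;

    // Count the attention V weight tensors actually stored in the file.
    int n_attention_wv = 0;
    for (int i = 0; i < gguf_get_n_tensors(ctx); ++i) {
        if (strstr(gguf_get_tensor_name(ctx, i), "attn_v.weight")) {
            ++n_attention_wv;
        }
    }

    printf("llama.block_count = %u, attn_v.weight tensors = %d\n", n_layer, n_attention_wv);
    gguf_free(ctx);
    return 0;
}
```

If the two numbers disagree (and the tensor count is not zero), the quantize step will hit the same assert, which points at the conversion step or the source checkpoint rather than at quantize itself.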

goroggy commented 4 months ago

Possibly same as #6702

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.