Open · ngxson opened this issue 7 months ago
ngxson: @slaren Sorry for bothering you again. I'm leaving this bug report here so you can take a look whenever you want. It's not urgent. Thank you!
slaren: I have opened #5653, but this requires changes in the backends and it is not a priority at the moment.
This issue was closed because it has been inactive for 14 days since being marked as stale.
`llama_kv_cache_seq_shift` or `llama_kv_cache_seq_rm` (or both of them) is broken when the K cache type is q4_0.

In `main.cpp`, these functions are used for "context swapping": when the context is full, old tokens are removed from the sequence to make room for new ones (roughly the pattern in the sketch below).
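For reference, the relevant logic in `main.cpp` looks roughly like this sketch (paraphrased, not a verbatim copy of the source; `context_swap` is a hypothetical helper name, and `n_keep`/`n_past` mirror the variables the example uses):

```cpp
#include "llama.h"

// Paraphrased sketch of the context-swap step: once the context is full,
// discard the oldest half of the tokens after the kept prefix, then shift the
// remaining KV cells back so the positions stay contiguous.
static void context_swap(llama_context * ctx, int & n_past, int n_keep) {
    const int n_left    = n_past - n_keep;
    const int n_discard = n_left / 2;

    // remove cells [n_keep, n_keep + n_discard) from sequence 0
    llama_kv_cache_seq_rm   (ctx, 0, n_keep, n_keep + n_discard);
    // shift cells [n_keep + n_discard, n_past) back by n_discard positions
    llama_kv_cache_seq_shift(ctx, 0, n_keep + n_discard, n_past, -n_discard);

    n_past -= n_discard;
}
```

With a q4_0 K cache, it is this remove/shift step that misbehaves.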
My command:

```
./main -m ../dolphin-2.0-mistral-7b.Q4_K_M.gguf -p "test" -n 50 --cache-type-k q4_0 -c 10
```

(It works normally without `--cache-type-k q4_0`.)

See the logs below for more details:
stdout / stderr
```
Log start
main: build = 2232 (7fe4678b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1708557165
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ../dolphin-2.0-mistral-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = ehartford_dolphin-2.0-mistral-7b
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = ehartford_dolphin-2.0-mistral-7b
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 '
```

main.log
```
[1708557165] Log start
[1708557165] Cmd: ./main -m ../dolphin-2.0-mistral-7b.Q4_K_M.gguf -p test -n 50 --cache-type-k q4_0 -c 10
[1708557165] main: build = 2232 (7fe4678b)
[1708557165] main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
[1708557165] main: seed = 1708557165
[1708557165] main: llama backend init
[1708557165] main: load the model and apply lora adapter, if any
[1708557165] llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ../dolphin-2.0-mistral-7b.Q4_K_M.gguf (version GGUF V2)
[1708557165] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[1708557165] llama_model_loader: - kv 0: general.architecture str = llama
[1708557165] llama_model_loader: - kv 1: general.name str = ehartford_dolphin-2.0-mistral-7b
[1708557165] llama_model_loader: - kv 2: llama.context_length u32 = 32768
[1708557165] llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
[1708557165] llama_model_loader: - kv 4: llama.block_count u32 = 32
[1708557165] llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
[1708557165] llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
[1708557165] llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
[1708557165] llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
[1708557165] llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
[1708557165] llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
[1708557165] llama_model_loader: - kv 11: general.file_type u32 = 15
[1708557165] llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
[1708557165] llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "<0x00>", "<...
[1708557165] llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
[1708557165] llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
[1708557165] llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
[1708557165] llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
[1708557165] llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
[1708557165] llama_model_loader: - kv 19: general.quantization_version u32 = 2
[1708557165] llama_model_loader: - type f32: 65 tensors
[1708557165] llama_model_loader: - type q4_K: 193 tensors
[1708557165] llama_model_loader: - type q6_K: 33 tensors
[1708557165] llm_load_vocab: special tokens definition check successful ( 259/32000 ).
[1708557165] llm_load_print_meta: format = GGUF V2
[1708557165] llm_load_print_meta: arch = llama
[1708557165] llm_load_print_meta: vocab type = SPM
[1708557165] llm_load_print_meta: n_vocab = 32000
[1708557165] llm_load_print_meta: n_merges = 0
[1708557165] llm_load_print_meta: n_ctx_train = 32768
[1708557165] llm_load_print_meta: n_embd = 4096
[1708557165] llm_load_print_meta: n_head = 32
[1708557165] llm_load_print_meta: n_head_kv = 8
[1708557165] llm_load_print_meta: n_layer = 32
[1708557165] llm_load_print_meta: n_rot = 128
[1708557165] llm_load_print_meta: n_embd_head_k = 128
[1708557165] llm_load_print_meta: n_embd_head_v = 128
[1708557165] llm_load_print_meta: n_gqa = 4
[1708557165] llm_load_print_meta: n_embd_k_gqa = 1024
[1708557165] llm_load_print_meta: n_embd_v_gqa = 1024
[1708557165] llm_load_print_meta: f_norm_eps = 0.0e+00
[1708557165] llm_load_print_meta: f_norm_rms_eps = 1.0e-05
[1708557165] llm_load_print_meta: f_clamp_kqv = 0.0e+00
[1708557165] llm_load_print_meta: f_max_alibi_bias = 0.0e+00
[1708557165] llm_load_print_meta: n_ff = 14336
[1708557165] llm_load_print_meta: n_expert = 0
[1708557165] llm_load_print_meta: n_expert_used = 0
[1708557165] llm_load_print_meta: rope scaling = linear
[1708557165] llm_load_print_meta: freq_base_train = 10000.0
[1708557165] llm_load_print_meta: freq_scale_train = 1
[1708557165] llm_load_print_meta: n_yarn_orig_ctx = 32768
[1708557165] llm_load_print_meta: rope_finetuned = unknown
[1708557165] llm_load_print_meta: model type = 7B
[1708557165] llm_load_print_meta: model ftype = Q4_K - Medium
[1708557165] llm_load_print_meta: model params = 7.24 B
[1708557165] llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
[1708557165] llm_load_print_meta: general.name = ehartford_dolphin-2.0-mistral-7b
[1708557165] llm_load_print_meta: BOS token = 1 ''
[1708557165] llm_load_print_meta: EOS token = 2 ''
[1708557165] llm_load_print_meta: UNK token = 0 '
```