ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Severe performance degradation when using llama.cpp to deploy a pruned llama3.1 model #9818

Open gudehhh666 opened 1 month ago

gudehhh666 commented 1 month ago

What happened?

Hi, when I use llama.cpp to deploy a pruned llama3.1-8b model, a severe performance degradation appears. We use a structured pruning method (LLM-Pruner) to prune llama3.1-8b: we cut 30% of the parameters in each layer from layer 4 to layer 29, save the result in HF format, then convert it to GGUF format using the official conversion script.
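One check that may help narrow this down (an assumption on my side, not something confirmed in this report): the GGUF metadata printed in the log below stores a single `llama.feed_forward_length` value for the whole model, while pruning only some layers leaves different layers with different widths. The sketch below lists the layers whose `blk.N.ffn_up.weight` width disagrees with that single value. The shape map, the pruned width of 10035, the `(out_features, in_features)` shape order, and the inclusive layer range 4 to 29 are all illustrative assumptions; real shapes would have to be read from the pruned checkpoint.

```python
import re

# Matches the per-layer FFN up-projection tensor names seen in the log.
FFN_UP = re.compile(r"^blk\.(\d+)\.ffn_up\.weight$")


def ffn_widths_by_layer(shapes):
    """Map layer index -> FFN width, taken from each ffn_up weight shape.

    `shapes` maps tensor name -> (out_features, in_features); this shape
    order is an assumption made for the example.
    """
    widths = {}
    for name, shape in shapes.items():
        match = FFN_UP.match(name)
        if match:
            widths[int(match.group(1))] = shape[0]
    return widths


def layers_mismatching_metadata(shapes, feed_forward_length):
    """Return sorted layer indices whose FFN width differs from the metadata."""
    widths = ffn_widths_by_layer(shapes)
    return sorted(i for i, w in widths.items() if w != feed_forward_length)


def build_example_shapes(n_layers=32, hidden=4096, full=14336, pruned=10035):
    """Hypothetical shape map: layers 4..29 pruned, the rest left at full width."""
    shapes = {}
    for i in range(n_layers):
        width = pruned if 4 <= i <= 29 else full
        shapes[f"blk.{i}.ffn_up.weight"] = (width, hidden)
    return shapes


if __name__ == "__main__":
    example = build_example_shapes()
    print(layers_mismatching_metadata(example, feed_forward_length=14336))
```

If such a check reports mismatching layers for the real checkpoint, the converted file's metadata cannot describe every layer with one value, which would be worth ruling out before looking elsewhere.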

We can use llama.cpp to load the pruned GGUF model and generate an answer; however, the output from the pruned GGUF file shows severe performance degradation:

Here is a comparison. We use the same prompt:

Complete the following python code:
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

That's total nonsense!!

Also, we printed some logs when running llama.cpp; here are the details:

----------------------------- reasoning about gguf --------------------------
Complete the following python code:
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

    # Define a variable to hold the result
    result = False

    # Sort the list of numbers in ascending order
    numbers.sort()

    # Iterate over the sorted list
    for i in range(1, len(numbers)):
        # Check if the difference between the current and previous number is smaller than the given threshold
        if numbers[i] - numbers[i-1] < threshold:
            # If it is, set the result to True and break the loop
            result = True
            break

    # Return the result
    return result

# Test the function
print(has_close_elements([1.
----------------------------- reasoning about gguf --------------------------
Log start
main: build = 3735 (df4b7945)
main: built with cc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5) for x86_64-redhat-linux
llama_model_loader: loaded meta data with 31 key-value pairs and 292 tensors from /data2/xmwang/deployed_gguf/llama3.1.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = llama3.1_8B
llama_model_loader: - kv   3:                         general.size_label str              = 8.0B
llama_model_loader: - kv   4:                            general.license str              = llama3.1
llama_model_loader: - kv   5:                   general.base_model.count u32              = 1
llama_model_loader: - kv   6:                  general.base_model.0.name str              = Meta Llama 3.1 8B
llama_model_loader: - kv   7:          general.base_model.0.organization str              = Meta Llama
llama_model_loader: - kv   8:              general.base_model.0.repo_url str              = https://huggingface.co/meta-llama/Met...
llama_model_loader: - kv   9:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv  10:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv  11:                          llama.block_count u32              = 32
llama_model_loader: - kv  12:                       llama.context_length u32              = 131072
llama_model_loader: - kv  13:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  14:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  15:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  16:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  17:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  18:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                          general.file_type u32              = 1
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 14.96 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = llama3.1_8B
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
  Device 1: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
  Device 2: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
  Device 3: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.68 MiB
llama.cpp, get_tensor_meta, name = token_embd.weight 
llama.cpp, get_tensor_meta, name = output_norm.weight 
llama.cpp, get_tensor_meta, name = output.weight 
llama.cpp, get_tensor_meta, name = blk.0.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.0.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.0.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.0.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.0.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.0.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.0.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.0.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.0.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.0.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.0.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.0.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.0.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.0.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.0.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.0.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.1.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.1.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.1.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.1.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.1.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.1.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.1.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.1.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.1.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.1.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.1.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.1.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.1.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.1.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.1.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.1.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.2.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.2.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.2.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.2.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.2.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.2.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.2.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.2.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.2.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.2.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.2.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.2.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.2.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.2.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.2.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.2.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.3.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.3.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.3.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.3.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.3.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.3.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.3.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.3.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.3.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.3.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.3.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.3.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.3.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.3.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.3.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.3.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.4.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.4.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.4.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.4.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.4.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.4.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.4.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.4.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.4.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.4.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.4.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.4.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.4.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.4.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.4.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.4.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.5.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.5.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.5.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.5.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.5.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.5.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.5.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.5.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.5.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.5.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.5.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.5.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.5.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.5.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.5.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.5.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.6.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.6.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.6.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.6.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.6.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.6.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.6.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.6.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.6.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.6.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.6.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.6.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.6.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.6.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.6.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.6.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.7.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.7.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.7.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.7.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.7.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.7.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.7.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.7.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.7.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.7.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.7.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.7.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.7.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.7.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.7.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.7.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.8.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.8.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.8.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.8.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.8.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.8.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.8.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.8.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.8.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.8.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.8.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.8.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.8.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.8.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.8.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.8.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.9.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.9.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.9.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.9.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.9.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.9.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.9.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.9.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.9.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.9.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.9.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.9.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.9.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.9.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.9.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.9.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.10.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.10.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.10.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.10.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.10.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.10.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.10.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.10.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.10.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.10.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.10.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.10.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.10.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.10.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.10.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.10.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.11.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.11.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.11.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.11.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.11.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.11.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.11.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.11.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.11.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.11.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.11.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.11.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.11.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.11.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.11.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.11.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.12.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.12.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.12.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.12.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.12.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.12.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.12.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.12.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.12.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.12.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.12.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.12.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.12.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.12.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.12.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.12.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.13.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.13.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.13.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.13.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.13.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.13.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.13.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.13.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.13.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.13.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.13.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.13.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.13.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.13.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.13.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.13.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.14.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.14.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.14.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.14.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.14.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.14.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.14.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.14.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.14.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.14.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.14.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.14.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.14.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.14.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.14.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.14.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.15.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.15.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.15.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.15.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.15.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.15.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.15.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.15.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.15.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.15.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.15.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.15.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.15.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.15.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.15.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.15.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.16.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.16.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.16.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.16.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.16.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.16.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.16.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.16.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.16.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.16.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.16.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.16.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.16.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.16.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.16.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.16.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.17.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.17.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.17.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.17.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.17.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.17.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.17.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.17.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.17.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.17.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.17.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.17.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.17.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.17.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.17.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.17.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.18.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.18.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.18.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.18.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.18.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.18.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.18.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.18.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.18.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.18.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.18.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.18.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.18.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.18.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.18.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.18.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.19.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.19.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.19.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.19.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.19.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.19.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.19.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.19.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.19.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.19.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.19.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.19.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.19.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.19.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.19.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.19.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.20.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.20.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.20.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.20.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.20.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.20.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.20.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.20.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.20.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.20.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.20.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.20.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.20.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.20.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.20.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.20.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.21.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.21.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.21.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.21.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.21.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.21.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.21.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.21.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.21.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.21.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.21.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.21.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.21.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.21.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.21.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.21.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.22.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.22.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.22.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.22.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.22.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.22.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.22.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.22.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.22.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.22.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.22.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.22.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.22.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.22.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.22.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.22.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.23.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.23.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.23.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.23.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.23.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.23.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.23.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.23.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.23.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.23.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.23.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.23.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.23.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.23.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.23.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.23.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.24.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.24.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.24.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.24.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.24.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.24.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.24.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.24.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.24.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.24.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.24.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.24.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.24.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.24.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.24.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.24.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.25.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.25.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.25.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.25.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.25.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.25.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.25.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.25.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.25.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.25.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.25.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.25.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.25.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.25.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.25.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.25.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.26.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.26.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.26.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.26.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.26.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.26.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.26.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.26.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.26.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.26.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.26.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.26.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.26.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.26.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.26.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.26.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.27.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.27.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.27.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.27.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.27.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.27.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.27.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.27.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.27.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.27.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.27.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.27.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.27.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.27.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.27.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.27.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.28.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.28.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.28.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.28.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.28.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.28.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.28.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.28.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.28.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.28.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.28.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.28.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.28.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.28.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.28.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.28.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.29.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.29.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.29.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.29.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.29.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.29.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.29.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.29.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.29.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.29.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.29.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.29.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.29.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.29.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.29.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.29.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.30.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.30.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.30.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.30.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.30.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.30.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.30.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.30.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.30.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.30.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.30.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.30.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.30.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.30.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.30.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.30.ffn_up.bias 
llama.cpp, get_tensor_meta, name = blk.31.attn_norm.weight 
llama.cpp, get_tensor_meta, name = blk.31.attn_q.weight 
llama.cpp, get_tensor_meta, name = blk.31.attn_k.weight 
llama.cpp, get_tensor_meta, name = blk.31.attn_v.weight 
llama.cpp, get_tensor_meta, name = blk.31.attn_output.weight 
llama.cpp, get_tensor_meta, name = blk.31.attn_q.bias 
llama.cpp, get_tensor_meta, name = blk.31.attn_k.bias 
llama.cpp, get_tensor_meta, name = blk.31.attn_v.bias 
llama.cpp, get_tensor_meta, name = blk.31.attn_output.bias 
llama.cpp, get_tensor_meta, name = blk.31.ffn_norm.weight 
llama.cpp, get_tensor_meta, name = rope_freqs.weight 
llama.cpp, get_tensor_meta, name = blk.31.ffn_gate.weight 
llama.cpp, get_tensor_meta, name = blk.31.ffn_down.weight 
llama.cpp, get_tensor_meta, name = blk.31.ffn_up.weight 
llama.cpp, get_tensor_meta, name = blk.31.ffn_gate.bias 
llama.cpp, get_tensor_meta, name = blk.31.ffn_down.bias 
llama.cpp, get_tensor_meta, name = blk.31.ffn_up.bias 
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =  1002.00 MiB
llm_load_tensors:      CUDA0 buffer size =  3744.28 MiB
llm_load_tensors:      CUDA1 buffer size =  3328.25 MiB
llm_load_tensors:      CUDA2 buffer size =  3328.25 MiB
llm_load_tensors:      CUDA3 buffer size =  3914.24 MiB
..........................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  4608.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =  4096.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =  4096.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =  3584.00 MiB
llama_new_context_with_model: KV self size  = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 9280.01 MiB
ggml_gallocr_reserve_n: reallocating CUDA1 buffer from size 0.00 MiB to 9280.01 MiB
ggml_gallocr_reserve_n: reallocating CUDA2 buffer from size 0.00 MiB to 9280.01 MiB
ggml_gallocr_reserve_n: reallocating CUDA3 buffer from size 0.00 MiB to 9280.02 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 1032.02 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  9280.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  9280.01 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  9280.01 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  9280.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =  1032.02 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 5
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [4096 2 1 1]
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-9] [4096 2 1 1]
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-17] [4096 2 1 1]
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-25] [4096 2 1 1]

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling seed: 3146292980
sampling params: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler constr: 
    logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist 
generate: n_ctx = 131072, n_batch = 2048, n_predict = 128, n_keep = 1

Complete the following python code:
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [4096 125 1 1]
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-9] [4096 125 1 1]
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-17] [4096 125 1 1]
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-25] [4096 125 1 1]
    for i inggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates
 range(len(numbers)):
        for j in range(i+1, len(numbers)):
            if abs(numbers[i] - numbers[j]) < threshold:
                return True
    return False

# Example usage:
numbers = [1.0, 2.0, 3.0, 4.0, 5.0, 2.0]
threshold = 0.5
print(has_close_elements(numbers, threshold))  # Output: False

# Another example with two elements closer than threshold
numbers = [1.0, 2.8, 3.0, 4.
llama_perf_print:    sampling time =      83.43 ms /   253 runs   (    0.33 ms per token,  3032.63 tokens per second)
llama_perf_print:        load time =    4301.15 ms
llama_perf_print: prompt eval time =      32.94 ms /   125 tokens (    0.26 ms per token,  3794.32 tokens per second)
llama_perf_print:        eval time =    2550.09 ms /   127 runs   (   20.08 ms per token,    49.80 tokens per second)
llama_perf_print:       total time =    2766.80 ms /   252 tokens
Log end

We noticed these messages in particular:

ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [4096 125 1 1]
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-9] [4096 125 1 1]
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-17] [4096 125 1 1]
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-25] [4096 125 1 1]

However, the same messages also appear when running the original (unpruned) GGUF model.

Has anyone else used llama.cpp to deploy a structurally pruned model?

Name and Version

./llama-cli --version
version: 3735 (df4b7945)
built with cc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5) for x86_64-redhat-linux

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

ggerganov commented 1 month ago

There is most likely an error during the conversion process. For attention tensors, the KV heads can be in different order and this is easy to get wrong. See the reverse_hf_permute_part calls in the convert script and make sure these make sense for your pruned model.

The CUDA warnings are irrelevant.

gudehhh666 commented 1 month ago

Hi, thanks for the pointer. We did find that this issue is related to the KV heads in the conversion process. When structured pruning leaves an odd n_head_kv, the model fails to respond to the query appropriately; when n_head_kv is even, the output is fine. In convert_hf_to_gguf.py we found some clues. Here is the definition of _reverse_hf_permute:

def _reverse_hf_permute(self, weights: Tensor, n_head: int, n_kv_head: int | None = None) -> Tensor:
        if n_kv_head is not None and n_head != n_kv_head:
            n_head //= n_kv_head

        return (
            weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
            .swapaxes(1, 2)
            .reshape(weights.shape)
        )

What is this function for? We can't figure it out. We notice that the third dimension of the reshape is weights.shape[0] // n_head // 2, and we don't understand why // 2 is applied there before the weights are reshaped back to their original shape.

We would appreciate any help figuring this out.
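
One way to read the `// 2` in the function quoted above: within each head, the rows are treated as two halves of size head_dim // 2, and the swapaxes interleaves them (first-half row, second-half row, and so on). This looks like the inverse of the row permutation applied to Q/K weights for rotary embeddings when Llama checkpoints are exported to HF format, but that is an assumption here, not something confirmed in this thread. Below is a minimal NumPy sketch of the reshape on a toy tensor. The names `reverse_permute` and `hf_style_permute` are illustrative and not taken from the convert script, and the `n_kv_head` branch is left out, so `n_head` here simply means the number of heads actually present in the tensor.

```python
import numpy as np

def reverse_permute(weights: np.ndarray, n_head: int) -> np.ndarray:
    # Same reshape/swapaxes/reshape as the quoted function, without the n_kv_head branch.
    head_dim = weights.shape[0] // n_head
    return (
        weights.reshape(n_head, 2, head_dim // 2, *weights.shape[1:])
        .swapaxes(1, 2)
        .reshape(weights.shape)
    )

def hf_style_permute(weights: np.ndarray, n_head: int) -> np.ndarray:
    # Assumed forward permutation: de-interleave each head's rows into two halves.
    head_dim = weights.shape[0] // n_head
    return (
        weights.reshape(n_head, head_dim // 2, 2, *weights.shape[1:])
        .swapaxes(1, 2)
        .reshape(weights.shape)
    )

# Toy weight: 2 heads, head_dim = 4, one column; each row holds its own index.
rows = np.arange(8).reshape(8, 1)

interleaved = reverse_permute(rows, n_head=2)
row_order = interleaved[:, 0].tolist()
print(row_order)  # [0, 2, 1, 3, 4, 6, 5, 7]

# Applying the assumed forward permutation restores the original row order.
round_trip = hf_style_permute(interleaved, n_head=2)
print(np.array_equal(round_trip, rows))  # True
```

In this toy case each head's rows [0, 1, 2, 3] become [0, 2, 1, 3]: the first and second halves are interleaved per head. The result is only meaningful if the `n_head` passed in matches the real number of heads in the tensor and every head has the same, even, number of rows, which is worth checking for each pruned layer.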

gudehhh666 commented 1 month ago

We have uploaded our pruned model to Hugging Face: PeterKKQ/llama3.1_cutting_0.2_4-30. Anyone interested in this issue is welcome to download the model and try converting it to .gguf to help us figure it out!