ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Wrong number of tensors when running inference #5518

Closed by hoaileba 7 months ago

hoaileba commented 7 months ago

I converted GPT-2 to a GGUF file and quantized the model to q4_0.
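For reference, the F16 GGUF came out of convert-hf-to-gguf.py; the invocation was along these lines (the model directory and output path here are illustrative, not the exact ones used):

python convert-hf-to-gguf.py ./gpt2-base-ver7 --outtype f16 --outfile models/gpt_f16.gguf

Quantizing then ran without errors: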

./quantize models/gpt_f16.gguf models/gpt2_q4_0.gguf q4_0
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
main: build = 2074 (098f6d73)
main: built with cc (Ubuntu 10.3.0-1ubuntu1~18.04~1) 10.3.0 for x86_64-linux-gnu
main: quantizing 'models/gpt_f16.gguf' to 'models/gpt2_q4_0.gguf' as Q4_0
llama_model_loader: loaded meta data with 15 key-value pairs and 150 tensors from models/gpt_f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt2
llama_model_loader: - kv   1:                               general.name str              = gpt2-base-ver7
llama_model_loader: - kv   2:                           gpt2.block_count u32              = 12
llama_model_loader: - kv   3:                        gpt2.context_length u32              = 1024
llama_model_loader: - kv   4:                      gpt2.embedding_length u32              = 768
llama_model_loader: - kv   5:                   gpt2.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:                  gpt2.attention.head_count u32              = 12
llama_model_loader: - kv   7:          gpt2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  10:                      tokenizer.ggml.tokens arr[str,50259]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  11:                  tokenizer.ggml.token_type arr[i32,50259]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  12:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - type  f32:   98 tensors
llama_model_loader: - type  f16:   52 tensors
llama_model_quantize_internal ============ Strange model: n_attention_wv = 12, n_ffn_down = 24, hparams.n_layer = 12
llama_model_quantize_internal: meta size = 1775136 bytes
[   1/ 150]                    token_embd.weight - [  768, 50259,     1,     1], type =    f16, quantizing to q4_0 .. size =    73.62 MiB ->    20.71 MiB | hist: 0.035 0.019 0.017 0.046 0.044 0.081 0.105 0.136 0.222 0.125 0.069 0.042 0.024 0.016 0.012 0.007 
[   2/ 150]                        output.weight - [  768, 50259,     1,     1], type =    f16, quantizing to q6_K .. size =    73.62 MiB ->    30.20 MiB
[   3/ 150]                 position_embd.weight - [  768,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.50 MiB ->     0.42 MiB | hist: 0.035 0.011 0.017 0.025 0.036 0.049 0.064 0.085 0.393 0.085 0.063 0.048 0.036 0.024 0.016 0.014 
[   4/ 150]               blk.0.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[   5/ 150]                 blk.0.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[   6/ 150]                blk.0.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.013 0.020 0.031 0.046 0.067 0.096 0.131 0.151 0.130 0.096 0.067 0.047 0.031 0.021 0.018 
[   7/ 150]                  blk.0.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[   8/ 150]             blk.0.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.096 0.114 0.121 0.114 0.096 0.076 0.055 0.038 0.024 0.020 
[   9/ 150]               blk.0.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  10/ 150]                blk.0.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  11/ 150]                  blk.0.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  12/ 150]                  blk.0.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.096 0.112 0.119 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[  13/ 150]                    blk.0.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  14/ 150]                blk.0.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.115 0.122 0.114 0.097 0.075 0.054 0.037 0.024 0.020 
[  15/ 150]                  blk.0.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  16/ 150]               blk.1.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  17/ 150]                 blk.1.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  18/ 150]                blk.1.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.024 0.037 0.053 0.074 0.097 0.116 0.126 0.117 0.097 0.075 0.054 0.036 0.024 0.020 
[  19/ 150]                  blk.1.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  20/ 150]             blk.1.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.024 0.037 0.054 0.075 0.097 0.114 0.122 0.115 0.097 0.075 0.055 0.037 0.025 0.020 
[  21/ 150]               blk.1.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  22/ 150]                blk.1.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  23/ 150]                  blk.1.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  24/ 150]                  blk.1.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.113 0.120 0.113 0.097 0.076 0.055 0.038 0.025 0.021 
[  25/ 150]                    blk.1.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  26/ 150]                blk.1.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.035 0.010 0.014 0.021 0.031 0.048 0.084 0.161 0.224 0.161 0.084 0.048 0.031 0.021 0.014 0.013 
[  27/ 150]                  blk.1.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  28/ 150]               blk.2.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  29/ 150]                 blk.2.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  30/ 150]                blk.2.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.024 0.037 0.054 0.075 0.096 0.116 0.126 0.116 0.097 0.075 0.054 0.037 0.024 0.019 
[  31/ 150]                  blk.2.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  32/ 150]             blk.2.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.076 0.096 0.111 0.118 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[  33/ 150]               blk.2.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  34/ 150]                blk.2.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  35/ 150]                  blk.2.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  36/ 150]                  blk.2.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.099 0.115 0.121 0.114 0.096 0.074 0.053 0.036 0.023 0.019 
[  37/ 150]                    blk.2.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  38/ 150]                blk.2.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.014 0.023 0.035 0.052 0.073 0.098 0.121 0.133 0.121 0.096 0.072 0.051 0.034 0.022 0.019 
[  39/ 150]                  blk.2.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  40/ 150]               blk.3.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  41/ 150]                 blk.3.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  42/ 150]                blk.3.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.075 0.097 0.115 0.124 0.115 0.097 0.075 0.055 0.037 0.024 0.020 
[  43/ 150]                  blk.3.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  44/ 150]             blk.3.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.026 0.039 0.057 0.077 0.097 0.110 0.115 0.110 0.096 0.077 0.057 0.039 0.026 0.021 
[  45/ 150]               blk.3.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  46/ 150]                blk.3.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  47/ 150]                  blk.3.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  48/ 150]                  blk.3.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.014 0.023 0.036 0.054 0.076 0.099 0.117 0.124 0.117 0.098 0.075 0.053 0.035 0.023 0.019 
[  49/ 150]                    blk.3.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  50/ 150]                blk.3.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.097 0.115 0.123 0.115 0.097 0.075 0.054 0.037 0.024 0.020 
[  51/ 150]                  blk.3.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  52/ 150]               blk.4.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  53/ 150]                 blk.4.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  54/ 150]                blk.4.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.024 0.038 0.056 0.076 0.096 0.114 0.122 0.114 0.096 0.076 0.055 0.038 0.025 0.020 
[  55/ 150]                  blk.4.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  56/ 150]             blk.4.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.026 0.040 0.057 0.077 0.096 0.111 0.115 0.110 0.097 0.077 0.057 0.039 0.025 0.021 
[  57/ 150]               blk.4.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  58/ 150]                blk.4.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  59/ 150]                  blk.4.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  60/ 150]                  blk.4.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.014 0.023 0.036 0.054 0.076 0.098 0.116 0.123 0.116 0.098 0.076 0.054 0.036 0.023 0.019 
[  61/ 150]                    blk.4.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  62/ 150]                blk.4.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.113 0.097 0.076 0.056 0.038 0.025 0.020 
[  63/ 150]                  blk.4.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  64/ 150]               blk.5.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  65/ 150]                 blk.5.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  66/ 150]                blk.5.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.120 0.114 0.097 0.076 0.055 0.038 0.025 0.020 
[  67/ 150]                  blk.5.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  68/ 150]             blk.5.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.036 0.016 0.026 0.040 0.056 0.077 0.096 0.111 0.115 0.110 0.096 0.077 0.057 0.040 0.026 0.021 
[  69/ 150]               blk.5.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  70/ 150]                blk.5.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  71/ 150]                  blk.5.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  72/ 150]                  blk.5.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.098 0.115 0.122 0.115 0.098 0.076 0.055 0.037 0.024 0.020 
[  73/ 150]                    blk.5.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  74/ 150]                blk.5.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[  75/ 150]                  blk.5.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  76/ 150]               blk.6.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  77/ 150]                 blk.6.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  78/ 150]                blk.6.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.096 0.113 0.120 0.113 0.096 0.076 0.056 0.038 0.025 0.020 
[  79/ 150]                  blk.6.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  80/ 150]             blk.6.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.026 0.039 0.057 0.077 0.096 0.111 0.115 0.111 0.096 0.077 0.057 0.040 0.026 0.021 
[  81/ 150]               blk.6.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  82/ 150]                blk.6.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  83/ 150]                  blk.6.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  84/ 150]                  blk.6.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.098 0.114 0.121 0.115 0.098 0.076 0.055 0.037 0.024 0.020 
[  85/ 150]                    blk.6.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  86/ 150]                blk.6.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.076 0.056 0.038 0.025 0.021 
[  87/ 150]                  blk.6.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  88/ 150]               blk.7.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  89/ 150]                 blk.7.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  90/ 150]                blk.7.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.016 0.025 0.038 0.056 0.076 0.096 0.113 0.119 0.112 0.096 0.076 0.056 0.038 0.025 0.020 
[  91/ 150]                  blk.7.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  92/ 150]             blk.7.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.036 0.016 0.026 0.039 0.057 0.077 0.096 0.111 0.115 0.110 0.096 0.077 0.058 0.039 0.025 0.021 
[  93/ 150]               blk.7.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  94/ 150]                blk.7.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  95/ 150]                  blk.7.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  96/ 150]                  blk.7.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.098 0.114 0.120 0.113 0.097 0.076 0.055 0.037 0.024 0.020 
[  97/ 150]                    blk.7.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  98/ 150]                blk.7.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[  99/ 150]                  blk.7.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 100/ 150]               blk.8.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 101/ 150]                 blk.8.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 102/ 150]                blk.8.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.096 0.112 0.119 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[ 103/ 150]                  blk.8.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 104/ 150]             blk.8.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.115 0.110 0.097 0.077 0.057 0.039 0.026 0.021 
[ 105/ 150]               blk.8.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 106/ 150]                blk.8.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 107/ 150]                  blk.8.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 108/ 150]                  blk.8.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.055 0.076 0.098 0.113 0.119 0.113 0.097 0.077 0.056 0.038 0.024 0.020 
[ 109/ 150]                    blk.8.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[ 110/ 150]                blk.8.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.076 0.056 0.039 0.025 0.021 
[ 111/ 150]                  blk.8.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 112/ 150]               blk.9.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 113/ 150]                 blk.9.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 114/ 150]                blk.9.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.016 0.025 0.038 0.056 0.076 0.096 0.112 0.119 0.113 0.097 0.076 0.056 0.038 0.025 0.021 
[ 115/ 150]                  blk.9.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 116/ 150]             blk.9.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.026 0.039 0.056 0.077 0.096 0.110 0.115 0.110 0.096 0.078 0.057 0.040 0.026 0.021 
[ 117/ 150]               blk.9.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 118/ 150]                blk.9.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 119/ 150]                  blk.9.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 120/ 150]                  blk.9.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.113 0.119 0.113 0.097 0.076 0.056 0.038 0.025 0.020 
[ 121/ 150]                    blk.9.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[ 122/ 150]                blk.9.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.097 0.113 0.118 0.112 0.096 0.076 0.056 0.038 0.025 0.020 
[ 123/ 150]                  blk.9.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 124/ 150]              blk.10.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 125/ 150]                blk.10.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 126/ 150]               blk.10.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.097 0.112 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.021 
[ 127/ 150]                 blk.10.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 128/ 150]            blk.10.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.026 0.040 0.057 0.077 0.096 0.111 0.115 0.110 0.096 0.076 0.057 0.039 0.025 0.021 
[ 129/ 150]              blk.10.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 130/ 150]               blk.10.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 131/ 150]                 blk.10.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 132/ 150]                 blk.10.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[ 133/ 150]                   blk.10.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[ 134/ 150]               blk.10.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.038 0.056 0.076 0.097 0.114 0.121 0.113 0.097 0.076 0.055 0.038 0.024 0.020 
[ 135/ 150]                 blk.10.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 136/ 150]              blk.11.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 137/ 150]                blk.11.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 138/ 150]               blk.11.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.097 0.112 0.118 0.112 0.096 0.076 0.056 0.038 0.025 0.021 
[ 139/ 150]                 blk.11.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 140/ 150]            blk.11.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.076 0.096 0.111 0.116 0.111 0.097 0.076 0.057 0.039 0.025 0.021 
[ 141/ 150]              blk.11.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 142/ 150]               blk.11.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 143/ 150]                 blk.11.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 144/ 150]                 blk.11.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.037 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[ 145/ 150]                   blk.11.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[ 146/ 150]               blk.11.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.098 0.115 0.123 0.115 0.097 0.076 0.055 0.037 0.024 0.019 
[ 147/ 150]                 blk.11.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 148/ 150]                   output_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 149/ 150]                     output_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 150/ 150]                        output.weight - [  768, 50259,     1,     1], type =    f16, quantizing to q6_K .. size =    73.62 MiB ->    30.20 MiB
llama_model_quantize_internal: model size  =   384.83 MB
llama_model_quantize_internal: quant size  =   127.55 MB
llama_model_quantize_internal: hist: 0.036 0.016 0.022 0.040 0.051 0.077 0.099 0.121 0.156 0.118 0.088 0.065 0.045 0.030 0.020 0.016 
main: quantize time =  1114.07 ms
main:    total time =  1114.07 ms

But when I run inference, I hit this error:

CUDA_VISIBLE_DEVICES=0 ./main -m models/gpt2_q4_0.gguf --n-gpu-layers 80 -t 16 --color -c 2048 --temp 0.75 --repeat_penalty 1.0 -n -1 -p "<startofstring> [question]: What is AI?\n[Response]:"
Log start
main: build = 2074 (098f6d73)
main: built with cc (Ubuntu 10.3.0-1ubuntu1~18.04~1) 10.3.0 for x86_64-linux-gnu
main: seed  = 1708059505
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 16 key-value pairs and 150 tensors from models/gpt2_q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt2
llama_model_loader: - kv   1:                               general.name str              = gpt2-base-ver7
llama_model_loader: - kv   2:                           gpt2.block_count u32              = 12
llama_model_loader: - kv   3:                        gpt2.context_length u32              = 1024
llama_model_loader: - kv   4:                      gpt2.embedding_length u32              = 768
llama_model_loader: - kv   5:                   gpt2.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:                  gpt2.attention.head_count u32              = 12
llama_model_loader: - kv   7:          gpt2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   8:                          general.file_type u32              = 2
llama_model_loader: - kv   9:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  10:                      tokenizer.ggml.tokens arr[str,50259]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  11:                  tokenizer.ggml.token_type arr[i32,50259]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  12:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  15:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   98 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q4_0:   50 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 3/50259 vs 2/50259 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gpt2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 50259
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 1024
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 3072
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 1024
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 0.1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 201.64 M
llm_load_print_meta: model size       = 127.55 MiB (5.31 BPW) 
llm_load_print_meta: general.name     = gpt2-base-ver7
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.11 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 150, got 149
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/gpt2_q4_0.gguf'
main: error: unable to load model

Did I make a mistake somewhere, and how can I fix this?

uwu-420 commented 4 months ago

Hi @hoaileba

I just ran into the same issue.

I'm curious: did you find a solution you could share?

uwu-420 commented 4 months ago

Removing this part did the trick for me.

https://github.com/ggerganov/llama.cpp/blob/3fe0596c1817a6114ffffb6dbfd6c36ca7815dc7/convert-hf-to-gguf.py#L1967C1-L1970C67
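For context: that block in the GPT2 handler unconditionally re-emits the token embedding under the name output.weight (GPT-2 ties the output projection to wte), which presumably collides when the source checkpoint already carries its own lm_head/output tensor. That matches the quantize log above, where output.weight appears twice ([2/150] and [150/150]), so the loader resolves only 149 unique tensor names where it expects 150. Paraphrasing the linked lines (from memory of that commit, so treat as approximate):

    # note: GPT2 output is tied to (same as) wte in original model,
    # so the converter re-emits the embedding data under "output.weight"
    if new_name == "token_embd.weight":
        print(f"output.weight, n_dims = {n_dims}, {old_dtype} --> {data.dtype}")
        self.gguf_writer.add_tensor("output.weight", data)

Deleting these lines stops the duplicate tensor from being written, and after reconverting and requantizing, the model loaded normally for me.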

FYI @manikbhandari