ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Wrong number of tensors when running inference #5518

Closed by hoaileba 7 months ago

hoaileba commented 7 months ago

I converted GPT-2 to a GGUF file and quantized the model to q4_0.
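For reference, the F16 GGUF came out of convert-hf-to-gguf.py; the invocation was along these lines (the model directory and output path here are illustrative, not the exact ones used):

python convert-hf-to-gguf.py ./gpt2-base-ver7 --outtype f16 --outfile models/gpt_f16.gguf

Quantizing then ran without errors: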

./quantize models/gpt_f16.gguf models/gpt2_q4_0.gguf q4_0
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
main: build = 2074 (098f6d73)
main: built with cc (Ubuntu 10.3.0-1ubuntu1~18.04~1) 10.3.0 for x86_64-linux-gnu
main: quantizing 'models/gpt_f16.gguf' to 'models/gpt2_q4_0.gguf' as Q4_0
llama_model_loader: loaded meta data with 15 key-value pairs and 150 tensors from models/gpt_f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt2
llama_model_loader: - kv   1:                               general.name str              = gpt2-base-ver7
llama_model_loader: - kv   2:                           gpt2.block_count u32              = 12
llama_model_loader: - kv   3:                        gpt2.context_length u32              = 1024
llama_model_loader: - kv   4:                      gpt2.embedding_length u32              = 768
llama_model_loader: - kv   5:                   gpt2.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:                  gpt2.attention.head_count u32              = 12
llama_model_loader: - kv   7:          gpt2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  10:                      tokenizer.ggml.tokens arr[str,50259]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  11:                  tokenizer.ggml.token_type arr[i32,50259]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  12:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - type  f32:   98 tensors
llama_model_loader: - type  f16:   52 tensors
llama_model_quantize_internal ============ Strange model: n_attention_wv = 12, n_ffn_down = 24, hparams.n_layer = 12
llama_model_quantize_internal: meta size = 1775136 bytes
[   1/ 150]                    token_embd.weight - [  768, 50259,     1,     1], type =    f16, quantizing to q4_0 .. size =    73.62 MiB ->    20.71 MiB | hist: 0.035 0.019 0.017 0.046 0.044 0.081 0.105 0.136 0.222 0.125 0.069 0.042 0.024 0.016 0.012 0.007 
[   2/ 150]                        output.weight - [  768, 50259,     1,     1], type =    f16, quantizing to q6_K .. size =    73.62 MiB ->    30.20 MiB
[   3/ 150]                 position_embd.weight - [  768,  1024,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.50 MiB ->     0.42 MiB | hist: 0.035 0.011 0.017 0.025 0.036 0.049 0.064 0.085 0.393 0.085 0.063 0.048 0.036 0.024 0.016 0.014 
[   4/ 150]               blk.0.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[   5/ 150]                 blk.0.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[   6/ 150]                blk.0.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.013 0.020 0.031 0.046 0.067 0.096 0.131 0.151 0.130 0.096 0.067 0.047 0.031 0.021 0.018 
[   7/ 150]                  blk.0.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[   8/ 150]             blk.0.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.096 0.114 0.121 0.114 0.096 0.076 0.055 0.038 0.024 0.020 
[   9/ 150]               blk.0.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  10/ 150]                blk.0.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  11/ 150]                  blk.0.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  12/ 150]                  blk.0.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.096 0.112 0.119 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[  13/ 150]                    blk.0.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  14/ 150]                blk.0.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.115 0.122 0.114 0.097 0.075 0.054 0.037 0.024 0.020 
[  15/ 150]                  blk.0.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  16/ 150]               blk.1.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  17/ 150]                 blk.1.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  18/ 150]                blk.1.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.024 0.037 0.053 0.074 0.097 0.116 0.126 0.117 0.097 0.075 0.054 0.036 0.024 0.020 
[  19/ 150]                  blk.1.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  20/ 150]             blk.1.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.024 0.037 0.054 0.075 0.097 0.114 0.122 0.115 0.097 0.075 0.055 0.037 0.025 0.020 
[  21/ 150]               blk.1.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  22/ 150]                blk.1.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  23/ 150]                  blk.1.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  24/ 150]                  blk.1.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.113 0.120 0.113 0.097 0.076 0.055 0.038 0.025 0.021 
[  25/ 150]                    blk.1.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  26/ 150]                blk.1.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.035 0.010 0.014 0.021 0.031 0.048 0.084 0.161 0.224 0.161 0.084 0.048 0.031 0.021 0.014 0.013 
[  27/ 150]                  blk.1.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  28/ 150]               blk.2.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  29/ 150]                 blk.2.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  30/ 150]                blk.2.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.024 0.037 0.054 0.075 0.096 0.116 0.126 0.116 0.097 0.075 0.054 0.037 0.024 0.019 
[  31/ 150]                  blk.2.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  32/ 150]             blk.2.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.076 0.096 0.111 0.118 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[  33/ 150]               blk.2.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  34/ 150]                blk.2.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  35/ 150]                  blk.2.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  36/ 150]                  blk.2.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.099 0.115 0.121 0.114 0.096 0.074 0.053 0.036 0.023 0.019 
[  37/ 150]                    blk.2.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  38/ 150]                blk.2.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.014 0.023 0.035 0.052 0.073 0.098 0.121 0.133 0.121 0.096 0.072 0.051 0.034 0.022 0.019 
[  39/ 150]                  blk.2.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  40/ 150]               blk.3.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  41/ 150]                 blk.3.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  42/ 150]                blk.3.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.075 0.097 0.115 0.124 0.115 0.097 0.075 0.055 0.037 0.024 0.020 
[  43/ 150]                  blk.3.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  44/ 150]             blk.3.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.026 0.039 0.057 0.077 0.097 0.110 0.115 0.110 0.096 0.077 0.057 0.039 0.026 0.021 
[  45/ 150]               blk.3.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  46/ 150]                blk.3.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  47/ 150]                  blk.3.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  48/ 150]                  blk.3.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.014 0.023 0.036 0.054 0.076 0.099 0.117 0.124 0.117 0.098 0.075 0.053 0.035 0.023 0.019 
[  49/ 150]                    blk.3.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  50/ 150]                blk.3.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.097 0.115 0.123 0.115 0.097 0.075 0.054 0.037 0.024 0.020 
[  51/ 150]                  blk.3.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  52/ 150]               blk.4.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  53/ 150]                 blk.4.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  54/ 150]                blk.4.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.024 0.038 0.056 0.076 0.096 0.114 0.122 0.114 0.096 0.076 0.055 0.038 0.025 0.020 
[  55/ 150]                  blk.4.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  56/ 150]             blk.4.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.026 0.040 0.057 0.077 0.096 0.111 0.115 0.110 0.097 0.077 0.057 0.039 0.025 0.021 
[  57/ 150]               blk.4.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  58/ 150]                blk.4.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  59/ 150]                  blk.4.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  60/ 150]                  blk.4.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.014 0.023 0.036 0.054 0.076 0.098 0.116 0.123 0.116 0.098 0.076 0.054 0.036 0.023 0.019 
[  61/ 150]                    blk.4.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  62/ 150]                blk.4.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.113 0.097 0.076 0.056 0.038 0.025 0.020 
[  63/ 150]                  blk.4.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  64/ 150]               blk.5.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  65/ 150]                 blk.5.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  66/ 150]                blk.5.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.120 0.114 0.097 0.076 0.055 0.038 0.025 0.020 
[  67/ 150]                  blk.5.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  68/ 150]             blk.5.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.036 0.016 0.026 0.040 0.056 0.077 0.096 0.111 0.115 0.110 0.096 0.077 0.057 0.040 0.026 0.021 
[  69/ 150]               blk.5.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  70/ 150]                blk.5.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  71/ 150]                  blk.5.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  72/ 150]                  blk.5.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.098 0.115 0.122 0.115 0.098 0.076 0.055 0.037 0.024 0.020 
[  73/ 150]                    blk.5.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  74/ 150]                blk.5.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[  75/ 150]                  blk.5.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  76/ 150]               blk.6.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  77/ 150]                 blk.6.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  78/ 150]                blk.6.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.096 0.113 0.120 0.113 0.096 0.076 0.056 0.038 0.025 0.020 
[  79/ 150]                  blk.6.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  80/ 150]             blk.6.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.026 0.039 0.057 0.077 0.096 0.111 0.115 0.111 0.096 0.077 0.057 0.040 0.026 0.021 
[  81/ 150]               blk.6.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  82/ 150]                blk.6.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  83/ 150]                  blk.6.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  84/ 150]                  blk.6.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.098 0.114 0.121 0.115 0.098 0.076 0.055 0.037 0.024 0.020 
[  85/ 150]                    blk.6.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  86/ 150]                blk.6.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.076 0.056 0.038 0.025 0.021 
[  87/ 150]                  blk.6.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  88/ 150]               blk.7.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  89/ 150]                 blk.7.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  90/ 150]                blk.7.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.016 0.025 0.038 0.056 0.076 0.096 0.113 0.119 0.112 0.096 0.076 0.056 0.038 0.025 0.020 
[  91/ 150]                  blk.7.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[  92/ 150]             blk.7.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.036 0.016 0.026 0.039 0.057 0.077 0.096 0.111 0.115 0.110 0.096 0.077 0.058 0.039 0.025 0.021 
[  93/ 150]               blk.7.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  94/ 150]                blk.7.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  95/ 150]                  blk.7.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[  96/ 150]                  blk.7.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.098 0.114 0.120 0.113 0.097 0.076 0.055 0.037 0.024 0.020 
[  97/ 150]                    blk.7.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[  98/ 150]                blk.7.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[  99/ 150]                  blk.7.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 100/ 150]               blk.8.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 101/ 150]                 blk.8.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 102/ 150]                blk.8.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.096 0.112 0.119 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[ 103/ 150]                  blk.8.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 104/ 150]             blk.8.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.115 0.110 0.097 0.077 0.057 0.039 0.026 0.021 
[ 105/ 150]               blk.8.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 106/ 150]                blk.8.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 107/ 150]                  blk.8.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 108/ 150]                  blk.8.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.055 0.076 0.098 0.113 0.119 0.113 0.097 0.077 0.056 0.038 0.024 0.020 
[ 109/ 150]                    blk.8.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[ 110/ 150]                blk.8.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.076 0.056 0.039 0.025 0.021 
[ 111/ 150]                  blk.8.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 112/ 150]               blk.9.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 113/ 150]                 blk.9.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 114/ 150]                blk.9.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.016 0.025 0.038 0.056 0.076 0.096 0.112 0.119 0.113 0.097 0.076 0.056 0.038 0.025 0.021 
[ 115/ 150]                  blk.9.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 116/ 150]             blk.9.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.026 0.039 0.056 0.077 0.096 0.110 0.115 0.110 0.096 0.078 0.057 0.040 0.026 0.021 
[ 117/ 150]               blk.9.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 118/ 150]                blk.9.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 119/ 150]                  blk.9.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 120/ 150]                  blk.9.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.113 0.119 0.113 0.097 0.076 0.056 0.038 0.025 0.020 
[ 121/ 150]                    blk.9.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[ 122/ 150]                blk.9.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.097 0.113 0.118 0.112 0.096 0.076 0.056 0.038 0.025 0.020 
[ 123/ 150]                  blk.9.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 124/ 150]              blk.10.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 125/ 150]                blk.10.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 126/ 150]               blk.10.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.097 0.112 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.021 
[ 127/ 150]                 blk.10.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 128/ 150]            blk.10.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.026 0.040 0.057 0.077 0.096 0.111 0.115 0.110 0.096 0.076 0.057 0.039 0.025 0.021 
[ 129/ 150]              blk.10.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 130/ 150]               blk.10.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 131/ 150]                 blk.10.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 132/ 150]                 blk.10.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[ 133/ 150]                   blk.10.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[ 134/ 150]               blk.10.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.038 0.056 0.076 0.097 0.114 0.121 0.113 0.097 0.076 0.055 0.038 0.024 0.020 
[ 135/ 150]                 blk.10.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 136/ 150]              blk.11.attn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 137/ 150]                blk.11.attn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 138/ 150]               blk.11.attn_qkv.weight - [  768,  2304,     1,     1], type =    f16, quantizing to q4_0 .. size =     3.38 MiB ->     0.95 MiB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.097 0.112 0.118 0.112 0.096 0.076 0.056 0.038 0.025 0.021 
[ 139/ 150]                 blk.11.attn_qkv.bias - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 140/ 150]            blk.11.attn_output.weight - [  768,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     1.12 MiB ->     0.32 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.076 0.096 0.111 0.116 0.111 0.097 0.076 0.057 0.039 0.025 0.021 
[ 141/ 150]              blk.11.attn_output.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 142/ 150]               blk.11.ffn_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 143/ 150]                 blk.11.ffn_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 144/ 150]                 blk.11.ffn_up.weight - [  768,  3072,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.037 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[ 145/ 150]                   blk.11.ffn_up.bias - [ 3072,     1,     1,     1], type =    f32, size =    0.012 MB
[ 146/ 150]               blk.11.ffn_down.weight - [ 3072,   768,     1,     1], type =    f16, quantizing to q4_0 .. size =     4.50 MiB ->     1.27 MiB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.098 0.115 0.123 0.115 0.097 0.076 0.055 0.037 0.024 0.019 
[ 147/ 150]                 blk.11.ffn_down.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 148/ 150]                   output_norm.weight - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 149/ 150]                     output_norm.bias - [  768,     1,     1,     1], type =    f32, size =    0.003 MB
[ 150/ 150]                        output.weight - [  768, 50259,     1,     1], type =    f16, quantizing to q6_K .. size =    73.62 MiB ->    30.20 MiB
llama_model_quantize_internal: model size  =   384.83 MB
llama_model_quantize_internal: quant size  =   127.55 MB
llama_model_quantize_internal: hist: 0.036 0.016 0.022 0.040 0.051 0.077 0.099 0.121 0.156 0.118 0.088 0.065 0.045 0.030 0.020 0.016 
main: quantize time =  1114.07 ms
main:    total time =  1114.07 ms

But when I run inference, I hit this error:

CUDA_VISIBLE_DEVICES=0 ./main -m models/gpt2_q4_0.gguf --n-gpu-layers 80 -t 16 --color -c 2048 --temp 0.75 --repeat_penalty 1.0 -n -1 -p "<startofstring> [question]: What is AI?\n[Response]:"
Log start
main: build = 2074 (098f6d73)
main: built with cc (Ubuntu 10.3.0-1ubuntu1~18.04~1) 10.3.0 for x86_64-linux-gnu
main: seed  = 1708059505
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 16 key-value pairs and 150 tensors from models/gpt2_q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt2
llama_model_loader: - kv   1:                               general.name str              = gpt2-base-ver7
llama_model_loader: - kv   2:                           gpt2.block_count u32              = 12
llama_model_loader: - kv   3:                        gpt2.context_length u32              = 1024
llama_model_loader: - kv   4:                      gpt2.embedding_length u32              = 768
llama_model_loader: - kv   5:                   gpt2.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:                  gpt2.attention.head_count u32              = 12
llama_model_loader: - kv   7:          gpt2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   8:                          general.file_type u32              = 2
llama_model_loader: - kv   9:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  10:                      tokenizer.ggml.tokens arr[str,50259]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  11:                  tokenizer.ggml.token_type arr[i32,50259]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  12:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  15:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   98 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q4_0:   50 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 3/50259 vs 2/50259 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gpt2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 50259
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 1024
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 3072
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 1024
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 0.1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 201.64 M
llm_load_print_meta: model size       = 127.55 MiB (5.31 BPW) 
llm_load_print_meta: general.name     = gpt2-base-ver7
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.11 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 150, got 149
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/gpt2_q4_0.gguf'
main: error: unable to load model

Did I make a mistake somewhere, and how can I fix this?

uwu-420 commented 4 months ago

Hi @hoaileba

I just ran into the same issue.

I'm curious: did you find a solution you could share?

uwu-420 commented 4 months ago

Removing this part did the trick for me.

https://github.com/ggerganov/llama.cpp/blob/3fe0596c1817a6114ffffb6dbfd6c36ca7815dc7/convert-hf-to-gguf.py#L1967C1-L1970C67
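For context: that block in the GPT2 handler unconditionally re-emits the token embedding under the name output.weight (GPT-2 ties the output projection to wte), which presumably collides when the source checkpoint already carries its own lm_head/output tensor. That matches the quantize log above, where output.weight appears twice ([2/150] and [150/150]), so the loader resolves only 149 unique tensor names where it expects 150. Paraphrasing the linked lines (from memory of that commit, so treat as approximate):

    # note: GPT2 output is tied to (same as) wte in original model,
    # so the converter re-emits the embedding data under "output.weight"
    if new_name == "token_embd.weight":
        print(f"output.weight, n_dims = {n_dims}, {old_dtype} --> {data.dtype}")
        self.gguf_writer.add_tensor("output.weight", data)

Deleting these lines stops the duplicate tensor from being written, and after reconverting and requantizing, the model loaded normally for me.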

FYI @manikbhandari