The same failure occurs for the Mosaic model. However, I think I've found the problem. The highlighted line here defines `std::string word;` outside of the vocab-loading loop, and that single string is then updated as each word in the vocabulary is loaded. Simply moving the definition of `word` inside the inner loop seems to allow correct tokenization and inference, at least on my platform/compiler; see the sketch below.
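Here is a minimal sketch of the fixed pattern (my own simplification, not the actual ggml source; `Vocab` and `load_vocab` are hypothetical stand-ins for the real structures):

```cpp
#include <cstdint>
#include <fstream>
#include <map>
#include <string>

// Hypothetical stand-in for ggml's vocab structure.
struct Vocab {
    std::map<std::string, int32_t> token_to_id;
    std::map<int32_t, std::string> id_to_token;
};

void load_vocab(std::ifstream & fin, Vocab & vocab, const int32_t n_vocab) {
    for (int32_t i = 0; i < n_vocab; i++) {
        uint32_t len = 0;
        fin.read((char *) &len, sizeof(len));

        // Declaring `word` here, inside the loop, guarantees a fresh,
        // correctly sized buffer for every token, instead of recycling
        // one string across all iterations.
        std::string word(len, '\0');
        fin.read(&word[0], len);   // &word[0] yields a mutable char *

        vocab.token_to_id[word] = i;
        vocab.id_to_token[i]    = word;
    }
}
```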
Also, `std::string::data()` returns a `const char *` in revisions of the C++ specification prior to C++17. The line in question casts away this const-ness before writing to the underlying storage, which is undefined behaviour, so on my compiler replacing `(char *)word.data()` with `&word[0]` also fixed the issue.
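To make the const-ness point concrete, here is a minimal standalone illustration (my own example, not code from the repository):

```cpp
#include <string>

int main() {
    std::string word(2, '\0');

    // Pre-C++17, data() returns `const char *` only; writing through a
    // cast like this is undefined behaviour:
    //   char * p = (char *) word.data();  // casts away const
    //   p[0] = 'x';                       // UB before C++17

    // Since C++11, operator[] returns a mutable `char &`, so taking the
    // address of the first element is a well-defined way to get writable
    // access to the string's storage:
    char * q = &word[0];
    q[0] = 'o';
    q[1] = 'k';

    return 0;
}
```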
I'm hitting an error while running RedPajama. It's likely the result of a misunderstanding on my part, so I'm hoping somebody can shed some light on what I'm doing wrong.
To begin with, I've cloned ggml at commit 74705055853f7922e9622bdd0a1ebde2b8f57431 and built it with gcc 9.4.0 on Linux x86. This completes without error.
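For context, a typical CMake build of the ggml example binaries looks roughly like this (a sketch only; the `gpt-neox` and `gpt-neox-quantize` target names are assumptions and may differ between revisions):

```sh
# Sketch of a typical ggml build; target names are assumed.
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-neox gpt-neox-quantize
```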
I've already cloned https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1, so I proceed to ggml conversion and then quantize the model.
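Roughly, the conversion and quantization steps look like the following sketch, assuming the gpt-neox example's converter and quantizer (the script path, output filenames, and the ftype/type arguments are all assumptions and may differ by revision):

```sh
# Sketch only: convert the HF checkpoint to ggml f16, then quantize to q4_0.
python3 examples/gpt-neox/convert-h5-to-ggml.py \
    /path/to/RedPajama-INCITE-Base-3B-v1 1        # 1 = f16 (assumed convention)

./build/bin/gpt-neox-quantize \
    /path/to/RedPajama-INCITE-Base-3B-v1/ggml-model-f16.bin \
    /path/to/RedPajama-INCITE-Base-3B-v1/ggml-model-q4_0.bin 2   # 2 = q4_0 (assumed)
```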
And finally I attempt inference.
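A typical invocation would look something like this sketch (the binary name and the `-m`/`-p` flags are assumptions based on the ggml example programs):

```sh
# Sketch of running the gpt-neox example binary against the quantized model.
./build/bin/gpt-neox \
    -m /path/to/RedPajama-INCITE-Base-3B-v1/ggml-model-q4_0.bin \
    -p "The quick brown fox"
```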
In the output, errors of the form `gpt_tokenize: unknown token 'I'` appear and the generated text is nonsensical. I seem to get the same problem whether I use a 32-bit, 16-bit, or 4-bit model.

Does anything look amiss in the steps I've performed, or in the logs generated during conversion/quantization? Any help at all would be appreciated!