
"gpt_tokenize: unknown token" running RedPajama #163

markdjwilliams closed this issue 1 year ago

markdjwilliams commented 1 year ago

I'm hitting an error while running RedPajama. It's likely the result of a misunderstanding on my part, so I'm hoping somebody can shed some light on what I'm doing wrong.

To begin with, I cloned ggml at commit 74705055853f7922e9622bdd0a1ebde2b8f57431 and built it with gcc 9.4.0 on Linux x86:

mkdir build; cd build; cmake ..; make -j 12

This completes without error. I've already cloned https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1, so I proceed to the ggml conversion:

$ python examples/gpt-neox/convert-h5-to-ggml.py /tmp/RedPajama-INCITE-Base-3B-v1-HEAD/ 0
gpt_neox.embed_in.weight torch.Size([50432, 2560]) torch.float32
gpt_neox.layers.0.input_layernorm.weight torch.Size([2560]) torch.float32
gpt_neox.layers.0.input_layernorm.bias torch.Size([2560]) torch.float32
gpt_neox.layers.0.post_attention_layernorm.weight torch.Size([2560]) torch.float32
gpt_neox.layers.0.post_attention_layernorm.bias torch.Size([2560]) torch.float32
gpt_neox.layers.0.attention.bias torch.Size([1, 1, 2048, 2048]) torch.bool
gpt_neox.layers.0.attention.masked_bias torch.Size([]) torch.float32
gpt_neox.layers.0.attention.rotary_emb.inv_freq torch.Size([40]) torch.float32
gpt_neox.layers.0.attention.query_key_value.weight torch.Size([7680, 2560]) torch.float32
..... snip .....
gpt_neox.layers.31.attention.query_key_value.weight torch.Size([7680, 2560]) torch.float32
gpt_neox.layers.31.attention.query_key_value.bias torch.Size([7680]) torch.float32
gpt_neox.layers.31.attention.dense.weight torch.Size([2560, 2560]) torch.float32
gpt_neox.layers.31.attention.dense.bias torch.Size([2560]) torch.float32
gpt_neox.layers.31.mlp.dense_h_to_4h.weight torch.Size([10240, 2560]) torch.float32
gpt_neox.layers.31.mlp.dense_h_to_4h.bias torch.Size([10240]) torch.float32
gpt_neox.layers.31.mlp.dense_4h_to_h.weight torch.Size([2560, 10240]) torch.float32
gpt_neox.layers.31.mlp.dense_4h_to_h.bias torch.Size([2560]) torch.float32
gpt_neox.final_layer_norm.weight torch.Size([2560]) torch.float32
gpt_neox.final_layer_norm.bias torch.Size([2560]) torch.float32
embed_out.weight torch.Size([50432, 2560]) torch.float32
{'_name_or_path': 'rp_3b_800b', 'architectures': ['GPTNeoXForCausalLM'], 'bos_token_id': 0, 'eos_token_id': 0, 'hidden_act': 'gelu', 'hidden_size': 2560, 'initializer_range': 0.02, 'intermediate_size': 10240, 'layer_norm_eps': 1e-05, 'max_position_embeddings': 2048, 'model_type': 'gpt_neox', 'num_attention_heads': 32, 'num_hidden_layers': 32, 'rotary_emb_base': 10000, 'rotary_pct': 1.0, 'tie_word_embeddings': False, 'torch_dtype': 'float16', 'transformers_version': '4.28.1', 'use_cache': True, 'use_parallel_residual': False, 'vocab_size': 50432}
Processing variable: gpt_neox.embed_in.weight with shape:  (50432, 2560)
Processing variable: gpt_neox.layers.0.input_layernorm.weight with shape:  (2560,)
Processing variable: gpt_neox.layers.0.input_layernorm.bias with shape:  (2560,)
Processing variable: gpt_neox.layers.0.post_attention_layernorm.weight with shape:  (2560,)
Processing variable: gpt_neox.layers.0.post_attention_layernorm.bias with shape:  (2560,)
Processing variable: gpt_neox.layers.0.attention.bias with shape:  (2048, 2048)
  Skipping variable: gpt_neox.layers.0.attention.bias
Processing variable: gpt_neox.layers.0.attention.masked_bias with shape:  ()
  Skipping variable: gpt_neox.layers.0.attention.masked_bias
Processing variable: gpt_neox.layers.0.attention.rotary_emb.inv_freq with shape:  (40,)
  Skipping variable: gpt_neox.layers.0.attention.rotary_emb.inv_freq
Processing variable: gpt_neox.layers.0.attention.query_key_value.weight with shape:  (7680, 2560)
Processing variable: gpt_neox.layers.0.attention.query_key_value.bias with shape:  (7680,)
Processing variable: gpt_neox.layers.0.attention.dense.weight with shape:  (2560, 2560)
Processing variable: gpt_neox.layers.0.attention.dense.bias with shape:  (2560,)
.... snip ....
Processing variable: gpt_neox.layers.31.attention.rotary_emb.inv_freq with shape:  (40,)
  Skipping variable: gpt_neox.layers.31.attention.rotary_emb.inv_freq
Processing variable: gpt_neox.layers.31.attention.query_key_value.weight with shape:  (7680, 2560)
Processing variable: gpt_neox.layers.31.attention.query_key_value.bias with shape:  (7680,)
Processing variable: gpt_neox.layers.31.attention.dense.weight with shape:  (2560, 2560)
Processing variable: gpt_neox.layers.31.attention.dense.bias with shape:  (2560,)
Processing variable: gpt_neox.layers.31.mlp.dense_h_to_4h.weight with shape:  (10240, 2560)
Processing variable: gpt_neox.layers.31.mlp.dense_h_to_4h.bias with shape:  (10240,)
Processing variable: gpt_neox.layers.31.mlp.dense_4h_to_h.weight with shape:  (2560, 10240)
Processing variable: gpt_neox.layers.31.mlp.dense_4h_to_h.bias with shape:  (2560,)
Processing variable: gpt_neox.final_layer_norm.weight with shape:  (2560,)
Processing variable: gpt_neox.final_layer_norm.bias with shape:  (2560,)
Processing variable: embed_out.weight with shape:  (50432, 2560)
Done. Output file: /tmp/ggml-model-f32.bin

Next, I quantize the model:

$ gpt-neox-quantize /tmp/ggml-model-f32.bin /tmp/q4_0.bin "q4_0"
gpt_neox_model_quantize: loading model from '/tmp/ggml-model-f32.bin'
gpt_neox_model_quantize: n_vocab     = 50432
gpt_neox_model_quantize: n_ctx       = 2048
gpt_neox_model_quantize: n_embd      = 2560
gpt_neox_model_quantize: n_head      = 32
gpt_neox_model_quantize: n_layer     = 32
gpt_neox_model_quantize: par_res     = 0
gpt_neox_model_quantize: ftype (src) = 0
gpt_neox_model_quantize: qntvr (src) = 0
gpt_neox_model_quantize: ftype (dst) = 1002
gpt_neox_model_quantize: qntvr (dst) = 1
                                        gpt_neox.embed_in.weight - [ 2560, 50432,     1], type =    f32 size =   492.50 MB ->    76.95 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.026 0.021 
                        gpt_neox.layers.0.input_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                          gpt_neox.layers.0.input_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
               gpt_neox.layers.0.post_attention_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                 gpt_neox.layers.0.post_attention_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
              gpt_neox.layers.0.attention.query_key_value.weight - [ 2560,  7680,     1], type =    f32 size =    75.00 MB ->    11.72 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.114 0.121 0.114 0.097 0.076 0.055 0.038 0.024 0.020 
                gpt_neox.layers.0.attention.query_key_value.bias - [ 7680,     1,     1], type =    f32 size =    0.029 MB
                        gpt_neox.layers.0.attention.dense.weight - [ 2560,  2560,     1], type =    f32 size =    25.00 MB ->     3.91 MB | hist: 0.036 0.013 0.021 0.033 0.051 0.074 0.099 0.122 0.132 0.122 0.099 0.074 0.051 0.033 0.021 0.017 
                          gpt_neox.layers.0.attention.dense.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                      gpt_neox.layers.0.mlp.dense_h_to_4h.weight - [ 2560, 10240,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
                        gpt_neox.layers.0.mlp.dense_h_to_4h.bias - [10240,     1,     1], type =    f32 size =    0.039 MB
                      gpt_neox.layers.0.mlp.dense_4h_to_h.weight - [10240,  2560,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.111 0.117 0.111 0.097 0.077 0.056 0.039 0.025 0.021 
                        gpt_neox.layers.0.mlp.dense_4h_to_h.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                        gpt_neox.layers.1.input_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                          gpt_neox.layers.1.input_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
               gpt_neox.layers.1.post_attention_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                 gpt_neox.layers.1.post_attention_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
              gpt_neox.layers.1.attention.query_key_value.weight - [ 2560,  7680,     1], type =    f32 size =    75.00 MB ->    11.72 MB | hist: 0.037 0.016 0.025 0.039 0.056 0.077 0.097 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021 
                gpt_neox.layers.1.attention.query_key_value.bias - [ 7680,     1,     1], type =    f32 size =    0.029 MB
                        gpt_neox.layers.1.attention.dense.weight - [ 2560,  2560,     1], type =    f32 size =    25.00 MB ->     3.91 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
                          gpt_neox.layers.1.attention.dense.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                      gpt_neox.layers.1.mlp.dense_h_to_4h.weight - [ 2560, 10240,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.037 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
                        gpt_neox.layers.1.mlp.dense_h_to_4h.bias - [10240,     1,     1], type =    f32 size =    0.039 MB
                      gpt_neox.layers.1.mlp.dense_4h_to_h.weight - [10240,  2560,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.111 0.097 0.077 0.056 0.039 0.025 0.021 
                        gpt_neox.layers.1.mlp.dense_4h_to_h.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                        gpt_neox.layers.2.input_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                          gpt_neox.layers.2.input_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
.... snip ....
              gpt_neox.layers.30.post_attention_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                gpt_neox.layers.30.post_attention_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
             gpt_neox.layers.30.attention.query_key_value.weight - [ 2560,  7680,     1], type =    f32 size =    75.00 MB ->    11.72 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021 
               gpt_neox.layers.30.attention.query_key_value.bias - [ 7680,     1,     1], type =    f32 size =    0.029 MB
                       gpt_neox.layers.30.attention.dense.weight - [ 2560,  2560,     1], type =    f32 size =    25.00 MB ->     3.91 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.076 0.056 0.039 0.025 0.021 
                         gpt_neox.layers.30.attention.dense.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                     gpt_neox.layers.30.mlp.dense_h_to_4h.weight - [ 2560, 10240,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
                       gpt_neox.layers.30.mlp.dense_h_to_4h.bias - [10240,     1,     1], type =    f32 size =    0.039 MB
                     gpt_neox.layers.30.mlp.dense_4h_to_h.weight - [10240,  2560,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.036 0.014 0.022 0.035 0.053 0.075 0.099 0.118 0.126 0.118 0.099 0.075 0.053 0.035 0.022 0.018 
                       gpt_neox.layers.30.mlp.dense_4h_to_h.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                       gpt_neox.layers.31.input_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                         gpt_neox.layers.31.input_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
              gpt_neox.layers.31.post_attention_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                gpt_neox.layers.31.post_attention_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
             gpt_neox.layers.31.attention.query_key_value.weight - [ 2560,  7680,     1], type =    f32 size =    75.00 MB ->    11.72 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.113 0.120 0.113 0.097 0.076 0.055 0.038 0.025 0.020 
               gpt_neox.layers.31.attention.query_key_value.bias - [ 7680,     1,     1], type =    f32 size =    0.029 MB
                       gpt_neox.layers.31.attention.dense.weight - [ 2560,  2560,     1], type =    f32 size =    25.00 MB ->     3.91 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.021 
                         gpt_neox.layers.31.attention.dense.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                     gpt_neox.layers.31.mlp.dense_h_to_4h.weight - [ 2560, 10240,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.117 0.111 0.097 0.077 0.056 0.039 0.025 0.021 
                       gpt_neox.layers.31.mlp.dense_h_to_4h.bias - [10240,     1,     1], type =    f32 size =    0.039 MB
                     gpt_neox.layers.31.mlp.dense_4h_to_h.weight - [10240,  2560,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.036 0.014 0.022 0.035 0.052 0.074 0.099 0.120 0.129 0.120 0.099 0.074 0.052 0.035 0.022 0.018 
                       gpt_neox.layers.31.mlp.dense_4h_to_h.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                                gpt_neox.final_layer_norm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                                  gpt_neox.final_layer_norm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                                                embed_out.weight - [ 2560, 50432,     1], type =    f32 size =   492.50 MB ->    76.95 MB | hist: 0.037 0.016 0.026 0.040 0.057 0.077 0.097 0.111 0.116 0.110 0.096 0.077 0.057 0.039 0.025 0.021 
ggml_common_quantize_0: model size  = 10589.08 MB
ggml_common_quantize_0: quant size  =  1657.99 MB | ftype = 2 (q4_0)
ggml_common_quantize_0: hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.039 0.025 0.021 

main: quantize time = 122311.53 ms
main:    total time = 122311.53 ms

And finally, I attempt inference:

$ gpt-neox -m /tmp/q4_0.bin -p "I believe the meaning of life is"
main: seed = 1684347948
gpt_neox_model_load: loading model from '/tmp/q4_0.bin' - please wait ...
gpt_neox_model_load: n_vocab = 50432
gpt_neox_model_load: n_ctx   = 2048
gpt_neox_model_load: n_embd  = 2560
gpt_neox_model_load: n_head  = 32
gpt_neox_model_load: n_layer = 32
gpt_neox_model_load: n_rot   = 80
gpt_neox_model_load: par_res = 0
gpt_neox_model_load: ftype   = 1002
gpt_neox_model_load: qntvr   = 1
gpt_neox_model_load: ggml ctx size = 3737.93 MB
gpt_neox_model_load: memory_size =   640.00 MB, n_mem = 65536
gpt_neox_model_load: ................................................ done
gpt_neox_model_load: model size =  1657.99 MB / num tensors = 388
gpt_tokenize: unknown token 'I'
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token 'e'
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token 'i'
gpt_tokenize: unknown token 's'
main: number of tokens in prompt = 5
main: token[0] =   2868,  believe
main: token[1] =    783, the
main: token[2] =   4495,  meaning
main: token[3] =   1171, of
main: token[4] =   5243,  lif

 believethe meaningof lif bovember Cl~ 2017ase New Testament teaches us that weially be born us

As you can see, errors of the form gpt_tokenize: unknown token 'I' appear and the output text is nonsensical. I seem to get the same problem whether I use a 32-bit, 16-bit, or 4-bit model.
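
For what it's worth, the pattern of failures looks like corrupted vocabulary entries rather than a bad prompt: a greedy longest-match tokenizer only reports a single character like 'I' as unknown when even that one-character string has no exact key in the vocab map. A rough sketch of the kind of lookup I mean (illustrative names only, not the actual gpt_tokenize implementation):

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Greedy longest-match tokenization against a vocab map (illustrative).
    std::vector<int> tokenize(const std::string & text, const std::map<std::string, int> & vocab) {
        std::vector<int> tokens;
        size_t i = 0;
        while (i < text.size()) {
            size_t best_len = 0;
            int    best_id  = -1;
            // Try the longest candidate first, shrinking until a vocab key matches.
            for (size_t len = text.size() - i; len > 0; len--) {
                auto it = vocab.find(text.substr(i, len));
                if (it != vocab.end()) {
                    best_len = len;
                    best_id  = it->second;
                    break;
                }
            }
            if (best_len == 0) {
                // Not even the single character is an exact key in the map,
                // which is what we would expect if the stored keys are corrupted.
                std::cerr << "unknown token '" << text[i] << "'\n";
                i++;
            } else {
                tokens.push_back(best_id);
                i += best_len;
            }
        }
        return tokens;
    }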

Does anything look amiss in the steps that I've performed or the logs which are generated from conversion/quantization? Any help at all would be appreciated!

markdjwilliams commented 1 year ago

The same failure occurs for the Mosaic model.

However, I think I've found the problem. The highlighted line here defines std::string word; outside of the vocab-loading loop, so the same string object is reused and updated as each word in the vocabulary is read. Simply moving the definition of word inside the loop seems to allow correct tokenization and inference, at least on my platform/compiler.
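
In other words, one std::string buffer is shared by every vocabulary entry. A simplified sketch of the pattern and of the fix (this is a reconstruction for illustration, not the exact ggml source; the resize call is assumed):

    #include <cstdint>
    #include <fstream>
    #include <map>
    #include <string>

    // Simplified reconstruction of the vocab loader described above.
    void load_vocab(std::ifstream & fin, int32_t n_vocab, std::map<std::string, int32_t> & token_to_id) {
        std::string word; // BUG: one string object reused across all iterations
        for (int32_t i = 0; i < n_vocab; i++) {
            uint32_t len;
            fin.read((char *) &len, sizeof(len));
            word.resize(len);
            fin.read((char *) word.data(), len); // writes through a cast-away-const pointer
            token_to_id[word] = i;
        }
    }

    // The fix that worked for me: declare the string inside the loop, so each
    // iteration starts from a fresh object:
    //
    //     for (int32_t i = 0; i < n_vocab; i++) {
    //         std::string word;
    //         ...
    //     }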

markdjwilliams commented 1 year ago

So std::string::data() returns a const char * under C++ standards prior to C++17; the non-const overload was only added in C++17.

This line casts away that constness before writing to the underlying storage, which is undefined behavior under those standards. On my compiler, replacing (char *)word.data() with &word[0] also fixed the issue.
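
To spell out the alternatives (a minimal sketch, assuming the surrounding loader logic; only the read line differs):

    #include <cstdint>
    #include <fstream>
    #include <string>

    void read_word(std::ifstream & fin, std::string & word) {
        uint32_t len;
        fin.read((char *) &len, sizeof(len));
        word.resize(len);

        // Pre-C++17: data() returns const char * only, so this write needs a
        // cast and is technically undefined behavior:
        //     fin.read((char *) word.data(), len);

        // Well-defined since C++11: operator[] returns a writable char&.
        fin.read(&word[0], len);

        // C++17 and later: data() gained a non-const overload, so the
        // cast-free form is also fine:
        //     fin.read(word.data(), len);
    }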