NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

GPT-NeoX HuggingFace Converter does not work #540

Open ankit-db opened 1 year ago

ankit-db commented 1 year ago

Branch/Tag/Commit

main

Docker Image Version

not-specific-to-docker-image

GPU name

all GPUs

CUDA Driver

n/a

Reproduced Steps

Merely running the example at https://github.com/NVIDIA/FasterTransformer/blob/main/examples/pytorch/gptneox/utils/huggingface_jp_gptneox_convert.py does not appear to work, even with the version of Transformers pinned to the one listed in the comment. This seems to be because the HuggingFace weights have names like `gpt_neox.layers.0.post_attention_layernorm.weight` rather than names like `transformer.X`, which is what the code seems to be expecting.

Am I missing something here? It seems this code does not apply to this model's config.
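
For reference, a minimal sketch of how to inspect the checkpoint's parameter names (assuming `transformers` is installed; the model ID is the one from the reproduction command below):

```python
# Minimal inspection sketch: print the HuggingFace parameter names.
# The 20B checkpoint is large; any GPT-NeoX checkpoint shows the same naming scheme.
from transformers import GPTNeoXForCausalLM

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
for name, param in model.named_parameters():
    # Names come out as gpt_neox.layers.<i>.<...>, not transformer.<...>
    print(name, tuple(param.shape))
```
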
byshiue commented 1 year ago

Please provide reproduced steps.

ankit-db commented 1 year ago

@byshiue appreciate your quick response

python huggingface_jp_gptneox_convert.py -saved_dir ~/model-dir -in_file EleutherAI/gpt-neox-20b -trained_gpu_num 1 -infer_gpu_num 4
ankit-db commented 1 year ago

This fails to output config.ini because `n_head` is not in `hf_config` in this line. `n_head` is used in many other places as well, so all of those break too.

Taking a deeper look, this for loop does not match any of the weights, since their names are of the form `gpt_neox.embed_in.weight` while the pattern is trying to match different names.
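
For what it's worth, the GPT-NeoX config exposes `num_attention_heads` / `num_hidden_layers` rather than GPT-2-style keys like `n_head`, which is presumably why the lookup fails. A hedged sketch of the kind of key mapping a converter would need (`n_head` is the key mentioned above; the other GPT-2-style keys are illustrative):

```python
# Sketch only: pull GPT-2-style keys out of the GPT-NeoX config,
# which uses different key names for the same quantities.
from transformers import AutoConfig

hf_config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-20b").to_dict()

config_for_converter = {
    "n_head": hf_config["num_attention_heads"],
    "n_layer": hf_config["num_hidden_layers"],
    "n_embd": hf_config["hidden_size"],
    "vocab_size": hf_config["vocab_size"],
}
print(config_for_converter)
```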

wangruo91 commented 1 year ago

I also hit this issue: the names in `model.named_parameters()` don't match the names in the for loop. Not only the names differ, but also the number of parameters.

byshiue commented 1 year ago

The converter is only used for this kind of model: https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese

It is not a GPT-NeoX converter. For GPT-NeoX, we only have this converter: https://github.com/NVIDIA/FasterTransformer/blob/main/examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py

ankit-db commented 1 year ago

@byshiue I see - aren't those the same architecture with a different tokenizer? In your opinion, if I compared the layer names and wrote a converter similar to the jp one, would it work? I'm happy to contribute it.
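
In case it helps the comparison, here is a rough sketch for diffing the parameter-name structure of two checkpoints (the model IDs are just examples of small checkpoints for each variant):

```python
# Sketch: compare parameter-name structure between a regular GPT-NeoX checkpoint
# and a Japanese one; collapse layer indices so only the structure is compared.
import re
from transformers import AutoModelForCausalLM

def name_structure(model_id):
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return {re.sub(r"\.\d+\.", ".N.", name) for name, _ in model.named_parameters()}

regular = name_structure("EleutherAI/pythia-70m")           # regular GPT-NeoX architecture
japanese = name_structure("rinna/japanese-gpt-neox-small")  # Japanese checkpoint
print(sorted(regular ^ japanese))  # empty output => identical naming structure
```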

ankit-db commented 1 year ago

@byshiue are you sure it works for GPT-NeoX Japanese? I just tried loading that model: it also doesn't have `n_head`, and its parameter names have pretty much the same structure as regular GPT-NeoX.

byshiue commented 1 year ago

The tested model is this one: https://huggingface.co/rinna/japanese-gpt-neox-small. Are you using this model?

ankit-db commented 1 year ago

Yes - here are the weight names from that model:

gpt_neox.embed_in.weight
gpt_neox.layers.0.input_layernorm.weight
gpt_neox.layers.0.input_layernorm.bias
gpt_neox.layers.0.post_attention_layernorm.weight
gpt_neox.layers.0.post_attention_layernorm.bias
gpt_neox.layers.0.attention.query_key_value.weight
gpt_neox.layers.0.attention.query_key_value.bias
gpt_neox.layers.0.attention.dense.weight
gpt_neox.layers.0.attention.dense.bias
gpt_neox.layers.0.mlp.dense_h_to_4h.weight
gpt_neox.layers.0.mlp.dense_h_to_4h.bias
gpt_neox.layers.0.mlp.dense_4h_to_h.weight
gpt_neox.layers.0.mlp.dense_4h_to_h.bias
gpt_neox.layers.1.input_layernorm.weight
gpt_neox.layers.1.input_layernorm.bias
gpt_neox.layers.1.post_attention_layernorm.weight
gpt_neox.layers.1.post_attention_layernorm.bias
gpt_neox.layers.1.attention.query_key_value.weight
gpt_neox.layers.1.attention.query_key_value.bias
gpt_neox.layers.1.attention.dense.weight
gpt_neox.layers.1.attention.dense.bias
gpt_neox.layers.1.mlp.dense_h_to_4h.weight
gpt_neox.layers.1.mlp.dense_h_to_4h.bias
gpt_neox.layers.1.mlp.dense_4h_to_h.weight
gpt_neox.layers.1.mlp.dense_4h_to_h.bias
gpt_neox.layers.2.input_layernorm.weight
gpt_neox.layers.2.input_layernorm.bias
gpt_neox.layers.2.post_attention_layernorm.weight
gpt_neox.layers.2.post_attention_layernorm.bias
gpt_neox.layers.2.attention.query_key_value.weight
gpt_neox.layers.2.attention.query_key_value.bias
gpt_neox.layers.2.attention.dense.weight
gpt_neox.layers.2.attention.dense.bias
gpt_neox.layers.2.mlp.dense_h_to_4h.weight
gpt_neox.layers.2.mlp.dense_h_to_4h.bias
gpt_neox.layers.2.mlp.dense_4h_to_h.weight
gpt_neox.layers.2.mlp.dense_4h_to_h.bias
gpt_neox.layers.3.input_layernorm.weight
gpt_neox.layers.3.input_layernorm.bias
gpt_neox.layers.3.post_attention_layernorm.weight
gpt_neox.layers.3.post_attention_layernorm.bias
gpt_neox.layers.3.attention.query_key_value.weight
gpt_neox.layers.3.attention.query_key_value.bias
gpt_neox.layers.3.attention.dense.weight
gpt_neox.layers.3.attention.dense.bias
gpt_neox.layers.3.mlp.dense_h_to_4h.weight
gpt_neox.layers.3.mlp.dense_h_to_4h.bias
gpt_neox.layers.3.mlp.dense_4h_to_h.weight
gpt_neox.layers.3.mlp.dense_4h_to_h.bias
gpt_neox.layers.4.input_layernorm.weight
gpt_neox.layers.4.input_layernorm.bias
gpt_neox.layers.4.post_attention_layernorm.weight
gpt_neox.layers.4.post_attention_layernorm.bias
gpt_neox.layers.4.attention.query_key_value.weight
gpt_neox.layers.4.attention.query_key_value.bias
gpt_neox.layers.4.attention.dense.weight
gpt_neox.layers.4.attention.dense.bias
gpt_neox.layers.4.mlp.dense_h_to_4h.weight
gpt_neox.layers.4.mlp.dense_h_to_4h.bias
gpt_neox.layers.4.mlp.dense_4h_to_h.weight
gpt_neox.layers.4.mlp.dense_4h_to_h.bias
gpt_neox.layers.5.input_layernorm.weight
gpt_neox.layers.5.input_layernorm.bias
gpt_neox.layers.5.post_attention_layernorm.weight
gpt_neox.layers.5.post_attention_layernorm.bias
gpt_neox.layers.5.attention.query_key_value.weight
gpt_neox.layers.5.attention.query_key_value.bias
gpt_neox.layers.5.attention.dense.weight
gpt_neox.layers.5.attention.dense.bias
gpt_neox.layers.5.mlp.dense_h_to_4h.weight
gpt_neox.layers.5.mlp.dense_h_to_4h.bias
gpt_neox.layers.5.mlp.dense_4h_to_h.weight
gpt_neox.layers.5.mlp.dense_4h_to_h.bias
gpt_neox.layers.6.input_layernorm.weight
gpt_neox.layers.6.input_layernorm.bias
gpt_neox.layers.6.post_attention_layernorm.weight
gpt_neox.layers.6.post_attention_layernorm.bias
gpt_neox.layers.6.attention.query_key_value.weight
gpt_neox.layers.6.attention.query_key_value.bias
gpt_neox.layers.6.attention.dense.weight
gpt_neox.layers.6.attention.dense.bias
gpt_neox.layers.6.mlp.dense_h_to_4h.weight
gpt_neox.layers.6.mlp.dense_h_to_4h.bias
gpt_neox.layers.6.mlp.dense_4h_to_h.weight
gpt_neox.layers.6.mlp.dense_4h_to_h.bias
gpt_neox.layers.7.input_layernorm.weight
gpt_neox.layers.7.input_layernorm.bias
gpt_neox.layers.7.post_attention_layernorm.weight
gpt_neox.layers.7.post_attention_layernorm.bias
gpt_neox.layers.7.attention.query_key_value.weight
gpt_neox.layers.7.attention.query_key_value.bias
gpt_neox.layers.7.attention.dense.weight
gpt_neox.layers.7.attention.dense.bias
gpt_neox.layers.7.mlp.dense_h_to_4h.weight
gpt_neox.layers.7.mlp.dense_h_to_4h.bias
gpt_neox.layers.7.mlp.dense_4h_to_h.weight
gpt_neox.layers.7.mlp.dense_4h_to_h.bias
gpt_neox.layers.8.input_layernorm.weight
gpt_neox.layers.8.input_layernorm.bias
gpt_neox.layers.8.post_attention_layernorm.weight
gpt_neox.layers.8.post_attention_layernorm.bias
gpt_neox.layers.8.attention.query_key_value.weight
gpt_neox.layers.8.attention.query_key_value.bias
gpt_neox.layers.8.attention.dense.weight
gpt_neox.layers.8.attention.dense.bias
gpt_neox.layers.8.mlp.dense_h_to_4h.weight
gpt_neox.layers.8.mlp.dense_h_to_4h.bias
gpt_neox.layers.8.mlp.dense_4h_to_h.weight
gpt_neox.layers.8.mlp.dense_4h_to_h.bias
gpt_neox.layers.9.input_layernorm.weight
gpt_neox.layers.9.input_layernorm.bias
gpt_neox.layers.9.post_attention_layernorm.weight
gpt_neox.layers.9.post_attention_layernorm.bias
gpt_neox.layers.9.attention.query_key_value.weight
gpt_neox.layers.9.attention.query_key_value.bias
gpt_neox.layers.9.attention.dense.weight
gpt_neox.layers.9.attention.dense.bias
gpt_neox.layers.9.mlp.dense_h_to_4h.weight
gpt_neox.layers.9.mlp.dense_h_to_4h.bias
gpt_neox.layers.9.mlp.dense_4h_to_h.weight
gpt_neox.layers.9.mlp.dense_4h_to_h.bias
gpt_neox.layers.10.input_layernorm.weight
gpt_neox.layers.10.input_layernorm.bias
gpt_neox.layers.10.post_attention_layernorm.weight
gpt_neox.layers.10.post_attention_layernorm.bias
gpt_neox.layers.10.attention.query_key_value.weight
gpt_neox.layers.10.attention.query_key_value.bias
gpt_neox.layers.10.attention.dense.weight
gpt_neox.layers.10.attention.dense.bias
gpt_neox.layers.10.mlp.dense_h_to_4h.weight
gpt_neox.layers.10.mlp.dense_h_to_4h.bias
gpt_neox.layers.10.mlp.dense_4h_to_h.weight
gpt_neox.layers.10.mlp.dense_4h_to_h.bias
gpt_neox.layers.11.input_layernorm.weight
gpt_neox.layers.11.input_layernorm.bias
gpt_neox.layers.11.post_attention_layernorm.weight
gpt_neox.layers.11.post_attention_layernorm.bias
gpt_neox.layers.11.attention.query_key_value.weight
gpt_neox.layers.11.attention.query_key_value.bias
gpt_neox.layers.11.attention.dense.weight
gpt_neox.layers.11.attention.dense.bias
gpt_neox.layers.11.mlp.dense_h_to_4h.weight
gpt_neox.layers.11.mlp.dense_h_to_4h.bias
gpt_neox.layers.11.mlp.dense_4h_to_h.weight
gpt_neox.layers.11.mlp.dense_4h_to_h.bias
gpt_neox.final_layer_norm.weight
gpt_neox.final_layer_norm.bias
embed_out.weight
ankit-db commented 1 year ago

As you can see, this naming structure will not work at all for the converter. Note that to load the model, I did: `model_jp_real = GPTNeoXForCausalLM.from_pretrained('rinna/japanese-gpt-neox-small')`

ankit-db commented 1 year ago

Just to confirm - is `embed_out` the equivalent of `lm_head` here? I would think so?
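
A small, hedged sanity check of that (same checkpoint as above):

```python
# Hypothetical sanity check: embed_out plays the role GPT-2 calls lm_head,
# i.e. the final projection from hidden size back to vocab size.
from transformers import GPTNeoXForCausalLM

model = GPTNeoXForCausalLM.from_pretrained("rinna/japanese-gpt-neox-small")
print(model.embed_out.weight.shape)          # (vocab_size, hidden_size), like lm_head
print(model.gpt_neox.embed_in.weight.shape)  # (vocab_size, hidden_size) as well
# GPT-NeoX does not tie embed_out to embed_in by default, unlike some GPT-2 checkpoints.
print(model.embed_out.weight.data_ptr() == model.gpt_neox.embed_in.weight.data_ptr())
```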

ankit-db commented 1 year ago

Also, does FasterTransformer require `model.layers.0.mlp.attention.bias.sum.bin`? What does precomputing the sum accomplish here?
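
For context on the `.bias.sum` question - this is hedged reasoning from how GPT-NeoX's parallel residual works, not a reading of the FT source: the attention output bias and the MLP output bias are both added to the same residual stream, so a converter can pre-sum them offline and the runtime then does a single bias add. Roughly:

```python
# Illustrative sketch of the pre-summed bias. With GPT-NeoX's parallel residual,
#   hidden = x + attn(ln1(x)) + mlp(ln2(x)),
# the attention output bias and the MLP output bias land on the same tensor,
# so they can be summed offline and applied once at inference time.
import numpy as np

def write_summed_bias(attn_dense_bias, mlp_dense_4h_to_h_bias, out_path):
    summed = (attn_dense_bias + mlp_dense_4h_to_h_bias).astype(np.float32)
    summed.tofile(out_path)  # e.g. model.layers.0.mlp.attention.bias.sum.bin
```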

wsxiaoys commented 1 year ago

https://github.com/TabbyML/tabby/blob/main/tabby/tools/converter/huggingface_gptneox_convert.py

This is a working version for generic GPT-NeoX models.

converted model: https://huggingface.co/TabbyML/NeoX-1.3B/tree/main/triton
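
Not that script verbatim, but the core of any generic GPT-NeoX converter is a rename plus a tensor-parallel split of the attention/MLP matrices. A rough sketch under those assumptions (file naming is modeled loosely on the existing converters, and the fused-QKV head reordering that real converters perform is omitted):

```python
# Rough sketch of the rename + tensor-parallel split a generic converter performs.
# File names are assumptions modeled on the existing FT converters; real converters
# also reorder the fused QKV layout per attention head, which is omitted here.
import numpy as np
import torch
from transformers import GPTNeoXForCausalLM

def convert(model_id: str, saved_dir: str, infer_gpu_num: int):
    model = GPTNeoXForCausalLM.from_pretrained(model_id)
    for name, param in model.state_dict().items():
        w = param.to(torch.float32).numpy()
        if name.endswith(("query_key_value.weight", "dense_h_to_4h.weight")):
            # Column-parallel: split the output dimension across GPUs.
            for rank, shard in enumerate(np.split(w.T, infer_gpu_num, axis=-1)):
                shard.tofile(f"{saved_dir}/{name}.{rank}.bin")
        elif name.endswith(("attention.dense.weight", "dense_4h_to_h.weight")):
            # Row-parallel: split the input dimension across GPUs.
            for rank, shard in enumerate(np.split(w.T, infer_gpu_num, axis=0)):
                shard.tofile(f"{saved_dir}/{name}.{rank}.bin")
        else:
            # Everything else (layernorms, embeddings, biases) is written whole in this sketch.
            w.tofile(f"{saved_dir}/{name}.bin")
```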

byshiue commented 1 year ago

I can convert the model with the following scripts without any issue; can you try again?

bhsueh@342094f67f67:/home/scratch.bhsueh_sw/FasterTransformer_new/h100_build$ git lfs  clone https://huggingface.co/rinna/japanese-gpt-neox-small 

WARNING: 'git lfs clone' is deprecated and will not be updated
          with new flags from 'git clone'

'git clone' has been updated in upstream Git to have comparable
speeds to 'git lfs clone'.
Cloning into 'japanese-gpt-neox-small'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 28 (delta 0), reused 0 (delta 0), pack-reused 23
Unpacking objects: 100% (28/28), 732.18 KiB | 1.63 MiB/s, done.

bhsueh@342094f67f67:/home/scratch.bhsueh_sw/FasterTransformer_new/h100_build$ python3 ../examples/pytorch/gpt/utils/huggingface_jp_gpt_convert.py  -i ./japanese-gpt-neox-small/ -o ./tmp -i_g 1 

=============== Argument ===============
saved_dir: ./tmp
in_file: ./japanese-gpt-neox-small/
trained_gpu_num: 1
infer_gpu_num: 1
processes: 4
weight_data_type: fp32
========================================
You are using a model of type gpt_neox to instantiate a model of type gpt2. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at ./japanese-gpt-neox-small/ were not used when initializing GPT2LMHeadModel: ['gpt_neox.layers.8.attention.query_key_value.bias', 'gpt_neox.layers.11.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.9.post_attention_layernorm.bias', 'gpt_neox.layers.9.attention.dense.bias', 'gpt_neox.final_layer_norm.weight', 'gpt_neox.layers.6.attention.dense.weight', 'gpt_neox.layers.9.attention.query_key_value.weight', 'gpt_neox.layers.1.attention.dense.weight', 'gpt_neox.layers.10.input_layernorm.bias', 'gpt_neox.layers.5.attention.rotary_emb.inv_freq', 'gpt_neox.layers.9.attention.rotary_emb.inv_freq', 'gpt_neox.layers.2.post_attention_layernorm.weight', 'gpt_neox.layers.1.post_attention_layernorm.bias', 'gpt_neox.layers.6.attention.rotary_emb.inv_freq', 'gpt_neox.layers.8.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.4.attention.masked_bias', 'gpt_neox.layers.5.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.0.attention.rotary_emb.inv_freq', 'gpt_neox.layers.7.attention.masked_bias', 'gpt_neox.layers.8.attention.dense.weight', 'gpt_neox.layers.10.mlp.dense_4h_to_h.bias', 'gpt_neox.embed_in.weight', 'gpt_neox.layers.8.attention.query_key_value.weight', 'gpt_neox.layers.3.mlp.dense_4h_to_h.bias', 'gpt_neox.layers.6.attention.bias', 'gpt_neox.layers.4.attention.dense.weight', 'gpt_neox.layers.4.input_layernorm.bias', 'gpt_neox.layers.0.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.7.post_attention_layernorm.bias', 'gpt_neox.layers.4.attention.query_key_value.weight', 'gpt_neox.layers.5.attention.masked_bias', 'gpt_neox.layers.1.post_attention_layernorm.weight', 'gpt_neox.layers.1.attention.masked_bias', 'gpt_neox.layers.3.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.3.attention.query_key_value.bias', 'gpt_neox.layers.10.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.1.attention.query_key_value.bias', 'gpt_neox.layers.3.attention.dense.bias', 'gpt_neox.layers.3.attention.dense.weight', 'gpt_neox.layers.11.post_attention_layernorm.bias', 'gpt_neox.layers.10.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.1.attention.bias', 'gpt_neox.layers.9.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.11.attention.bias', 'gpt_neox.layers.10.attention.bias', 'gpt_neox.layers.6.input_layernorm.weight', 'gpt_neox.layers.7.input_layernorm.weight', 'gpt_neox.layers.0.post_attention_layernorm.weight', 'gpt_neox.layers.2.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.2.input_layernorm.bias', 'gpt_neox.layers.9.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.0.post_attention_layernorm.bias', 'gpt_neox.layers.10.attention.dense.weight', 'gpt_neox.layers.5.attention.dense.bias', 'gpt_neox.layers.6.attention.dense.bias', 'gpt_neox.layers.0.input_layernorm.weight', 'gpt_neox.layers.8.attention.bias', 'gpt_neox.layers.5.post_attention_layernorm.bias', 'gpt_neox.layers.2.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.5.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.0.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.0.attention.bias', 'gpt_neox.layers.1.attention.rotary_emb.inv_freq', 'gpt_neox.layers.2.attention.dense.bias', 'gpt_neox.layers.2.mlp.dense_4h_to_h.bias', 'gpt_neox.layers.4.attention.query_key_value.bias', 'gpt_neox.layers.4.mlp.dense_4h_to_h.bias', 'gpt_neox.final_layer_norm.bias', 'gpt_neox.layers.0.input_layernorm.bias', 'gpt_neox.layers.0.attention.masked_bias', 'gpt_neox.layers.9.post_attention_layernorm.weight', 'gpt_neox.layers.3.post_attention_layernorm.weight', 'gpt_neox.layers.10.attention.rotary_emb.inv_freq', 'gpt_neox.layers.5.attention.bias', 'gpt_neox.layers.9.mlp.dense_4h_to_h.bias', 
'gpt_neox.layers.11.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.10.post_attention_layernorm.bias', 'gpt_neox.layers.11.input_layernorm.bias', 'gpt_neox.layers.3.input_layernorm.weight', 'gpt_neox.layers.6.attention.query_key_value.weight', 'gpt_neox.layers.7.attention.rotary_emb.inv_freq', 'gpt_neox.layers.8.post_attention_layernorm.bias', 'gpt_neox.layers.1.input_layernorm.weight', 'gpt_neox.layers.4.post_attention_layernorm.bias', 'gpt_neox.layers.0.attention.query_key_value.weight', 'gpt_neox.layers.11.attention.rotary_emb.inv_freq', 'gpt_neox.layers.4.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.0.attention.dense.bias', 'gpt_neox.layers.6.post_attention_layernorm.bias', 'gpt_neox.layers.7.attention.dense.bias', 'gpt_neox.layers.10.post_attention_layernorm.weight', 'gpt_neox.layers.6.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.2.attention.bias', 'gpt_neox.layers.0.attention.query_key_value.bias', 'gpt_neox.layers.9.attention.query_key_value.bias', 'gpt_neox.layers.3.post_attention_layernorm.bias', 'gpt_neox.layers.9.input_layernorm.bias', 'gpt_neox.layers.2.attention.dense.weight', 'gpt_neox.layers.4.post_attention_layernorm.weight', 'gpt_neox.layers.2.input_layernorm.weight', 'gpt_neox.layers.8.attention.masked_bias', 'gpt_neox.layers.9.input_layernorm.weight', 'gpt_neox.layers.3.attention.masked_bias', 'gpt_neox.layers.11.attention.dense.bias', 'gpt_neox.layers.6.post_attention_layernorm.weight', 'gpt_neox.layers.9.attention.bias', 'gpt_neox.layers.5.input_layernorm.weight', 'gpt_neox.layers.10.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.2.attention.rotary_emb.inv_freq', 'gpt_neox.layers.3.attention.query_key_value.weight', 'gpt_neox.layers.3.attention.rotary_emb.inv_freq', 'gpt_neox.layers.7.post_attention_layernorm.weight', 'gpt_neox.layers.10.attention.masked_bias', 'gpt_neox.layers.3.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.11.attention.masked_bias', 'gpt_neox.layers.10.attention.query_key_value.bias', 'gpt_neox.layers.5.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.11.attention.dense.weight', 'gpt_neox.layers.8.input_layernorm.weight', 'gpt_neox.layers.1.mlp.dense_4h_to_h.bias', 'gpt_neox.layers.5.post_attention_layernorm.weight', 'gpt_neox.layers.7.input_layernorm.bias', 'gpt_neox.layers.0.attention.dense.weight', 'gpt_neox.layers.5.input_layernorm.bias', 'gpt_neox.layers.6.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.7.attention.query_key_value.bias', 'gpt_neox.layers.3.attention.bias', 'gpt_neox.layers.11.attention.query_key_value.weight', 'gpt_neox.layers.1.attention.dense.bias', 'gpt_neox.layers.8.input_layernorm.bias', 'gpt_neox.layers.8.attention.rotary_emb.inv_freq', 'gpt_neox.layers.4.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.11.attention.query_key_value.bias', 'gpt_neox.layers.5.attention.query_key_value.weight', 'gpt_neox.layers.0.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.8.post_attention_layernorm.weight', 'gpt_neox.layers.2.attention.query_key_value.weight', 'gpt_neox.layers.10.attention.dense.bias', 'gpt_neox.layers.6.input_layernorm.bias', 'gpt_neox.layers.9.attention.dense.weight', 'gpt_neox.layers.6.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.11.post_attention_layernorm.weight', 'gpt_neox.layers.7.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.1.input_layernorm.bias', 'gpt_neox.layers.4.attention.rotary_emb.inv_freq', 'gpt_neox.layers.7.attention.query_key_value.weight', 'gpt_neox.layers.7.attention.bias', 'gpt_neox.layers.7.mlp.dense_4h_to_h.bias', 'gpt_neox.layers.4.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.6.attention.query_key_value.bias', 
'gpt_neox.layers.1.attention.query_key_value.weight', 'gpt_neox.layers.10.attention.query_key_value.weight', 'gpt_neox.layers.3.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.3.input_layernorm.bias', 'gpt_neox.layers.7.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.1.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.11.mlp.dense_4h_to_h.bias', 'gpt_neox.layers.4.attention.dense.bias', 'gpt_neox.layers.1.mlp.dense_4h_to_h.weight', 'gpt_neox.layers.5.mlp.dense_4h_to_h.bias', 'gpt_neox.layers.0.mlp.dense_4h_to_h.bias', 'gpt_neox.layers.2.post_attention_layernorm.bias', 'gpt_neox.layers.11.input_layernorm.weight', 'gpt_neox.layers.1.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.9.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.7.attention.dense.weight', 'gpt_neox.layers.9.attention.masked_bias', 'embed_out.weight', 'gpt_neox.layers.2.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.5.attention.dense.weight', 'gpt_neox.layers.8.mlp.dense_4h_to_h.bias', 'gpt_neox.layers.6.attention.masked_bias', 'gpt_neox.layers.2.attention.query_key_value.bias', 'gpt_neox.layers.10.input_layernorm.weight', 'gpt_neox.layers.8.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.7.mlp.dense_h_to_4h.bias', 'gpt_neox.layers.11.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.4.input_layernorm.weight', 'gpt_neox.layers.4.attention.bias', 'gpt_neox.layers.8.mlp.dense_h_to_4h.weight', 'gpt_neox.layers.6.mlp.dense_4h_to_h.bias', 'gpt_neox.layers.5.attention.query_key_value.bias', 'gpt_neox.layers.2.attention.masked_bias', 'gpt_neox.layers.8.attention.dense.bias']
- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at ./japanese-gpt-neox-small/ and are newly initialized: ['h.10.ln_2.weight', 'h.5.attn.c_proj.bias', 'h.9.ln_1.bias', 'h.8.attn.c_proj.weight', 'h.3.ln_2.weight', 'h.11.mlp.c_proj.weight', 'h.4.ln_2.weight', 'h.11.ln_1.bias', 'h.11.mlp.c_proj.bias', 'h.10.ln_1.bias', 'h.8.ln_1.weight', 'h.6.attn.c_proj.weight', 'h.11.attn.c_attn.weight', 'h.9.ln_2.bias', 'h.6.attn.c_proj.bias', 'h.9.mlp.c_proj.weight', 'h.9.mlp.c_fc.bias', 'h.11.mlp.c_fc.bias', 'h.0.attn.c_attn.weight', 'h.8.mlp.c_fc.bias', 'h.6.ln_2.bias', 'h.7.ln_1.weight', 'h.7.attn.c_proj.bias', 'h.7.mlp.c_proj.weight', 'h.4.ln_2.bias', 'h.10.attn.c_attn.weight', 'h.2.ln_1.weight', 'h.11.attn.c_proj.weight', 'h.4.mlp.c_fc.weight', 'h.2.attn.c_attn.weight', 'h.9.mlp.c_proj.bias', 'h.10.mlp.c_fc.weight', 'h.2.mlp.c_proj.weight', 'h.9.mlp.c_fc.weight', 'h.1.mlp.c_proj.bias', 'h.6.ln_2.weight', 'h.0.mlp.c_proj.bias', 'h.3.attn.c_proj.bias', 'h.10.mlp.c_proj.bias', 'h.3.ln_1.weight', 'h.2.mlp.c_proj.bias', 'h.9.attn.c_proj.weight', 'h.0.ln_2.weight', 'h.1.attn.c_proj.bias', 'h.10.attn.c_proj.bias', 'h.8.ln_1.bias', 'wpe.weight', 'ln_f.weight', 'h.5.ln_2.weight', 'ln_f.bias', 'h.11.mlp.c_fc.weight', 'h.0.ln_2.bias', 'h.5.ln_2.bias', 'h.0.mlp.c_proj.weight', 'h.3.attn.c_attn.weight', 'h.2.attn.c_proj.bias', 'h.9.ln_2.weight', 'h.7.ln_2.bias', 'h.6.mlp.c_proj.bias', 'h.1.ln_2.weight', 'h.6.ln_1.bias', 'h.11.ln_2.weight', 'h.10.mlp.c_fc.bias', 'h.5.mlp.c_fc.weight', 'h.1.mlp.c_proj.weight', 'h.1.attn.c_proj.weight', 'h.1.mlp.c_fc.bias', 'h.4.mlp.c_proj.weight', 'h.5.attn.c_proj.weight', 'h.3.ln_1.bias', 'h.0.mlp.c_fc.weight', 'h.2.attn.c_proj.weight', 'h.3.mlp.c_fc.weight', 'h.7.attn.c_attn.weight', 'h.4.attn.c_attn.weight', 'h.4.mlp.c_proj.bias', 'h.4.ln_1.weight', 'h.5.attn.c_attn.weight', 'h.5.mlp.c_proj.weight', 'h.4.attn.c_proj.bias', 'h.6.mlp.c_fc.weight', 'h.3.mlp.c_proj.bias', 'h.0.attn.c_proj.bias', 'h.8.mlp.c_fc.weight', 'h.0.attn.c_proj.weight', 'h.2.ln_1.bias', 'h.6.mlp.c_proj.weight', 'h.8.mlp.c_proj.weight', 'h.11.ln_2.bias', 'h.8.ln_2.bias', 'h.8.mlp.c_proj.bias', 'h.3.ln_2.bias', 'h.4.ln_1.bias', 'h.7.ln_1.bias', 'h.5.mlp.c_fc.bias', 'h.1.ln_1.bias', 'h.2.mlp.c_fc.weight', 'h.3.mlp.c_fc.bias', 'h.7.attn.c_proj.weight', 'h.11.attn.c_proj.bias', 'h.3.attn.c_proj.weight', 'h.7.mlp.c_fc.bias', 'h.7.ln_2.weight', 'h.1.mlp.c_fc.weight', 'h.10.ln_1.weight', 'h.11.ln_1.weight', 'h.2.mlp.c_fc.bias', 'h.10.attn.c_proj.weight', 'h.0.ln_1.weight', 'h.5.ln_1.weight', 'h.4.attn.c_proj.weight', 'wte.weight', 'h.1.ln_2.bias', 'h.6.attn.c_attn.weight', 'h.2.ln_2.bias', 'h.8.attn.c_proj.bias', 'h.7.mlp.c_fc.weight', 'h.8.ln_2.weight', 'h.4.mlp.c_fc.bias', 'h.6.ln_1.weight', 'h.10.ln_2.bias', 'h.5.mlp.c_proj.bias', 'h.3.mlp.c_proj.weight', 'h.9.attn.c_proj.bias', 'h.9.ln_1.weight', 'h.5.ln_1.bias', 'h.2.ln_2.weight', 'h.8.attn.c_attn.weight', 'h.9.attn.c_attn.weight', 'h.7.mlp.c_proj.bias', 'h.0.mlp.c_fc.bias', 'h.1.ln_1.weight', 'h.1.attn.c_attn.weight', 'h.6.mlp.c_fc.bias', 'h.0.ln_1.bias', 'h.10.mlp.c_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
transformer.wte.weight
transformer.wpe.weight
transformer.h.0.ln_1.weight
transformer.h.0.ln_1.bias
transformer.h.0.attn.bias
transformer.h.0.attn.masked_bias
transformer.h.0.attn.c_attn.weight
transformer.h.0.attn.c_attn.bias
transformer.h.0.attn.c_proj.weight
transformer.h.0.attn.c_proj.bias
transformer.h.0.ln_2.weight
transformer.h.0.ln_2.bias
transformer.h.0.mlp.c_fc.weight
transformer.h.0.mlp.c_fc.bias
transformer.h.0.mlp.c_proj.weight
transformer.h.0.mlp.c_proj.bias
transformer.h.1.ln_1.weight
transformer.h.1.ln_1.bias
transformer.h.1.attn.bias
transformer.h.1.attn.masked_bias
transformer.h.1.attn.c_attn.weight
transformer.h.1.attn.c_attn.bias
transformer.h.1.attn.c_proj.weight
transformer.h.1.attn.c_proj.bias
transformer.h.1.ln_2.weight
transformer.h.1.ln_2.bias
transformer.h.1.mlp.c_fc.weight
transformer.h.1.mlp.c_fc.bias
transformer.h.1.mlp.c_proj.weight
transformer.h.1.mlp.c_proj.bias
transformer.h.2.ln_1.weight
transformer.h.2.ln_1.bias
transformer.h.2.attn.bias
transformer.h.2.attn.masked_bias
transformer.h.2.attn.c_attn.weight
transformer.h.2.attn.c_attn.bias
transformer.h.2.attn.c_proj.weight
transformer.h.2.attn.c_proj.bias
transformer.h.2.ln_2.weight
transformer.h.2.ln_2.bias
transformer.h.2.mlp.c_fc.weight
transformer.h.2.mlp.c_fc.bias
transformer.h.2.mlp.c_proj.weight
transformer.h.2.mlp.c_proj.bias
transformer.h.3.ln_1.weight
transformer.h.3.ln_1.bias
transformer.h.3.attn.bias
transformer.h.3.attn.masked_bias
transformer.h.3.attn.c_attn.weight
transformer.h.3.attn.c_attn.bias
transformer.h.3.attn.c_proj.weight
transformer.h.3.attn.c_proj.bias
transformer.h.3.ln_2.weight
transformer.h.3.ln_2.bias
transformer.h.3.mlp.c_fc.weight
transformer.h.3.mlp.c_fc.bias
transformer.h.3.mlp.c_proj.weight
transformer.h.3.mlp.c_proj.bias
transformer.h.4.ln_1.weight
transformer.h.4.ln_1.bias
transformer.h.4.attn.bias
transformer.h.4.attn.masked_bias
transformer.h.4.attn.c_attn.weight
transformer.h.4.attn.c_attn.bias
transformer.h.4.attn.c_proj.weight
transformer.h.4.attn.c_proj.bias
transformer.h.4.ln_2.weight
transformer.h.4.ln_2.bias
transformer.h.4.mlp.c_fc.weight
transformer.h.4.mlp.c_fc.bias
transformer.h.4.mlp.c_proj.weight
transformer.h.4.mlp.c_proj.bias
transformer.h.5.ln_1.weight
transformer.h.5.ln_1.bias
transformer.h.5.attn.bias
transformer.h.5.attn.masked_bias
transformer.h.5.attn.c_attn.weight
transformer.h.5.attn.c_attn.bias
transformer.h.5.attn.c_proj.weight
transformer.h.5.attn.c_proj.bias
transformer.h.5.ln_2.weight
transformer.h.5.ln_2.bias
transformer.h.5.mlp.c_fc.weight
transformer.h.5.mlp.c_fc.bias
transformer.h.5.mlp.c_proj.weight
transformer.h.5.mlp.c_proj.bias
transformer.h.6.ln_1.weight
transformer.h.6.ln_1.bias
transformer.h.6.attn.bias
transformer.h.6.attn.masked_bias
transformer.h.6.attn.c_attn.weight
transformer.h.6.attn.c_attn.bias
transformer.h.6.attn.c_proj.weight
transformer.h.6.attn.c_proj.bias
transformer.h.6.ln_2.weight
transformer.h.6.ln_2.bias
transformer.h.6.mlp.c_fc.weight
transformer.h.6.mlp.c_fc.bias
transformer.h.6.mlp.c_proj.weight
transformer.h.6.mlp.c_proj.bias
transformer.h.7.ln_1.weight
transformer.h.7.ln_1.bias
transformer.h.7.attn.bias
transformer.h.7.attn.masked_bias
transformer.h.7.attn.c_attn.weight
transformer.h.7.attn.c_attn.bias
transformer.h.7.attn.c_proj.weight
transformer.h.7.attn.c_proj.bias
transformer.h.7.ln_2.weight
transformer.h.7.ln_2.bias
transformer.h.7.mlp.c_fc.weight
transformer.h.7.mlp.c_fc.bias
transformer.h.7.mlp.c_proj.weight
transformer.h.7.mlp.c_proj.bias
transformer.h.8.ln_1.weight
transformer.h.8.ln_1.bias
transformer.h.8.attn.bias
transformer.h.8.attn.masked_bias
transformer.h.8.attn.c_attn.weight
transformer.h.8.attn.c_attn.bias
transformer.h.8.attn.c_proj.weight
transformer.h.8.attn.c_proj.bias
transformer.h.8.ln_2.weight
transformer.h.8.ln_2.bias
transformer.h.8.mlp.c_fc.weight
transformer.h.8.mlp.c_fc.bias
transformer.h.8.mlp.c_proj.weight
transformer.h.8.mlp.c_proj.bias
transformer.h.9.ln_1.weight
transformer.h.9.ln_1.bias
transformer.h.9.attn.bias
transformer.h.9.attn.masked_bias
transformer.h.9.attn.c_attn.weight
transformer.h.9.attn.c_attn.bias
transformer.h.9.attn.c_proj.weight
transformer.h.9.attn.c_proj.bias
transformer.h.9.ln_2.weight
transformer.h.9.ln_2.bias
transformer.h.9.mlp.c_fc.weight
transformer.h.9.mlp.c_fc.bias
transformer.h.9.mlp.c_proj.weight
transformer.h.9.mlp.c_proj.bias
transformer.h.10.ln_1.weight
transformer.h.10.ln_1.bias
transformer.h.10.attn.bias
transformer.h.10.attn.masked_bias
transformer.h.10.attn.c_attn.weight
transformer.h.10.attn.c_attn.bias
transformer.h.10.attn.c_proj.weight
transformer.h.10.attn.c_proj.bias
transformer.h.10.ln_2.weight
transformer.h.10.ln_2.bias
transformer.h.10.mlp.c_fc.weight
transformer.h.10.mlp.c_fc.bias
transformer.h.10.mlp.c_proj.weight
transformer.h.10.mlp.c_proj.bias
transformer.h.11.ln_1.weight
transformer.h.11.ln_1.bias
transformer.h.11.attn.bias
transformer.h.11.attn.masked_bias
transformer.h.11.attn.c_attn.weight
transformer.h.11.attn.c_attn.bias
transformer.h.11.attn.c_proj.weight
transformer.h.11.attn.c_proj.bias
transformer.h.11.ln_2.weight
transformer.h.11.ln_2.bias
transformer.h.11.mlp.c_fc.weight
transformer.h.11.mlp.c_fc.bias
transformer.h.11.mlp.c_proj.weight
transformer.h.11.mlp.c_proj.bias
transformer.ln_f.weight
transformer.ln_f.bias
lm_head.weight
ankit-db commented 1 year ago

@wsxiaoys thank you so much!! I was literally halfway through writing this script myself, but yours looks right and I'll just use that. Appreciate you chiming in here. Were you planning on making a PR against this repo with that script? I'm happy to drive it if you're not going to.

@byshiue thanks for the response here - are you initializing that one as GPT2LMHeadModel?

byshiue commented 1 year ago

> @wsxiaoys thank you so much!! I was literally halfway through writing this script myself, but yours looks right and I'll just use that. Appreciate you chiming in here. Were you planning on making a PR against this repo with that script? I'm happy to drive it if you're not going to.
>
> @byshiue thanks for the response here - are you initializing that one as GPT2LMHeadModel?

The converter uses GPTNeoXForCausalLM.
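
For anyone following along, a quick, hedged way to see which architecture a checkpoint declares (model ID as in the log above):

```python
# Check what the checkpoint's config declares; AutoModelForCausalLM picks the
# model class from this model_type.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("rinna/japanese-gpt-neox-small")
print(config.model_type)      # "gpt_neox", per the warning in the log above
print(config.architectures)   # typically ["GPTNeoXForCausalLM"] for this checkpoint
```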

wsxiaoys commented 1 year ago

> @wsxiaoys thank you so much!! I was literally halfway through writing this script myself, but yours looks right and I'll just use that. Appreciate you chiming in here.

Glad it works for you

> Were you planning on making a PR against this repo with that script? I'm happy to drive it if you're not going to.

Nope. That’ll be great :)