bytedance / lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

[Inference] Support GPT-J-6B #245

Open Leezekun opened 2 years ago

Leezekun commented 2 years ago

Thanks a lot for your great work. I want to use LightSeq to speed up the inference of a large transformer model, GPT-J-6B, which has been made publicly available: huggingface/transformers#13022. It is a GPT-2-like causal language model, but much larger than GPT-2.

I saw there is an example of exporting a GPT-2 model for LightSeq inference, so I wonder whether LightSeq can also support GPT-J-6B inference.

I tried to modify the parameter set in lightseq/examples/inference/python/export/hf_gpt2_export.py from the parameters of GPT-2:

enc_layer_mapping_dict = OrderedDict(
    {
        "multihead_norm_scale": "ln_1 weight",
        "multihead_norm_bias": "ln_1 bias",
        # GPT2's Conv1D don't need transpose
        # https://github.com/huggingface/transformers/blob/9ec0f01b6c3aff4636869aee735859fb6f89aa98/src/transformers/modeling_utils.py#L1400
        "multihead_project_kernel_qkv": "attn c_attn weight",
        "multihead_project_bias_qkv": "attn c_attn bias",
        "multihead_project_kernel_output": "attn c_proj weight",
        "multihead_project_bias_output": "attn c_proj bias",
        "ffn_norm_scale": "ln_2 weight",
        "ffn_norm_bias": "ln_2 bias",
        "ffn_first_kernel": "mlp c_fc weight",
        "ffn_first_bias": "mlp c_fc bias",
        "ffn_second_kernel": "mlp c_proj weight",
        "ffn_second_bias": "mlp c_proj bias",
    }
)

src_emb_mapping_dict = OrderedDict(
    {
        "norm_scale": "ln_f weight",
        "norm_bias": "ln_f bias",
        "token_embedding": "wte",
        # manually process position_embedding to customize for max_step
        # "position_embedding": "wpe",
    }
)

to the parameters of GPT-J-6B:

enc_layer_mapping_dict = OrderedDict(
    {
        "multihead_norm_scale": "ln_1 weight",
        "multihead_norm_bias": "ln_1 bias",
        "multihead_project_k": "attn k_proj weight",
        "multihead_project_v": "attn v_proj weight",
        "multihead_project_q": "attn q_proj weight",
        "multihead_project_output": "attn out_proj weight",
        "ffn_in_scale": "mlp fc_in weight",
        "ffn_in_bias": "mlp fc_in bias",
        "ffn_out_scale": "mlp fc_out weight",
        "ffn_out_bias": "mlp fc_out bias"
    }
)

src_emb_mapping_dict = OrderedDict(
    {
        "norm_scale": "ln_f weight",
        "norm_bias": "ln_f bias",
        "lm_head_scale": "lm_head weight",
        "lm_head_bias": "lm_head bias",
        "token_embedding": "wte",
        # manually process position_embedding to customize for max_step
        # "position_embedding": "wpe",
    }
)

However, there are still many other settings that are different. For example, there are no position embedding parameters in the GPT-J-6B pretrained model. So I don't know how to convert the GPT-J-6B model to hdf5 format and speed up its inference.

Do you know how to solve the problem?

Thanks again!

neopro12 commented 2 years ago

For language model scoring like perplexity (ppl), it will be OK. For generation, there may be problems caused by running out of GPU memory (OOM). You can fill in zeros for the position embedding.
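For example, a minimal sketch of zero-filling during export might look like this (the dataset path "src_embedding/position_embedding" and the flattened fp16 layout are assumptions based on the GPT-2 export, not the exact LightSeq code; match whatever convention the rest of your export uses):

import h5py
import numpy as np

MAX_STEP = 1024     # maximum sequence length you plan to support
HIDDEN_SIZE = 4096  # GPT-J-6B hidden size (n_embd)

# write an all-zero position embedding into the exported HDF5 file
with h5py.File("lightseq_gptj_6b_fp16.hdf5", "a") as f:
    f.create_dataset(
        "src_embedding/position_embedding",
        data=np.zeros(MAX_STEP * HIDDEN_SIZE, dtype=np.float16),
    )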

Leezekun commented 2 years ago

Thank you for your reply! There is no OOM problem. I have filled in zeros for the position embedding and converted the model to hdf5 format successfully. But when I tried to load the GPT-J-6B model in hdf5 format:

ls_model = lsi.Gpt("lightseq_gptj_6b_fp16.hdf5", max_batch_size=16)

this error occurred:

RuntimeError: encoder_stack/0/ffn_first_kernel Not Found in HDF5 File
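One way to double-check which datasets actually made it into the file is to list the HDF5 keys (an illustrative snippet, assuming h5py is installed):

import h5py

# print the full path of every group and dataset in the file, to compare
# against the names the loader expects (e.g. "encoder_stack/0/ffn_first_kernel")
with h5py.File("lightseq_gptj_6b_fp16.hdf5", "r") as f:
    f.visit(print)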

I know I should change the parameter names in the gpt.proto, gpt_weight.cc and gpt_pb2.py files (or create new gptj.proto, gptj_weight.cc and gptj_pb2.py files), but there are too many places that need to be changed. So I am not sure how to change them, or whether there are other files that need to be changed.

So I wonder if you have a plan to add code to support GPT-J-6B inference? I think that would be very helpful.

misska1 commented 2 years ago

I am looking forward to your support for GPT-J-6B too!😄

Taka152 commented 2 years ago

@Leezekun I think only changes to the export script are needed to support GPT-J if the model structures are the same and only the variable names are different. BTW, can you show the full error log? This error may not be the root cause.

Leezekun commented 2 years ago

@Leezekun I think only changes to the export script are needed to support GPT-J if the model structures are the same and only the variable names are different. BTW, can you show the full error log?

I think their model structures are slightly different. So I tried to modify the parameter set in lightseq/examples/inference/python/export/hf_gpt2_export.py from the parameters of GPT-2:

enc_layer_mapping_dict = OrderedDict(
    {
        "multihead_norm_scale": "ln_1 weight",
        "multihead_norm_bias": "ln_1 bias",
        # GPT2's Conv1D don't need transpose
        # https://github.com/huggingface/transformers/blob/9ec0f01b6c3aff4636869aee735859fb6f89aa98/src/transformers/modeling_utils.py#L1400
        "multihead_project_kernel_qkv": "attn c_attn weight",
        "multihead_project_bias_qkv": "attn c_attn bias",
        "multihead_project_kernel_output": "attn c_proj weight",
        "multihead_project_bias_output": "attn c_proj bias",
        "ffn_norm_scale": "ln_2 weight",
        "ffn_norm_bias": "ln_2 bias",
        "ffn_first_kernel": "mlp c_fc weight",
        "ffn_first_bias": "mlp c_fc bias",
        "ffn_second_kernel": "mlp c_proj weight",
        "ffn_second_bias": "mlp c_proj bias",
    }
)

src_emb_mapping_dict = OrderedDict(
    {
        "norm_scale": "ln_f weight",
        "norm_bias": "ln_f bias",
        "token_embedding": "wte",
        # manually process position_embedding to customize for max_step
        # "position_embedding": "wpe",
    }
)

to the parameters of GPT-J-6B:

enc_layer_mapping_dict = OrderedDict(
    {
        "multihead_norm_scale": "ln_1 weight",
        "multihead_norm_bias": "ln_1 bias",
        "multihead_project_k": "attn k_proj weight",
        "multihead_project_v": "attn v_proj weight",
        "multihead_project_q": "attn q_proj weight",
        "multihead_project_output": "attn out_proj weight",
        "ffn_in_scale": "mlp fc_in weight",
        "ffn_in_bias": "mlp fc_in bias",
        "ffn_out_scale": "mlp fc_out weight",
        "ffn_out_bias": "mlp fc_out bias"
    }
)

src_emb_mapping_dict = OrderedDict(
    {
        "norm_scale": "ln_f weight",
        "norm_bias": "ln_f bias",
        "lm_head_scale": "lm_head weight",
        "lm_head_bias": "lm_head bias",
        "token_embedding": "wte",
        # manually process position_embedding to customize for max_step
        # "position_embedding": "wpe",
    }
)

And I have filled in zeros for the position embedding. The model was successfully saved in hdf5 format as lightseq_gptj_6b_fp16.hdf5. But when I tried to run python test/ls_gptj.py, I got the following error message:

initializing gpt tokenizer...
lightseq tokenizer pad token id: 50257
huggingface tokenizer pad token id: 50256
creating lightseq model...
Parsing hdf5: /mnt/nvme/zekun/cached_models/lightseq_gptj_6b_fp16.hdf5
Traceback (most recent call last):
  File "test/ls_gptj.py", line 119, in <module>
    main()
  File "test/ls_gptj.py", line 79, in main
    ls_model = lsi.Gpt("/mnt/nvme/zekun/cached_models/lightseq_gptj_6b_fp16.hdf5", max_batch_size=16)
RuntimeError: encoder_stack/0/ffn_first_kernel Not Found in HDF5 File

Apparently, encoder_stack/0/ffn_first_kernel is a parameter name for GPT-2 but not for GPT-J. So I think I should change the parameter names in the gpt.proto, gpt_weight.cc and gpt_pb2.py files (or create new gptj.proto, gptj_weight.cc and gptj_pb2.py files), but there are too many places that need to be changed. So I am not sure how to change them, or whether there are other files that need to be changed.

Can you tell me if I am wrong? If so, how should I change the code? Or do you plan to add support for GPT-J-6B inference?

Thanks!

Taka152 commented 2 years ago

@Leezekun Passing zeros to the position embedding is acceptable; the error comes from your deletion of the keys.

The keys in mapping_dict are fixed and shouldn't be deleted. You can modify the value of each key; the values are keywords matched against the PyTorch tensor names.

For example, you deleted the ffn_first_kernel key, which should map to the first MLP weight here; that is what causes the Not Found error.

After glancing at the GPT-J code, I think modifying the values of mapping_dict should be enough to support GPT-J. I can help you with this if you need.
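Roughly, keeping the original fixed keys and changing only the values might look like the sketch below. This is untested and not official GPT-J support; the GPT-J-specific caveats (separate q/k/v projections, missing attention biases, a single per-block LayerNorm, nn.Linear vs Conv1D layout) are flagged in the comments.

from collections import OrderedDict

# Untested sketch: keep LightSeq's fixed keys, change only the value keywords.
enc_layer_mapping_dict = OrderedDict(
    {
        # GPT-J has a single LayerNorm per block (attention and MLP both read
        # the ln_1 output in parallel), so both norm slots point at ln_1.
        # Note: LightSeq's GPT model applies attention and FFN sequentially,
        # so this mapping alone will not reproduce GPT-J's parallel residual.
        "multihead_norm_scale": "ln_1 weight",
        "multihead_norm_bias": "ln_1 bias",
        # GPT-J uses separate q/k/v nn.Linear layers instead of a fused Conv1D
        # c_attn. The "&&" join assumes the export helper concatenates multiple
        # matches (as some other LightSeq export scripts do); otherwise the
        # three weights must be concatenated by hand. nn.Linear weights are
        # also laid out transposed relative to GPT-2's Conv1D, so a transpose
        # is likely needed as well.
        "multihead_project_kernel_qkv": "attn q_proj weight&&attn k_proj weight&&attn v_proj weight",
        "multihead_project_kernel_output": "attn out_proj weight",
        # GPT-J's attention projections have no biases, so
        # multihead_project_bias_qkv and multihead_project_bias_output cannot
        # be matched by keyword; they would have to be zero-filled in the
        # export code, the same way as the position embedding.
        "ffn_norm_scale": "ln_1 weight",
        "ffn_norm_bias": "ln_1 bias",
        "ffn_first_kernel": "mlp fc_in weight",
        "ffn_first_bias": "mlp fc_in bias",
        "ffn_second_kernel": "mlp fc_out weight",
        "ffn_second_bias": "mlp fc_out bias",
    }
)

src_emb_mapping_dict = OrderedDict(
    {
        "norm_scale": "ln_f weight",
        "norm_bias": "ln_f bias",
        "token_embedding": "wte",
        # position_embedding is still filled manually (with zeros), since
        # GPT-J uses rotary position embeddings and has no wpe table; the
        # rotary attention itself is not covered by this mapping.
    }
)

Whether LightSeq's GPT kernels can actually match GPT-J's rotary attention and parallel block structure is a separate question; this sketch only addresses the name mapping.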

You can join our Lark group to reach us instantly.

Leezekun commented 2 years ago

@Taka152 I would appreciate it if you could help me with that! Thanks!

Leezekun commented 2 years ago

@Taka152 Hi, can you help me with that? Thanks so much! Also, I don't know how to join your organization to reach you. It said I needed an invite code.

Taka152 commented 2 years ago

@Leezekun Are you able to join the LightSeq open group in the doc using Lark? [image]