NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

RuntimeError "Unexpected key" when running checkpoint_converters script convert_got_nemo_to_mcore.py #9626

Closed: renweizhukov closed this issue 3 weeks ago

renweizhukov commented 2 months ago

Describe the bug

RuntimeError "Unexpected key" when running checkpoint_converters script convert_got_nemo_to_mcore.py

Steps/Code to reproduce bug

Follow the instructions given in https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/dpo.html to convert a GPT-2B checkpoint to a Megatron-Core checkpoint.

  1. Download the 2B checkpoint.
     wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo
  2. Extract the NeMo file to a folder.
     mkdir model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C model_checkpoint
  3. Run the script to convert the old NeMo checkpoint to a Megatron-Core checkpoint. (The script originally linked in the docs, convert_nemo_gpt_to_mcore.py, has been replaced by convert_gpt_nemo_to_mcore.py.)
     python convert_gpt_nemo_to_mcore.py --input_name_or_path ./model_checkpoint --output_path ./mcore_gpt.nemo
  4. Hit the following RuntimeError.
[NeMo W 2024-06-29 08:37:13 megatron_gpt_model:327] megatron_amp_O2 is enabled but transformer-engine is not.
Traceback (most recent call last):
  File "/workspace/src/3rdparty/NeMo/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py", line 323, in <module>
    convert(
  File "/workspace/src/3rdparty/NeMo/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py", line 244, in convert
    mcore_model = load_model(mcore_model, mcore_state_dict, ignore_if_missing=ignore_if_missing)
  File "/workspace/src/3rdparty/NeMo/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py", line 172, in load_model
    raise RuntimeError(f"Unexpected key: {name} not in state_dict but in model.")
RuntimeError: Unexpected key: model.module.decoder.layers.0.input_layernorm.weight not in state_dict but in model.
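
For context, the check that raises here is a strict comparison between the model's parameter names and the keys of the converted state dict; a minimal sketch of the failing path in load_model (assumed shape, not the actual script source):

```python
import torch.nn as nn

def load_model(model: nn.Module, state_dict: dict, ignore_if_missing=()):
    # Sketch: every parameter the model declares must have a matching key in
    # the converted state_dict, so a renamed key trips this check.
    for name, _ in model.named_parameters():
        if name not in state_dict and name not in ignore_if_missing:
            raise RuntimeError(f"Unexpected key: {name} not in state_dict but in model.")
    model.load_state_dict(state_dict)
    return model
```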

Expected behavior

The script should write the converted checkpoint to the given output_path.

Environment overview

GPU model: NVIDIA A100-SXM4-80GB

yaoyu-33 commented 1 month ago

Hi, to follow up here: I think we have a few different versions of NeMo checkpoints at the moment. The current script works for more recent NeMo checkpoints, but maybe not for the 2B one.

renweizhukov commented 1 month ago

The error is not caused by the input NeMo checkpoint. It is caused by this change in the output mcore checkpoint: these two sets of weights and biases have been renamed according to the module_name_rewrite_list given in https://github.com/NVIDIA/Megatron-LM/blob/e33c8f78a35765d5aa37475a144da60e8a2349d1/megatron/core/inference/gpt/state_dict_hooks.py#L116-L119
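
To illustrate, a pre-load hook of this kind rewrites state-dict keys by substring substitution; a minimal sketch (the rename pairs below are illustrative, the authoritative list is module_name_rewrite_list in the Megatron-LM file linked above):

```python
# Illustrative rename pairs; see the linked state_dict_hooks.py for the
# authoritative module_name_rewrite_list.
module_name_rewrite_list = [
    ("input_layernorm.weight", "self_attention.linear_qkv.layer_norm_weight"),
    ("input_layernorm.bias", "self_attention.linear_qkv.layer_norm_bias"),
]

def rewrite_state_dict_keys(state_dict):
    """Return a copy of state_dict with old submodule names rewritten."""
    rewritten = {}
    for key, tensor in state_dict.items():
        for old_name, new_name in module_name_rewrite_list:
            if old_name in key:
                key = key.replace(old_name, new_name)
        rewritten[key] = tensor
    return rewritten
```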

renweizhukov commented 1 month ago

@yaoyu-33 Please let me know if you need more info about this issue or the pull request. Thanks!

yaoyu-33 commented 1 month ago

Hi, sorry for the delay. Yes, it makes sense now. I added a comment in your PR. Can you sign off your commits with git commit -sm "commit msg"?

renweizhukov commented 1 month ago

@yaoyu-33 No problem. Thank you for the response! I addressed your comment in my PR. I signed my PR and rebased it onto the latest main.

renweizhukov commented 1 month ago

@yaoyu-33 Gentle ping. Please let me know if you have any other comment. Thanks!

yaoyu-33 commented 1 month ago

Hi @renweizhukov, we should probably change the others then. We try to keep the core args the same (using underscores) across the checkpoint_converters folder.

After this change, I think it's good to go.

renweizhukov commented 1 month ago

@yaoyu-33 Makes sense. I have changed the hyphens to underscores for the two other command-line options.
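
For reference, underscore-style flags in an argparse-based converter look roughly like this (a sketch; only --input_name_or_path and --output_path appear in the repro command above, anything else is an assumption):

```python
from argparse import ArgumentParser

# Sketch of underscore-style converter flags, matching the convention used
# across the checkpoint_converters folder.
parser = ArgumentParser(description="Convert a legacy NeMo GPT checkpoint to Megatron-Core.")
parser.add_argument("--input_name_or_path", type=str, required=True,
                    help="Folder containing the extracted .nemo checkpoint.")
parser.add_argument("--output_path", type=str, required=True,
                    help="Path of the converted mcore .nemo checkpoint to write.")
args = parser.parse_args()
```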

renweizhukov commented 1 month ago

@yaoyu-33 I have made the change per your suggestion. Could you please take a look? Thanks!

yaoyu-33 commented 1 month ago

> @yaoyu-33 I have made the change per your suggestion. Could you please take a look? Thanks!

will get it merged today

renweizhukov commented 3 weeks ago

@yaoyu-33 Just wondering if the pull request has been merged yet.

yaoyu-33 commented 3 weeks ago

There were some CI issues last week, but it's finally merged.

renweizhukov commented 3 weeks ago

Great. Thank you!