NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

RuntimeError "Unexpected key" when running checkpoint_converters script convert_got_nemo_to_mcore.py #9626

Closed: renweizhukov closed this issue 3 weeks ago

renweizhukov commented 2 months ago

Describe the bug

RuntimeError "Unexpected key" when running checkpoint_converters script convert_got_nemo_to_mcore.py

Steps/Code to reproduce bug

Follow the instructions given in https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/dpo.html to convert a GPT-2B checkpoint to a Megatron-Core checkpoint.

  1. Download the 2B checkpoint.
     wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo
  2. Extract the NeMo file to a folder.
     mkdir model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C model_checkpoint
  3. Run the script to convert the old NeMo checkpoint to a Megatron-Core checkpoint. (The script originally linked in the docs, convert_nemo_gpt_to_mcore.py, has been replaced by convert_gpt_nemo_to_mcore.py.)
     python convert_gpt_nemo_to_mcore.py --input_name_or_path ./model_checkpoint --output_path ./mcore_gpt.nemo
  4. Hit the following RuntimeError.
[NeMo W 2024-06-29 08:37:13 megatron_gpt_model:327] megatron_amp_O2 is enabled but transformer-engine is not.
Traceback (most recent call last):
  File "/workspace/src/3rdparty/NeMo/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py", line 323, in <module>
    convert(
  File "/workspace/src/3rdparty/NeMo/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py", line 244, in convert
    mcore_model = load_model(mcore_model, mcore_state_dict, ignore_if_missing=ignore_if_missing)
  File "/workspace/src/3rdparty/NeMo/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py", line 172, in load_model
    raise RuntimeError(f"Unexpected key: {name} not in state_dict but in model.")
RuntimeError: Unexpected key: model.module.decoder.layers.0.input_layernorm.weight not in state_dict but in model.
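
For context, the check that raises here is a strict comparison between the model's parameter names and the keys of the converted state dict; a minimal sketch of the failing path in load_model (assumed shape, not the actual script source):

```python
import torch.nn as nn

def load_model(model: nn.Module, state_dict: dict, ignore_if_missing=()):
    # Sketch: every parameter the model declares must have a matching key in
    # the converted state_dict, so a renamed key trips this check.
    for name, _ in model.named_parameters():
        if name not in state_dict and name not in ignore_if_missing:
            raise RuntimeError(f"Unexpected key: {name} not in state_dict but in model.")
    model.load_state_dict(state_dict)
    return model
```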

Expected behavior

The script should write the converted checkpoint to the given output_path.

Environment overview

GPU model: NVIDIA A100-SXM4-80GB

yaoyu-33 commented 1 month ago

Hi, to follow up here: I think we have a few different versions of NeMo checkpoints at the moment. The current script works for more recent NeMo checkpoints, but maybe not for the 2B one.

renweizhukov commented 1 month ago

The error is not caused by the input NeMo checkpoint. It is caused by this change in the output mcore checkpoint: these two sets of weights and biases have been renamed according to the module_name_rewrite_list given in https://github.com/NVIDIA/Megatron-LM/blob/e33c8f78a35765d5aa37475a144da60e8a2349d1/megatron/core/inference/gpt/state_dict_hooks.py#L116-L119
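
To illustrate, a pre-load hook of this kind rewrites state-dict keys by substring substitution; a minimal sketch (the rename pairs below are illustrative, the authoritative list is module_name_rewrite_list in the Megatron-LM file linked above):

```python
# Illustrative rename pairs; see the linked state_dict_hooks.py for the
# authoritative module_name_rewrite_list.
module_name_rewrite_list = [
    ("input_layernorm.weight", "self_attention.linear_qkv.layer_norm_weight"),
    ("input_layernorm.bias", "self_attention.linear_qkv.layer_norm_bias"),
]

def rewrite_state_dict_keys(state_dict):
    """Return a copy of state_dict with old submodule names rewritten."""
    rewritten = {}
    for key, tensor in state_dict.items():
        for old_name, new_name in module_name_rewrite_list:
            if old_name in key:
                key = key.replace(old_name, new_name)
        rewritten[key] = tensor
    return rewritten
```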

renweizhukov commented 1 month ago

@yaoyu-33 Please let me know if you need more info about this issue or the pull request. Thanks!

yaoyu-33 commented 1 month ago

Hi, sorry for the delay. Yes, it makes sense now. I added a comment in your PR. Can you sign off your commits with git commit -sm "commit msg"?

renweizhukov commented 1 month ago

@yaoyu-33 No problem. Thank you for the response! I addressed your comment in my PR. I signed my PR and rebased it onto the latest main.

renweizhukov commented 1 month ago

@yaoyu-33 Gentle ping. Please let me know if you have any other comment. Thanks!

yaoyu-33 commented 1 month ago

Hi @renweizhukov, we should probably change the others then. We try to keep the core args the same (using underscores) across the checkpoint_converters folder.

After this change, I think it's good to go.

renweizhukov commented 1 month ago

@yaoyu-33 Makes sense. I have changed the hyphens to underscores for the two other command-line options.
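
For reference, underscore-style flags in an argparse-based converter look roughly like this (a sketch; only --input_name_or_path and --output_path appear in the repro command above, anything else is an assumption):

```python
from argparse import ArgumentParser

# Sketch of underscore-style converter flags, matching the convention used
# across the checkpoint_converters folder.
parser = ArgumentParser(description="Convert a legacy NeMo GPT checkpoint to Megatron-Core.")
parser.add_argument("--input_name_or_path", type=str, required=True,
                    help="Folder containing the extracted .nemo checkpoint.")
parser.add_argument("--output_path", type=str, required=True,
                    help="Path of the converted mcore .nemo checkpoint to write.")
args = parser.parse_args()
```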

renweizhukov commented 1 month ago

@yaoyu-33 I have made the change per your suggestion. Could you please take a look? Thanks!

yaoyu-33 commented 1 month ago

> @yaoyu-33 I have made the change per your suggestion. Could you please take a look? Thanks!

will get it merged today

renweizhukov commented 3 weeks ago

@yaoyu-33 Just wondering if the pull request has been merged yet.

yaoyu-33 commented 3 weeks ago

There were some CI issues last week, but it's finally merged.

renweizhukov commented 3 weeks ago

Great. Thank you!