NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Can't continue training from GPT-345M checkpoint with TransformerEngine - RuntimeError: Error(s) in loading state_dict for ParallelTransformer #838

Closed: arktoswb closed this issue 4 months ago

arktoswb commented 4 months ago

Describe the bug
While running examples/pretrain_gpt.sh from the GPT-345M checkpoint, I encounter the following error:

[rank0]: RuntimeError: Error(s) in loading state_dict for ParallelTransformer:
[rank0]:    Missing key(s) in state_dict: "layers.0.self_attention.layernorm_qkv.layer_norm_weight", ...
[rank0]:    Unexpected key(s) in state_dict: "layers.0.input_norm.weight", ...

To Reproduce

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
unzip megatron_lm_345m_v0.0.zip

Run examples/pretrain_gpt.sh. The --attention-softmax-in-fp32 arg is added (the script does not work otherwise). I also tried a llama2 checkpoint and got a similar error.

However, the script successfully runs:

  1. From scratch, and it continues running from a locally created checkpoint.
  2. With --transformer-impl local from the provided GPT-345M checkpoint, but that implementation is deprecated and, per my understanding, will not work with llama models (see the sketch after this list).
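
For anyone trying the second workaround, this is roughly what the change looks like; the GPT_ARGS variable name is an assumption about how the example script collects its arguments, not a quote from the repository:

# hypothetical excerpt from examples/pretrain_gpt.sh: append the workaround
# flags to the script's existing argument list (the GPT_ARGS name is assumed)
GPT_ARGS="${GPT_ARGS} --attention-softmax-in-fp32 --transformer-impl local"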

Expected behavior
examples/pretrain_gpt.sh should run fine from the GPT-345M checkpoint on the latest release without any modifications.

Stack trace/logs
https://gist.github.com/arktoswb/7830a87d514fd53cdad17882128d5122

Environment:

arktoswb commented 4 months ago

The same error occurs inside the Docker container for PyTorch Release 23.04: https://gist.github.com/arktoswb/d5835a666e7fcf9bfa3d7ff59173299c
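
For context, a minimal sketch of how one might enter that container; the image tag follows the NGC naming scheme for PyTorch Release 23.04, and the mount path is illustrative:

# start the NGC PyTorch 23.04 container with the local Megatron-LM checkout mounted
docker run --gpus all -it --rm \
    -v /path/to/Megatron-LM:/workspace/megatron \
    -w /workspace/megatron \
    nvcr.io/nvidia/pytorch:23.04-py3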

arktoswb commented 4 months ago

Apparently, TransformerEngine is supported with model_type = 'mcore'. So, in order to continue training from the GPT-345M checkpoint (a fuller command sketch follows these steps):

  1. Convert the checkpoint: python3 tools/checkpoint/convert.py --model-type GPT --loader megatron --saver megatron
  2. Run training with --use-mcore-models
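
For reference, a minimal end-to-end sketch of those two steps; the checkpoint directories are placeholders chosen for illustration, not paths from the original report:

# 1. convert the downloaded GPT-345M checkpoint (paths are placeholders)
python3 tools/checkpoint/convert.py \
    --model-type GPT \
    --loader megatron \
    --saver megatron \
    --load-dir megatron_lm_345m_v0.0 \
    --save-dir checkpoints/gpt-345m-converted

# 2. point examples/pretrain_gpt.sh at the converted checkpoint directory
#    and add --use-mcore-models to its argument list

(A later comment in this thread notes that --saver mcore is the saver for Megatron Core models, so that may be the flag to use here instead of --saver megatron.)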

I will close this issue, but I suggest editing the example scripts and README to make this clearer.

zhaoyz1017 commented 3 months ago

Hello, thanks for suggesting convert.py, but I still have some problems. Could you please help me take a look at the issue I encountered when using it? Here is my command:

python3 tools/checkpoint/convert.py --model-type GPT --loader megatron --saver megatron --load-dir models/megatron_lm_345m_v0.0 --save-dir models/convert/gpt2 --megatron-path /home/zyz/code/Megatron-LM-core_v0.6.0

which reports:

File "/home/zyz/code/Megatron-LM-core_v0.6.0/tools/checkpoint/loader_megatron.py", line 70, in _load_checkpoint
    margs, checkpoint_args = load_args_from_checkpoint(margs, exit_on_missing_checkpoint=True)
TypeError: cannot unpack non-iterable Namespace object

Can you let me see all the code you are using here? Thanks again.

arktoswb commented 3 months ago

Yeah, there are multiple problems with that command:

  1. --loader megatron: you are loading a legacy-format Megatron model
  2. --saver megatron: you are saving a legacy-format Megatron model

A Megatron Core model is saved with --saver mcore.
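
A corrected invocation would then presumably look like the sketch below; only the --saver value changes, and all paths are copied from the command quoted above:

python3 tools/checkpoint/convert.py \
    --model-type GPT \
    --loader megatron \
    --saver mcore \
    --load-dir models/megatron_lm_345m_v0.0 \
    --save-dir models/convert/gpt2 \
    --megatron-path /home/zyz/code/Megatron-LM-core_v0.6.0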

As for the loading error, you can edit the loader code and hardcode several model parameters to make it work.

But honestly, you will have an easier experience loading the llama2 7B model, and an even better experience with NeMo - it is in better shape and also supports llama3.

zhaoyz1017 commented 3 months ago

Thank you very much for your response, which has been very helpful.

zixianwang2022 commented 6 days ago

Hi @arktoswb, does convert.py support converting a pretrained checkpoint with PP=1, TP=1 to PP>1 and/or TP>1? I want to finetune Mamba 8B from a pretrained checkpoint, but a single GPU does not have enough memory.

arktoswb commented 6 days ago

Yes, I believe it does.

I stopped working with Megatron months ago, so I am not the best person to ask this question.

zixianwang2022 commented 6 days ago

Thanks for replying! I did not find a flag to specify PP and TP for convert.py. Do you have any clues on this? Or do you know anyone who might?

arktoswb commented 6 days ago

From https://github.com/NVIDIA/Megatron-LM?tab=readme-ov-file#evaluation-and-tasks: the flags are --target-tensor-parallel-size and --target-pipeline-parallel-size.
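
A hedged sketch of such a repartitioning run, reusing the flags from the README section linked above; the model type, directories, and target parallel sizes are illustrative assumptions only:

# convert a TP=1, PP=1 checkpoint into a TP=2, PP=2 layout (paths are placeholders)
python3 tools/checkpoint/convert.py \
    --model-type GPT \
    --loader megatron \
    --saver mcore \
    --load-dir checkpoints/pretrained_tp1_pp1 \
    --save-dir checkpoints/converted_tp2_pp2 \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 2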

zixianwang2022 commented 6 days ago

Thanks! It took some effort to solve the import error, but in the end I encountered another error that is reported for all layers:

RuntimeError: Error(s) in loading state_dict for ParallelTransformer:
        Missing key(s) in state_dict: "layers.0.input_norm.weight", "layers.0.self_attention.query_key_value.weight", "layers.0.self_attention.dense.wei ...

I think the problem is with how to enable the distributed model for Mamba specifically. I opened another GitHub issue on this.

But thanks for pointing me to convert.py!

zixianwang2022 commented 6 days ago

Update: convert.py does not support Mamba at the moment, but hybrid_conversion.py does.