epfLLM / Megatron-LLM

distributed trainer for LLMs

Error during merge of sharded checkpoint: 'TransformerLanguageModel' object has no attribute 'lm_head' #14

Closed: andreaskoepf closed this issue 1 year ago

andreaskoepf commented 1 year ago

While merging a sharded llama2 7b tp2-pp2 checkpoint, the exception AttributeError: 'TransformerLanguageModel' object has no attribute 'lm_head' is thrown in tools/checkpoint_loader_megatron.py (see the traceback below).

Traceback

Traceback (most recent call last):
  File "/root/koepf/epfl-megatron/tools/checkpoint_util.py", line 152, in <module>
    main()
  File "/root/koepf/epfl-megatron/tools/checkpoint_util.py", line 145, in main
    loader.load_checkpoint(queue, args)
  File "/root/koepf/epfl-megatron/tools/checkpoint_loader_megatron.py", line 319, in load_checkpoint
    _load_checkpoint(queue, args)
  File "/root/koepf/epfl-megatron/tools/checkpoint_loader_megatron.py", line 221, in _load_checkpoint
    queue_put("lm_head", {"lm_head": torch.cat([models[tp_rank].language_model.lm_head.data
  File "/root/koepf/epfl-megatron/tools/checkpoint_loader_megatron.py", line 221, in <listcomp>
    queue_put("lm_head", {"lm_head": torch.cat([models[tp_rank].language_model.lm_head.data
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1630, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'TransformerLanguageModel' object has no attribute 'lm_head'
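
The loader assumes every rank's language_model exposes a separate lm_head tensor to concatenate across tensor-parallel shards. As a minimal sketch (not the actual fix in 415c1dc), the access could be guarded so the merge is skipped when the attribute is absent, for example because the output weights are tied to the word embeddings; the names gather_lm_head, models, and queue_put below mirror the traceback and are otherwise hypothetical:

```python
import torch

def gather_lm_head(models, queue_put):
    """Sketch of a defensive lm_head merge for checkpoint_loader_megatron.py."""
    language_model = models[0].language_model
    if hasattr(language_model, "lm_head"):
        # Untied output head: concatenate the tensor-parallel shards
        # along the vocab dimension, as the loader already does.
        lm_head = torch.cat(
            [m.language_model.lm_head.data for m in models], dim=0)
        queue_put("lm_head", {"lm_head": lm_head})
    else:
        # No separate lm_head on this model (e.g. tied embeddings):
        # skip the merge instead of raising AttributeError.
        pass
```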

Command used

python tools/checkpoint_util.py --target_tensor_parallel_size 1 --target_pipeline_parallel_size 1 --load_dir /root/koepf/megatron-data/checkpoints/llama2-7b-tp2-pp2-trained/ --save_dir /root/koepf/megatron-data/llama2-7b-out --model_type llama2 --bf16

AleHD commented 1 year ago

Fixed with 415c1dc.