NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment
Apache License 2.0

cannot load reward model from SFT model because of missing keys #137

Open DZ9 opened 4 months ago

DZ9 commented 4 months ago

I converted a LLaMA model to NeMo, with the model directory laid out as in the attached screenshot. When I tried to load it to train a reward model, I got a missing-keys error. I load it with the default config and set load_base_model_only=True; the full loading code is as below:

ptl_model = load_from_nemo(
    reward_model_cls,
    cfg.model,
    trainer,
    strict=True,
    load_base_model_only=True,
    restore_path=cfg.pretrained_checkpoint.restore_from_path,
)

And then I got the error below. Any advice on how to load a pretrained non-reward model to train it as a reward model in NeMo?

Error executing job with overrides: []
Traceback (most recent call last):
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 206, in load_sharded_object
    loaded_obj = torch.load(load_path)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/checkpoint/binary/train_package/train_reward_model.py", line 68, in main
    ptl_model = load_from_nemo(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 96, in load_from_nemo
    model = cls.restore_from(
  File "/checkpoint/binary/train_package/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/checkpoint/binary/train_package/nemo/core/classes/modelPT.py", line 450, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/checkpoint/binary/train_package/nemo_aligner/utils/utils.py", line 52, in restore_from
    output = super().restore_from(*args, **kwargs)
  File "/checkpoint/binary/train_package/nemo/collections/nlp/parts/nlp_overrides.py", line 1123, in restore_from
    checkpoint = dist_checkpointing.load(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 120, in load
    sharded_objects, sharded_state_dict = load_sharded_objects(
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 221, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 180, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/dict_utils.py", line 184, in dict_list_map_inplace
    return f(x)
  File "/checkpoint/binary/train_package/megatron/core/dist_checkpointing/serialization.py", line 218, in load_sharded_object
    raise CheckpointingException(err_msg) from e
megatron.core.dist_checkpointing.core.CheckpointingException: Object shard /mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights/model.rm_head._extra_state/shard_0_1.pt not found
DZ9 commented 4 months ago

Can anybody please help with this?

odelalleau commented 4 months ago

Did you try with strict=False?
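
For reference, something like this (reusing the call from the issue description; a sketch only, not verified against a specific NeMo-Aligner version):

ptl_model = load_from_nemo(
    reward_model_cls,
    cfg.model,
    trainer,
    strict=False,  # tolerate keys (such as the RM head) that are missing from the checkpoint
    load_base_model_only=True,
    restore_path=cfg.pretrained_checkpoint.restore_from_path,
)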

gshennvm commented 4 months ago

Do you know if this is an mcore-based model? And was it SFTed with Aligner?

You can tell whether it's an mcore-based model by looking at the model_weights directory: it should contain common.pt and metadata.json.
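
For example, a quick check along these lines (hypothetical snippet; the path is taken from the traceback above):

import os

weights_dir = "/mnt/workspace/models/LLaMA-2-7B-32K-Nemo-Official/model_weights"
# An mcore-based (distributed) checkpoint keeps these two files next to the shard directories.
is_mcore = all(
    os.path.exists(os.path.join(weights_dir, name))
    for name in ("common.pt", "metadata.json")
)
print("mcore-based checkpoint:", is_mcore)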

DZ9 commented 4 months ago

Did you try with strict=False?

Yes, it didn't work either.

DZ9 commented 4 months ago

Do you know if this is an mcore-based model? And was it SFTed with Aligner?

You can tell whether it's an mcore-based model by looking at the model_weights directory: it should contain common.pt and metadata.json.

Yes, it is an mcore-based model (see attached screenshot).

DZ9 commented 4 months ago

I manually deleted all rm_head-related keys during restore, and it now works fine. But I think this is a bug introduced by a change in Megatron.
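
Roughly, that kind of workaround looks like the sketch below (the nested-dict layout and the exact place to hook this in are assumptions, not NeMo-Aligner API):

# Rough illustration only: recursively drop rm_head entries from a (possibly nested)
# state dict before it is handed to dist_checkpointing.load(), so no attempt is made
# to read model.rm_head._extra_state shards that don't exist in the SFT checkpoint.
def drop_rm_head_entries(state_dict):
    if isinstance(state_dict, dict):
        return {
            key: drop_rm_head_entries(value)
            for key, value in state_dict.items()
            if "rm_head" not in key
        }
    return state_dict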

gshennvm commented 4 months ago

I manually deleted all rm_head-related keys during restore, and it now works fine. But I think this is a bug introduced by a change in Megatron.

Ah, okay! That's good to know. Can you elaborate on the change in Megatron? Was your model SFTed in a previous container?

odelalleau commented 4 months ago

To elaborate, it'd be helpful if you could share the exact steps you used when you said "I converted a llama model to nemo", so that we can reproduce the issue. Which container did you use and which commands did you run?

berserkr commented 2 weeks ago

Running into a similar issue here - any leads? Won't removing the rm_head damage the model itself?

odelalleau commented 2 weeks ago

Running into a similar issue here - any leads? Won't removing the rm_head damage the model itself?

It would help to share the exact steps (including which container was used and which version of NeMo-Aligner) so that we can reproduce this issue.

I'm not actually sure what you mean by "removing the RM head" either -- obviously without the head you wouldn't be able to use the model, but if it's just a temporary hack to skip trying to restore the head during loading of the model, it shouldn't matter (because the RM head doesn't exist in the SFT checkpoint anyway, it's supposed to get initialized randomly when initializing from an SFT model).
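
To illustrate that last point with plain PyTorch (not NeMo-Aligner code): loading an SFT-style state dict into a model that has an extra head, with strict=False, simply leaves the head at its random initialization and reports it under missing_keys.

import torch.nn as nn

# Stand-in for the SFT backbone (no reward-model head).
sft_model = nn.Sequential()
sft_model.add_module("backbone", nn.Linear(8, 8))

# Reward model: same backbone plus a freshly (randomly) initialized rm_head.
reward_model = nn.Sequential()
reward_model.add_module("backbone", nn.Linear(8, 8))
reward_model.add_module("rm_head", nn.Linear(8, 1))

# strict=False: backbone weights are restored, rm_head keeps its random init.
result = reward_model.load_state_dict(sft_model.state_dict(), strict=False)
print(result.missing_keys)  # ['rm_head.weight', 'rm_head.bias']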