NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Converting megatron checkpoint to .nemo without the same environment #9516

Open dachenlian opened 1 week ago

dachenlian commented 1 week ago

I have multiple checkpoints produced by running examples/nlp/language_modeling/megatron_gpt_continue_training.py.

However, I am unable to use examples/nlp/language_modeling/megatron_ckpt_to_nemo.py to convert them to .nemo files. This is probably because the environment in which I want to do the conversion is not the same as the one used for training. Is there a way to do the conversion on CPU only, or with just one GPU?
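For reference, this is roughly what I'm hoping is possible. A minimal sketch, assuming MegatronGPTModel can be restored with a CPU-only Trainer (all paths are placeholders; I don't know whether the distributed checkpoint loader supports this):

# Hypothetical CPU-only conversion; all paths are placeholders.
from pytorch_lightning import Trainer
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy

# What I'd like: no GPU requirement for the restore + save round trip.
trainer = Trainer(devices=1, num_nodes=1, accelerator="cpu", strategy=NLPDDPStrategy())

model = MegatronGPTModel.load_from_checkpoint(
    "/path/to/checkpoint_folder/checkpoint_name",  # placeholder
    hparams_file="/path/to/hparams.yaml",          # placeholder
    trainer=trainer,
)
model.save_to("/path/to/model.nemo")  # NeMo's standard save_to API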

I have tried using two different sets of parameters:

Input:

srun -p ${PARTITION} -G 1 \
    --container-image /mnt/nemo_dev.sqsh \
    --container-mounts ${LAUNCHER_SCRIPT_PATH}:${LAUNCHER_SCRIPT_PATH},${NEMO_PATH}:${NEMO_PATH} \
    --container-writable \
    --no-container-mount-home \
    --pty bash -c \
    "python ${NEMO_PATH}/examples/nlp/language_modeling/megatron_ckpt_to_nemo.py --checkpoint_folder ${CKPT_FOLDER} --checkpoint_name ${CKPT_NAME} --nemo_file_path ${NEMO_FILE_PATH} --model_type gpt --gpus_per_node 1 --tensor_model_parallel_size 1 --pipeline_model_parallel_size 1 --precision bf16-mixed"

Output:

megatron_ckpt_to_nemo.py 243 <module>                                                                                                                                     
convert(local_rank, rank, world_size, args)                                                                                                                               

megatron_ckpt_to_nemo.py 196 convert
model = MegatronGPTModel.load_from_checkpoint(checkpoint_path, hparams_file=args.hparams_file, trainer=trainer)

nlp_model.py 397 load_from_checkpoint
checkpoint = dist_checkpointing.load(sharded_state_dict=checkpoint, checkpoint_dir=checkpoint_dir)

serialization.py 131 load
validate_sharding_integrity(nested_values(sharded_state_dict))

serialization.py 404 validate_sharding_integrity
_validate_sharding_for_key(shardings)

serialization.py 442 _validate_sharding_for_key
raise CheckpointingException(f'Invalid access pattern for {rank_sharding[0][1]}')

megatron.core.dist_checkpointing.core.CheckpointingException:
Invalid access pattern for ShardedTensor(key='model.embedding.word_embeddings.weight')
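If I'm reading serialization.py right, validate_sharding_integrity compares the access pattern gathered from all ranks against the sharding recorded in the checkpoint, so I suspect the world size / TP / PP I'm passing here don't match what the checkpoints were saved with. To check, I tried dumping the non-sharded part of the checkpoint; this assumes the megatron-core distributed checkpoint keeps it in a common.pt file, which is just my reading of the on-disk layout:

# Inspect the non-sharded ("common") part of the distributed checkpoint
# to look for the tensor/pipeline parallel sizes it was saved with.
# ASSUMPTION: the checkpoint directory contains common.pt; adjust if the
# layout differs.
import os
import torch

ckpt_dir = "/path/to/checkpoint_folder/checkpoint_name"  # placeholder
common = torch.load(os.path.join(ckpt_dir, "common.pt"), map_location="cpu")
for key in common:
    print(key)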

Input:

srun -p ${PARTITION} -G 1 \
    --container-image /mnt/nemo_dev.sqsh \
    --container-mounts ${LAUNCHER_SCRIPT_PATH}:${LAUNCHER_SCRIPT_PATH},${NEMO_PATH}:${NEMO_PATH} \
    --container-writable \
    --no-container-mount-home \
    --pty bash -c \
    "python ${NEMO_PATH}/examples/nlp/language_modeling/megatron_ckpt_to_nemo.py --checkpoint_folder ${CKPT_FOLDER} --checkpoint_name ${CKPT_NAME} --nemo_file_path ${NEMO_FILE_PATH} --model_type gpt --gpus_per_node 8 --tensor_model_parallel_size 2 --pipeline_model_parallel_size 1 --precision bf16-mixed"

Output:

megatron_ckpt_to_nemo.py 243 <module>
convert(local_rank, rank, world_size, args)

megatron_ckpt_to_nemo.py 153 convert
trainer = Trainer(plugins=plugins, strategy=strategy, **cfg.trainer)

argparse.py 70 insert_env_defaults
return fn(self, **kwargs)

trainer.py 401 __init__
self._accelerator_connector = _AcceleratorConnector(

accelerator_connector.py 149 __init__
self._check_device_config_and_set_final_flags(devices=devices, num_nodes=num_nodes)

accelerator_connector.py 325 _check_device_config_and_set_final_flags
raise ValueError(f"`num_nodes` must be a positive integer, but got {num_nodes}.")

ValueError:
`num_nodes` must be a positive integer, but got 0.
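My reading of this one, which may be wrong: the conversion script seems to derive num_nodes from the launched world size and --gpus_per_node, so with a single srun task (-G 1) and --gpus_per_node 8 the integer division comes out to zero:

# Hypothetical reconstruction of how num_nodes ends up as 0 in attempt 2.
# ASSUMPTION: the script computes num_nodes = world_size // gpus_per_node.
world_size = 1       # srun launched a single task (-G 1)
gpus_per_node = 8    # the training-time value I passed on the command line
num_nodes = world_size // gpus_per_node
print(num_nodes)     # 0 -> the ValueError raised in accelerator_connector.py

If that's right, I'd need to launch tensor_model_parallel_size * pipeline_model_parallel_size tasks and set --gpus_per_node to the GPUs actually requested rather than the training-time values, which brings me back to the original question of whether a single-GPU or CPU-only conversion is possible.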