Closed: taozhang9527 closed this issue 9 months ago.
Tried the latest Docker image, nemofw-training:23.11 (pulled and launched roughly as in the sketch below). It still failed, but the error message is somewhat different; the full log follows the sketch.
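For reference, a sketch of the assumed pull/launch commands. The registry path is borrowed from the 23.08.03 image named in the original report below, and the mount paths are placeholders that map onto the /workspace and /results paths seen in the log; adjust both to your setup.

```bash
# Assumed commands for pulling and starting the training container.
# Registry path comes from the 23.08.03 image in this thread; mounts are placeholders.
docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.11
docker run --gpus all --ipc=host --rm -it \
    -v /path/to/data:/workspace \
    -v /path/to/results:/results \
    nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.11 bash
```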
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1554] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1619] The model: MegatronGPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1554] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1619] The model: MegatronGPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1619] The model: MegatronGPTSFTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1619] The model: MegatronGPTSFTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py:643: UserWarning: To guarantee overlapping TP and SP collectives with the backwardGEMMs, set environment variable CUDA_DEVICE_MAX_CONNECTIONS = 1
    warnings.warn(
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Error executing job with overrides: ['trainer.precision=bf16', 'trainer.devices=8', 'trainer.num_nodes=1', 'trainer.val_check_interval=0.1', 'trainer.max_steps=50', 'model.restore_from_path=/workspace/llama2-7b.nemo', 'model.micro_batch_size=1', 'model.global_batch_size=128', 'model.tensor_model_parallel_size=8', 'model.pipeline_model_parallel_size=1', 'model.megatron_amp_O2=True', 'model.sequence_parallel=True', 'model.activations_checkpoint_granularity=selective', 'model.activations_checkpoint_method=uniform', 'model.optim.name=distributed_fused_adam', 'model.optim.lr=5e-6', 'model.answer_only_loss=True', 'model.data.train_ds.file_names=', 'model.data.validation_ds.file_names=', 'model.data.test_ds.file_names=', 'model.data.train_ds.concat_sampling_probabilities=[1]', 'model.data.train_ds.max_seq_length=2048', 'model.data.validation_ds.max_seq_length=2048', 'model.data.train_ds.micro_batch_size=1', 'model.data.train_ds.global_batch_size=128', 'model.data.validation_ds.micro_batch_size=1', 'model.data.validation_ds.global_batch_size=128', 'model.data.test_ds.micro_batch_size=1', 'model.data.test_ds.global_batch_size=256', 'model.data.train_ds.num_workers=0', 'model.data.validation_ds.num_workers=0', 'model.data.test_ds.num_workers=0', 'model.data.validation_ds.metric.name=loss', 'model.data.test_ds.metric.name=loss', 'exp_manager.create_wandb_logger=False', 'exp_manager.explicit_log_dir=/results', 'exp_manager.resume_if_exists=True', 'exp_manager.resume_ignore_no_checkpoint=True', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.checkpoint_callback_params.monitor=validation_loss', 'exp_manager.checkpoint_callback_params.save_best_model=False', 'exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True', '++cluster_type=BCP']
Traceback (most recent call last):
  File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py", line 218, in main
    model = load_from_nemo(MegatronGPTSFTModel, cfg, trainer, gpt_cfg, modify_confg_fn=_modify_config)
  File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py", line 113, in load_from_nemo
    model = cls.restore_from(
  File "/opt/NeMo/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/opt/NeMo/nemo/core/classes/modelPT.py", line 442, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 742, in restore_from
    checkpoint = dist_checkpointing.load(
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 75, in load
    sharded_objects, sharded_state_dict = load_sharded_objects(sharded_state_dict, checkpoint_dir)
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 129, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/dict_utils.py", line 170, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/dict_utils.py", line 170, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/dict_utils.py", line 174, in dict_list_map_inplace
    return f(x)
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 126, in load_sharded_object
    loaded_obj = torch.load(load_path)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 988, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 437, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 418, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpg9bo8424/model_weights/model.module.decoder.layers.mlp.linear_fc1._extra_state/shard_0_32.pt'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2024-01-04 04:44:32,206] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1890 closing signal SIGTERM
[2024-01-04 04:44:32,207] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1891 closing signal SIGTERM
[2024-01-04 04:44:32,207] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1892 closing signal SIGTERM
[2024-01-04 04:44:32,208] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1893 closing signal SIGTERM
[2024-01-04 04:44:32,209] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1894 closing signal SIGTERM
[2024-01-04 04:44:32,209] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1895 closing signal SIGTERM
[2024-01-04 04:44:32,210] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1896 closing signal SIGTERM
[2024-01-04 04:44:32,325] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1889) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-04_04:44:32
host : node073.cm.cluster
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1889)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
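The root FileNotFoundError above refers to an _extra_state shard inside the temporary directory where the .nemo file is unpacked during restore. As a quick sanity check (a sketch, assuming the .nemo checkpoint is the usual plain tar archive), you can list the archive and confirm whether that shard is actually present:

```bash
# List the distributed-checkpoint contents of the .nemo archive and look for
# the _extra_state entry named in the FileNotFoundError.
tar -tf /workspace/llama2-7b.nemo | head
tar -tf /workspace/llama2-7b.nemo | grep "linear_fc1._extra_state" | head
```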
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Describe the bug
I am following the official instructions to fine-tune the Llama-7b-hf model on an 8xH100 machine. It gave me the following error even though I set ${CONCAT_SAMPLING_PROBS}="[1]".
Steps/Code to reproduce bug
export ${CONCAT_SAMPLING_PROBS}="[1]"
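For context, here is a minimal sketch of how this variable is typically set and passed to the SFT script, with the script path and .nemo path taken from the log above and a hypothetical training file standing in for the real dataset. Note that export ${CONCAT_SAMPLING_PROBS}="[1]" expands the (empty) variable name instead of assigning to it, so the assignment is usually written without ${}:

```bash
# Hypothetical dataset path; the script and .nemo paths come from the log above.
TRAIN_DS="[/workspace/data/train.jsonl]"
CONCAT_SAMPLING_PROBS="[1]"   # one probability per file listed in TRAIN_DS

torchrun --nproc_per_node=8 \
  /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py \
  trainer.devices=8 \
  model.restore_from_path=/workspace/llama2-7b.nemo \
  model.data.train_ds.file_names="${TRAIN_DS}" \
  model.data.train_ds.concat_sampling_probabilities="${CONCAT_SAMPLING_PROBS}"
```

In the override list captured in the error above, model.data.train_ds.file_names is empty while concat_sampling_probabilities=[1]; the two lists are generally expected to line up one entry per training file.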
It seems the following code in /megatron_gpt_sft_model.py is always triggered.
Environment overview
I am using the Docker image nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03. The OS is Ubuntu 20.04.6 LTS.