NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Llama SFT - Single Dataset File Not Working Even With ${CONCAT_SAMPLING_PROBS}="[1]" #8092

Closed: taozhang9527 closed this issue 9 months ago

taozhang9527 commented 10 months ago

Describe the bug

I am following the official instructions to fine-tune the Llama-7b-hf model on an 8xH100 machine. It gives me the following error even though I set ${CONCAT_SAMPLING_PROBS}="[1]":

  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 941, in _run
    call._call_setup_hook(self)  # allow user to setup lightning_module in accelerator environment
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 86, in _call_setup_hook
    _call_lightning_module_hook(trainer, "setup", stage=fn)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 145, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py", line 184, in setup
    self.build_train_valid_test_datasets(stage=stage)
  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py", line 754, in build_train_valid_test_datasets
    self._validation_ds = self._build_dataset(self.cfg.data.validation_ds, is_train=False)
  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py", line 218, in _build_dataset
    raise ValueError(f"SFT train/validation datasets must be provided as a list of individual JSONL files.")
ValueError: SFT train/validation datasets must be provided as a list of individual JSONL files.

Steps/Code to reproduce bug

export ${CONCAT_SAMPLING_PROBS}="[1]"

torchrun --nproc_per_node=8 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py \
   trainer.precision=bf16 \
   trainer.devices=8 \
   trainer.num_nodes=1 \
   trainer.val_check_interval=0.1 \
   trainer.max_steps=50 \
   model.restore_from_path=${MODEL} \
   model.micro_batch_size=1 \
   model.global_batch_size=128 \
   model.tensor_model_parallel_size=${TP_SIZE} \
   model.pipeline_model_parallel_size=${PP_SIZE} \
   model.megatron_amp_O2=True \
   model.sequence_parallel=True \
   model.activations_checkpoint_granularity=selective \
   model.activations_checkpoint_method=uniform \
   model.optim.name=distributed_fused_adam \
   model.optim.lr=5e-6 \
   model.answer_only_loss=True \
   model.data.train_ds.file_names=${TRAIN_DS} \
   model.data.validation_ds.file_names=${VALID_DS} \
   model.data.test_ds.file_names=${TEST_DS} \
   model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
   model.data.train_ds.max_seq_length=2048 \
   model.data.validation_ds.max_seq_length=2048 \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=128 \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=128 \
   model.data.test_ds.micro_batch_size=1 \
   model.data.test_ds.global_batch_size=256 \
   model.data.train_ds.num_workers=0 \
   model.data.validation_ds.num_workers=0 \
   model.data.test_ds.num_workers=0 \
   model.data.validation_ds.metric.name=loss \
   model.data.test_ds.metric.name=loss \
   exp_manager.create_wandb_logger=False \
   exp_manager.explicit_log_dir=/results \
   exp_manager.resume_if_exists=True \
   exp_manager.resume_ignore_no_checkpoint=True \
   exp_manager.create_checkpoint_callback=True \
   exp_manager.checkpoint_callback_params.monitor=validation_loss \
   exp_manager.checkpoint_callback_params.save_best_model=False \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
   ++cluster_type=BCP
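
For reference, here is a minimal sketch of how these shell variables are conventionally set before launching. The dataset paths are placeholders I introduce for illustration (MODEL, TP_SIZE and PP_SIZE match the values visible in the override dump further down); note that the assignment form is export CONCAT_SAMPLING_PROBS="[1]", without ${...} around the variable name, and that file lists use bracket syntax even for a single JSONL file:

# Sketch with placeholder dataset paths -- adjust to the actual JSONL files.
MODEL="/workspace/llama2-7b.nemo"
TP_SIZE=8
PP_SIZE=1
TRAIN_DS="[/path/to/train.jsonl]"        # list syntax even for a single file
VALID_DS="[/path/to/validation.jsonl]"
TEST_DS="[/path/to/test.jsonl]"
export CONCAT_SAMPLING_PROBS="[1]"       # assignment form: no ${...} around the name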

It seems the following check in megatron_gpt_sft_model.py is always triggered:

    def _build_dataset(self, data_cfg, is_train=True):
        datasets = []
        # Determine if we are using a single dataset or a list of datasets.
        is_list_config = isinstance(data_cfg.file_names, ListConfig)
        if not is_list_config:
            raise ValueError(f"SFT train/validation datasets must be provided as a list of individual JSONL files.")
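
For what it's worth, here is a minimal OmegaConf sketch (my own illustration, not NeMo code) of why this check fires: a bare path, or the empty value left behind by an unset shell variable, parses to a string rather than a ListConfig, so even a single file has to be passed in list syntax such as model.data.train_ds.file_names=[/path/to/train.jsonl].

from omegaconf import OmegaConf, ListConfig

# A plain string (or the empty string produced by an unset shell variable) is not a ListConfig.
single = OmegaConf.create({"file_names": "/path/to/train.jsonl"})
# The same file written in list syntax becomes a one-element ListConfig and passes the check.
as_list = OmegaConf.create({"file_names": ["/path/to/train.jsonl"]})

print(isinstance(single.file_names, ListConfig))   # False -> raises the ValueError above
print(isinstance(as_list.file_names, ListConfig))  # True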

Environment overview

I am using the Docker image nvcr.io/ea-bignlp/ga-participants/nemofw-training:23.08.03. The OS is Ubuntu 20.04.6 LTS.

taozhang9527 commented 10 months ago

I tried the latest Docker image, nemofw-training:23.11. It still fails, but the error message is somewhat different.

[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1554] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1619] The model: MegatronGPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_base_model:821] The model: MegatronGPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1554] apply_query_key_layer_scaling is only enabled when using FP16, setting it to False and setting NVTE_APPLY_QK_LAYER_SCALING=0
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1619] The model: MegatronGPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1619] The model: MegatronGPTSFTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 megatron_gpt_model:1619] The model: MegatronGPTSFTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-01-04 04:44:03 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py:643: UserWarning: To guarantee overlapping TP and SP collectives with the backward GEMMs, set environment variable CUDA_DEVICE_MAX_CONNECTIONS = 1
      warnings.warn(

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Error executing job with overrides: ['trainer.precision=bf16', 'trainer.devices=8', 'trainer.num_nodes=1', 'trainer.val_check_interval=0.1', 'trainer.max_steps=50', 'model.restore_from_path=/workspace/llama2-7b.nemo', 'model.micro_batch_size=1', 'model.global_batch_size=128', 'model.tensor_model_parallel_size=8', 'model.pipeline_model_parallel_size=1', 'model.megatron_amp_O2=True', 'model.sequence_parallel=True', 'model.activations_checkpoint_granularity=selective', 'model.activations_checkpoint_method=uniform', 'model.optim.name=distributed_fused_adam', 'model.optim.lr=5e-6', 'model.answer_only_loss=True', 'model.data.train_ds.file_names=', 'model.data.validation_ds.file_names=', 'model.data.test_ds.file_names=', 'model.data.train_ds.concat_sampling_probabilities=[1]', 'model.data.train_ds.max_seq_length=2048', 'model.data.validation_ds.max_seq_length=2048', 'model.data.train_ds.micro_batch_size=1', 'model.data.train_ds.global_batch_size=128', 'model.data.validation_ds.micro_batch_size=1', 'model.data.validation_ds.global_batch_size=128', 'model.data.test_ds.micro_batch_size=1', 'model.data.test_ds.global_batch_size=256', 'model.data.train_ds.num_workers=0', 'model.data.validation_ds.num_workers=0', 'model.data.test_ds.num_workers=0', 'model.data.validation_ds.metric.name=loss', 'model.data.test_ds.metric.name=loss', 'exp_manager.create_wandb_logger=False', 'exp_manager.explicit_log_dir=/results', 'exp_manager.resume_if_exists=True', 'exp_manager.resume_ignore_no_checkpoint=True', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.checkpoint_callback_params.monitor=validation_loss', 'exp_manager.checkpoint_callback_params.save_best_model=False', 'exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True', '++cluster_type=BCP']
Traceback (most recent call last):
  File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py", line 218, in main
    model = load_from_nemo(MegatronGPTSFTModel, cfg, trainer, gpt_cfg, modify_confg_fn=_modify_config)
  File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py", line 113, in load_from_nemo
    model = cls.restore_from(
  File "/opt/NeMo/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/opt/NeMo/nemo/core/classes/modelPT.py", line 442, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 742, in restore_from
    checkpoint = dist_checkpointing.load(
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 75, in load
    sharded_objects, sharded_state_dict = load_sharded_objects(sharded_state_dict, checkpoint_dir)
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 129, in load_sharded_objects
    return dict_list_map_inplace(load_sharded_object, sharded_objects), sharded_state_dict
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/dict_utils.py", line 170, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/dict_utils.py", line 170, in dict_list_map_inplace
    x[k] = dict_list_map_inplace(f, v)
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/dict_utils.py", line 174, in dict_list_map_inplace
    return f(x)
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 126, in load_sharded_object
    loaded_obj = torch.load(load_path)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 988, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 437, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 418, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpg9bo8424/model_weights/model.module.decoder.layers.mlp.linear_fc1._extra_state/shard_0_32.pt'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2024-01-04 04:44:32,206] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1890 closing signal SIGTERM
[2024-01-04 04:44:32,207] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1891 closing signal SIGTERM
[2024-01-04 04:44:32,207] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1892 closing signal SIGTERM
[2024-01-04 04:44:32,208] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1893 closing signal SIGTERM
[2024-01-04 04:44:32,209] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1894 closing signal SIGTERM
[2024-01-04 04:44:32,209] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1895 closing signal SIGTERM
[2024-01-04 04:44:32,210] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1896 closing signal SIGTERM
[2024-01-04 04:44:32,325] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1889) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-04_04:44:32
  host      : node073.cm.cluster
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1889)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
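
The missing file sits under the temporary directory into which the .nemo archive is unpacked during restore_from, so the question is whether the converted llama2-7b.nemo contains those distributed-checkpoint entries at all. As a hedged diagnostic (my own, not suggested in the thread), a .nemo checkpoint is a tar archive, so its contents can be listed directly:

import tarfile

NEMO_PATH = "/workspace/llama2-7b.nemo"  # path taken from the overrides above

# A .nemo file is a tar archive; list the model_weights entries that carry
# distributed-checkpoint extra state, which is what restore_from fails to find.
with tarfile.open(NEMO_PATH, "r:*") as archive:
    for name in archive.getnames():
        if "model_weights" in name and "_extra_state" in name:
            print(name)

If no such entries show up, the .nemo was likely produced with a conversion script from an older container, and re-converting the Hugging Face checkpoint with the same 23.11 container used for training would be a reasonable first thing to try.
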
github-actions[bot] commented 9 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 9 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.