Closed PurvangL closed 4 months ago
This happens because the /tmp folder inside your container does not have enough space. NeMo untars the .nemo file into that folder, and a 70B model needs a lot of space. You can mount an empty host directory to /tmp in your container.
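Before launching, it may help to confirm that the directory mounted at /tmp really has room for the extracted checkpoint. The following is a minimal sketch, not part of NeMo: the helper name `has_room_to_extract` and the `safety_factor` margin are hypothetical, and the check assumes the extracted size is close to the archive size, which may not hold if the archive is compressed.

```python
import os
import shutil


def has_room_to_extract(nemo_file: str, extract_dir: str = "/tmp", safety_factor: float = 1.1) -> bool:
    """Return True if extract_dir has enough free space to untar nemo_file.

    safety_factor is a hypothetical margin chosen for illustration,
    not a NeMo setting.
    """
    # Approximate the space needed from the archive size plus a margin.
    needed = os.path.getsize(nemo_file) * safety_factor
    # Free bytes on the filesystem that backs extract_dir.
    free = shutil.disk_usage(extract_dir).free
    return free >= needed
```

Running `has_room_to_extract("/path/to/model.nemo")` inside the container (the path is a placeholder) would show whether the current /tmp mount is large enough.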
Right, it's better to untar such large models with tar -xvf xyz.nemo -C /path and then use the save restore connector to restore the model by explicitly stating the path of the extracted directory. There are some examples of this in the inference scripts in the LLM directories.
Here is an example: https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_eval.py#L185-L187. Then pass the connector to restore_from: https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_eval.py#L238
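The approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the linked script: the helper names `extract_nemo` and `build_connector` are hypothetical, and the NeMo import is done lazily so the extraction helper runs even where NeMo is not installed. The connector usage mirrors the pattern in the linked megatron_gpt_eval.py lines, but check it against your NeMo version.

```python
import os
import tarfile


def extract_nemo(nemo_file: str, target_dir: str) -> str:
    """Untar a .nemo archive into target_dir and return that directory."""
    os.makedirs(target_dir, exist_ok=True)
    # "r:*" lets tarfile detect whether the archive is compressed.
    with tarfile.open(nemo_file, "r:*") as archive:
        archive.extractall(target_dir)
    return target_dir


def build_connector(extracted_dir: str):
    """Build a save/restore connector that points at an extracted model dir."""
    # Lazy import: only needed when NeMo is actually available.
    from nemo.collections.nlp.parts.nlp_overrides import NLPSaveRestoreConnector

    connector = NLPSaveRestoreConnector()
    if os.path.isdir(extracted_dir):
        # Tell the connector to read from the directory instead of untarring.
        connector.model_extracted_dir = extracted_dir
    return connector
```

The returned connector would then be passed as `save_restore_connector=` to the model's `restore_from` call, with `restore_path` set to the extracted directory.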
Thanks @qijiaxing and @titu1994 for the reply. Running the tar command on the .nemo file and updating model.restore_from_path to the extracted path solved the issue I was facing.
@qijiaxing @titu1994, reopening because I get the following error after all training steps have run.
cmd
WORLD_SIZE=16 srun --kill-on-bad-exit=0 -N 2 --ntasks-per-node=8 --cpus-per-task=24 --ntasks=16 --container-image="docker://nvcr.io#nvidia/nemo:24.01.01.framework" --container-name=nemo_llama_slurm --container-mounts="${_cont_mounts}" python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py trainer.precision=bf16 trainer.devices=8 trainer.num_nodes=2 trainer.val_check_interval=1.0 trainer.max_steps=5 model.restore_from_path=${MODEL} model.micro_batch_size=1 model.global_batch_size=128 model.tensor_model_parallel_size=${TP_SIZE} model.activations_checkpoint_num_layers=1 model.pipeline_model_parallel_size=${PP_SIZE} model.megatron_amp_O2=True model.sequence_parallel=False model.activations_checkpoint_granularity=full model.activations_checkpoint_method=uniform model.optim.name=distributed_fused_adam model.optim.lr=5e-6 model.answer_only_loss=True model.data.train_ds.file_names=${TRAIN} model.data.validation_ds.file_names=${VALID} model.data.test_ds.file_names=${TEST} model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} model.data.train_ds.max_seq_length=512 model.data.validation_ds.max_seq_length=512 model.data.train_ds.micro_batch_size=1 model.data.train_ds.global_batch_size=128 model.data.validation_ds.micro_batch_size=1 model.data.validation_ds.global_batch_size=128 model.data.test_ds.micro_batch_size=1 model.data.test_ds.global_batch_size=256 model.data.train_ds.num_workers=0 model.data.validation_ds.num_workers=0 model.data.test_ds.num_workers=0 model.data.validation_ds.metric.name=loss model.data.test_ds.metric.name=loss exp_manager.create_wandb_logger=False exp_manager.explicit_log_dir=/workspace/result exp_manager.resume_if_exists=True exp_manager.resume_ignore_no_checkpoint=True exp_manager.create_checkpoint_callback=True exp_manager.checkpoint_callback_params.monitor=validation_loss exp_manager.checkpoint_callback_params.save_best_model=False exp_manager.checkpoint_callback_params.save_nemo_on_train_end=False 
++cluster_type=BCP
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Traceback (most recent call last):
File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py", line 236, in main
trainer.fit(model)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1023, in _run_stage
self.fit_loop.run()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
self.advance()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 355, in advance
self.epoch_loop.run(self._data_fetcher)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 134, in run
self.on_advance_end()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 249, in on_advance_end
self.val_loop.run()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
return loop_run(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 122, in run
return self.on_run_end()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 258, in on_run_end
self._on_evaluation_end()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 303, in _on_evaluation_end
call._call_callback_hooks(trainer, hook_name, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 194, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 311, in on_validation_end
self._save_topk_checkpoint(trainer, monitor_candidates)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 360, in _save_topk_checkpoint
self._save_monitor_checkpoint(trainer, monitor_candidates)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 663, in _save_monitor_checkpoint
self._update_best_and_save(current, trainer, monitor_candidates)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 714, in _update_best_and_save
self._save_checkpoint(trainer, filepath)
File "/opt/NeMo/nemo/utils/callbacks/nemo_model_checkpoint.py", line 383, in _save_checkpoint
super()._save_checkpoint(trainer, filepath)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 365, in _save_checkpoint
trainer.save_checkpoint(filepath, self.save_weights_only)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1316, in save_checkpoint
self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 507, in save_checkpoint
self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 356, in save_checkpoint
dist_checkpointing.save(sharded_state_dict=checkpoint, checkpoint_dir=checkpoint_dir)
File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 278, in save
save_config(
File "/opt/megatron-lm/megatron/core/dist_checkpointing/core.py", line 76, in save_config
with config_path.open('w') as f:
File "/opt/megatron-lm/megatron/core/dist_checkpointing/core.py", line 76, in save_config
with config_path.open('w') as f:
File "/usr/lib/python3.10/pathlib.py", line 1119, in open
return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/result/checkpoints/megatron_gpt_sft--validation_loss=1.526-step=5-consumed_samples=640.0/metadata.json'
Also, how can I adapt the Tiny Shakespeare dataset?
By default, there is no /workspace/result folder inside the NeMo container. Can you try giving an existing directory to exp_manager.explicit_log_dir?
Also, how can I adapt the Tiny Shakespeare dataset?
SFT normally requires data in the form of <instruction, response> pairs, but the dataset you mentioned is not of this type. Maybe you can use it for pretraining instead?
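To make the <instruction, response> shape concrete, here is a small sketch that writes such pairs as JSON Lines. The helper name `write_sft_jsonl` is hypothetical, and the field names "input" and "output" are an assumption about what the SFT data config expects; verify them against your own configuration before use.

```python
import json


def write_sft_jsonl(pairs, path):
    """Write (instruction, response) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for instruction, response in pairs:
            # "input" and "output" are assumed field names, not confirmed by this thread.
            record = {"input": instruction, "output": response}
            f.write(json.dumps(record) + "\n")
    return path
```

A plain-text corpus has no natural instruction/response split, which is why it fits pretraining more readily than SFT.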
By default, there is no /workspace/result folder inside NeMo container. Can you try give an existing dir to exp_manager.explicit_log_dir
I mount my current working directory as /workspace, and I have already created a result directory at that path, so it should be a valid path.
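One way to double-check this is to probe the directory from inside the container on every node, since a mount can be visible on one node and missing on another. The sketch below is an illustration only: `check_log_dir` is a hypothetical helper, and it tests the log directory itself, not the per-checkpoint subdirectory named in the traceback.

```python
import os
import socket
import tempfile


def check_log_dir(log_dir: str) -> dict:
    """Report whether log_dir exists and is writable from this process."""
    report = {
        "host": socket.gethostname(),
        "path": log_dir,
        "exists": os.path.isdir(log_dir),
        "writable": False,
    }
    if report["exists"]:
        try:
            # Create and remove a small probe file, much as a checkpoint write would.
            with tempfile.NamedTemporaryFile(dir=log_dir):
                report["writable"] = True
        except OSError:
            pass
    return report
```

Printing `check_log_dir("/workspace/result")` from each task in the srun job would show whether every node sees the same writable directory.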
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Describe the bug https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html
docker image: nvcr.io/nvidia/nemo:24.01.01.framework
I converted the Llama2-70B HF model to NeMo format using the above doc; the resulting file is 129GB. I have 1.2T of disk space. While running on 2 H100 nodes with 16 GPUs total, I get the following error.
Steps/Code to reproduce bug: followed the doc
https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html