NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

No disk space left while loading llama2-70B for SFT #8784

Closed · PurvangL closed this 4 months ago

PurvangL commented 5 months ago

Describe the bug

https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html

Docker image: nvcr.io/nvidia/nemo:24.01.01.framework

I converted the Llama 2 70B HF model to NeMo format following the doc above; the resulting .nemo file is 129 GB, and I have 1.2 TB of disk space. While running SFT on 2x H100 nodes (16 GPUs total), I get the following error:

Traceback (most recent call last):
  File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py", line 225, in main
    model = load_from_nemo(MegatronGPTSFTModel, cfg, trainer, gpt_cfg, modify_confg_fn=_modify_config)
  File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py", line 116, in load_from_nemo
    model = cls.restore_from(
  File "/opt/NeMo/nemo/collections/nlp/models/nlp_model.py", line 465, in restore_from
    return super().restore_from(
  File "/opt/NeMo/nemo/core/classes/modelPT.py", line 450, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 1067, in restore_from
    loaded_params = super().load_config_and_state_dict(
  File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 143, in load_config_and_state_dict
    self._unpack_nemo_file(
  File "/opt/NeMo/nemo/core/connectors/save_restore_connector.py", line 572, in _unpack_nemo_file
    tar.extractall(path=out_folder)
  File "/usr/lib/python3.10/tarfile.py", line 2257, in extractall
    self._extract_one(tarinfo, path, set_attrs=not tarinfo.isdir(),
  File "/usr/lib/python3.10/tarfile.py", line 2324, in _extract_one
    self._handle_fatal_error(e)
  File "/usr/lib/python3.10/tarfile.py", line 2320, in _extract_one
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
  File "/usr/lib/python3.10/tarfile.py", line 2403, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/usr/lib/python3.10/tarfile.py", line 2456, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "/usr/lib/python3.10/tarfile.py", line 255, in copyfileobj
    dst.write(buf)
OSError: [Errno 28] No space left on device


Steps/Code to reproduce bug

Followed the playbook at https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/llama2sft.html and launched SFT with:

WORLD_SIZE=16 srun --kill-on-bad-exit=0 -N 2 --ntasks-per-node=8 --cpus-per-task=24 --ntasks=16 \
  --container-image="docker://nvcr.io#nvidia/nemo:24.01.01.framework" \
  --container-name=nemo_llama_slurm \
  --container-mounts="${_cont_mounts}" \
  python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py \
    trainer.precision=bf16 \
    trainer.devices=8 \
    trainer.num_nodes=2 \
    trainer.val_check_interval=0.1 \
    trainer.max_steps=50 \
    model.restore_from_path=${MODEL} \
    model.micro_batch_size=1 \
    model.global_batch_size=128 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.activations_checkpoint_num_layers=1 \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.megatron_amp_O2=True \
    model.sequence_parallel=True \
    model.activations_checkpoint_granularity=full \
    model.activations_checkpoint_method=uniform \
    model.optim.name=distributed_fused_adam \
    model.optim.lr=5e-6 \
    model.answer_only_loss=True \
    model.data.train_ds.file_names=${TRAIN} \
    model.data.validation_ds.file_names=${VALID} \
    model.data.test_ds.file_names=${TEST} \
    model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
    model.data.train_ds.max_seq_length=2048 \
    model.data.validation_ds.max_seq_length=2048 \
    model.data.train_ds.micro_batch_size=1 \
    model.data.train_ds.global_batch_size=128 \
    model.data.validation_ds.micro_batch_size=1 \
    model.data.validation_ds.global_batch_size=128 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.global_batch_size=256 \
    model.data.train_ds.num_workers=0 \
    model.data.validation_ds.num_workers=0 \
    model.data.test_ds.num_workers=0 \
    model.data.validation_ds.metric.name=loss \
    model.data.test_ds.metric.name=loss \
    exp_manager.create_wandb_logger=False \
    exp_manager.explicit_log_dir=/tmp/results \
    exp_manager.resume_if_exists=True \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.checkpoint_callback_params.monitor=validation_loss \
    exp_manager.checkpoint_callback_params.save_best_model=False \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
    ++cluster_type=BCP


qijiaxing commented 5 months ago

This is because the /tmp folder inside your container does not have enough space. NeMo untars the .nemo file into that folder, and for a 70B model that requires a lot of space. You may mount an empty dir on the host to /tmp in your container.
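For illustration, a minimal sketch of that mount using the pyxis/enroot-style --container-mounts flag already present in the command above. The host path is hypothetical; it must exist on every node and have enough free space for the extracted model (roughly the size of the .nemo file, ~130 GB here):

# hypothetical host scratch directory; create it on every node first
srun -N 2 --ntasks-per-node=1 mkdir -p /raid/scratch/nemo_tmp
# then append it to the container mounts so it backs /tmp inside the container
export _cont_mounts="${_cont_mounts},/raid/scratch/nemo_tmp:/tmp"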

titu1994 commented 5 months ago

Right, it's better to untar such large models with tar -xvf xyz.nemo -C /path and then use the save-restore connector to restore the model, explicitly stating the path of the extracted dir. There are some examples of this in the inference scripts in the LLM directories.
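For reference, a minimal sketch of the untar step with hypothetical paths; as the reporter confirms further down, pointing model.restore_from_path at the extracted directory instead of the packed .nemo file avoids the in-container extraction entirely:

# hypothetical extraction directory on a volume with enough free space
mkdir -p /workspace/llama2-70b-extracted
tar -xvf /workspace/llama2-70b.nemo -C /workspace/llama2-70b-extracted
# then launch SFT against the extracted directory:
#   model.restore_from_path=/workspace/llama2-70b-extracted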

titu1994 commented 5 months ago

Here is an example of building the connector: https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_eval.py#L185-L187, and of passing the connector to restore_from: https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_eval.py#L238

PurvangL commented 5 months ago

Thanks @qijiaxing and @titu1994 for the reply. Untarring the .nemo file and updating model.restore_from_path to the extracted path solves the issue I was facing.

PurvangL commented 5 months ago

@qijiaxing @titu1994, reopening as I am getting the following error after the training finishes all steps.

Command:

WORLD_SIZE=16 srun --kill-on-bad-exit=0 -N 2 --ntasks-per-node=8 --cpus-per-task=24 --ntasks=16 \
  --container-image="docker://nvcr.io#nvidia/nemo:24.01.01.framework" \
  --container-name=nemo_llama_slurm \
  --container-mounts="${_cont_mounts}" \
  python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py \
    trainer.precision=bf16 \
    trainer.devices=8 \
    trainer.num_nodes=2 \
    trainer.val_check_interval=1.0 \
    trainer.max_steps=5 \
    model.restore_from_path=${MODEL} \
    model.micro_batch_size=1 \
    model.global_batch_size=128 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.activations_checkpoint_num_layers=1 \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.megatron_amp_O2=True \
    model.sequence_parallel=False \
    model.activations_checkpoint_granularity=full \
    model.activations_checkpoint_method=uniform \
    model.optim.name=distributed_fused_adam \
    model.optim.lr=5e-6 \
    model.answer_only_loss=True \
    model.data.train_ds.file_names=${TRAIN} \
    model.data.validation_ds.file_names=${VALID} \
    model.data.test_ds.file_names=${TEST} \
    model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
    model.data.train_ds.max_seq_length=512 \
    model.data.validation_ds.max_seq_length=512 \
    model.data.train_ds.micro_batch_size=1 \
    model.data.train_ds.global_batch_size=128 \
    model.data.validation_ds.micro_batch_size=1 \
    model.data.validation_ds.global_batch_size=128 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.global_batch_size=256 \
    model.data.train_ds.num_workers=0 \
    model.data.validation_ds.num_workers=0 \
    model.data.test_ds.num_workers=0 \
    model.data.validation_ds.metric.name=loss \
    model.data.test_ds.metric.name=loss \
    exp_manager.create_wandb_logger=False \
    exp_manager.explicit_log_dir=/workspace/result \
    exp_manager.resume_if_exists=True \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.checkpoint_callback_params.monitor=validation_loss \
    exp_manager.checkpoint_callback_params.save_best_model=False \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=False \
    ++cluster_type=BCP

Error output:
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.                                                                                                                                           
Traceback (most recent call last):                                                                                                                                                                                    
  File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_sft.py", line 236, in main                                                                                                                       
    trainer.fit(model)                                                                                                                                                                                                
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit                                                                                                               
    call._call_and_handle_interrupt(                                                                                                                                                                                  
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt                                                                                            
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)                                                                                                                             
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch                                                                                      
    return function(*args, **kwargs)                                                                                                                                                                                  
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl                                                                                                         
    self._run(model, ckpt_path=ckpt_path)                                                                                                                                                                             
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run                                                                                                              
    results = self._run_stage()                                                                                                                                                                                       
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1023, in _run_stage                                                                                                       
    self.fit_loop.run()                                                                                                                                                                                               
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run                                                                                                                
    self.advance()                                                                                                                                                                                                    
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 355, in advance                                                                                                            
    self.epoch_loop.run(self._data_fetcher)                                                                                                                                                                           
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 134, in run                                                                                                     
    self.on_advance_end()                                                                                                                                                                                             
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 249, in on_advance_end                                                                                          
    self.val_loop.run()                                                                                                                                                                                               
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator                                                                                                        
    return loop_run(self, *args, **kwargs)                                                                                                                                                                            
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 122, in run                                                                                                         
    return self.on_run_end()                                                                                                                                                                                          
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 258, in on_run_end                                                                                                  
    self._on_evaluation_end()                                                                                                                                                                                         
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 303, in _on_evaluation_end                                                                                          
    call._call_callback_hooks(trainer, hook_name, *args, **kwargs)                                                                                                                                                    
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 194, in _call_callback_hooks                                                                                                 
    fn(trainer, trainer.lightning_module, *args, **kwargs)                                                                                                                                                            
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 311, in on_validation_end                                                                                      
    self._save_topk_checkpoint(trainer, monitor_candidates)                                                                                                                                                           
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 360, in _save_topk_checkpoint                                                                                  
    self._save_monitor_checkpoint(trainer, monitor_candidates)                                                                                                                                                        
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 663, in _save_monitor_checkpoint                                                                               
    self._update_best_and_save(current, trainer, monitor_candidates)                                                                                                                                                  
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 714, in _update_best_and_save                                                                                  
    self._save_checkpoint(trainer, filepath)                                                                                                                                                                          
  File "/opt/NeMo/nemo/utils/callbacks/nemo_model_checkpoint.py", line 383, in _save_checkpoint                                                                                                                       
    super()._save_checkpoint(trainer, filepath)                                                                                                                                                                       
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 365, in _save_checkpoint                                                                                       
    trainer.save_checkpoint(filepath, self.save_weights_only)                                                                                                                                                         
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1316, in save_checkpoint                                                                                                  
    self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)                                                                                                  
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 507, in save_checkpoint                                                                           
    self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)                                                                                                                     
  File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 356, in save_checkpoint                                                                                                                          
    dist_checkpointing.save(sharded_state_dict=checkpoint, checkpoint_dir=checkpoint_dir)                                                                                                                             
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 278, in save                                                                                                                        
    save_config(                                                                                                                                                                                                      
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/core.py", line 76, in save_config                                                                                                                           
    with config_path.open('w') as f:
  File "/opt/megatron-lm/megatron/core/dist_checkpointing/core.py", line 76, in save_config                                                                                                                           
    with config_path.open('w') as f:
  File "/usr/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/result/checkpoints/megatron_gpt_sft--validation_loss=1.526-step=5-consumed_samples=640.0/metadata.json'

PurvangL commented 5 months ago

Also, how can I adapt the Tiny Shakespeare dataset?

qijiaxing commented 5 months ago

By default, there is no /workspace/result folder inside the NeMo container. Can you try giving an existing dir to exp_manager.explicit_log_dir?

qijiaxing commented 5 months ago

> Also, how can I adapt the Tiny Shakespeare dataset?

SFT normally requires data in <instruction, response> style, but the dataset you mentioned is not of that type. Maybe you can use it for pretraining instead?
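For illustration, a hypothetical sketch of the <instruction, response> JSONL format used by the SFT playbook linked above (keys assumed to be "input" and "output", matching its dolly-style preprocessing; the file path and the sample pairs are invented). Tiny Shakespeare would have to be recast into pairs like this to be usable for SFT:

# write a tiny hypothetical SFT dataset in "input"/"output" JSONL form
cat > /workspace/data/sft_sample.jsonl <<'EOF'
{"input": "Continue this speech in the style of Shakespeare: But, soft! what light through yonder window breaks?", "output": "It is the east, and Juliet is the sun."}
{"input": "Summarize the plot of Romeo and Juliet in one sentence.", "output": "Two young lovers from feuding families secretly marry, and a chain of misunderstandings ends in both their deaths."}
EOF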

PurvangL commented 5 months ago

> By default, there is no /workspace/result folder inside the NeMo container. Can you try giving an existing dir to exp_manager.explicit_log_dir?

I mount my current working directory as /workspace and I already have a result directory created at that path, so it should be a valid path.
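One way to check, sketched here by reusing the container flags from the training command above: confirm the directory is actually visible and writable inside the container on both nodes. The distributed checkpoint is written by ranks on every node into the same directory, so if /workspace is not backed by a shared filesystem, a path created on one node may simply not exist on the other, which can produce exactly this FileNotFoundError.

srun -N 2 --ntasks-per-node=1 \
  --container-image="docker://nvcr.io#nvidia/nemo:24.01.01.framework" \
  --container-name=nemo_llama_slurm --container-mounts="${_cont_mounts}" \
  bash -c 'mkdir -p /workspace/result/checkpoints && touch /workspace/result/checkpoints/.write_test && df -h /workspace/result'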

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.