aws-samples / amazon-eks-machine-learning-with-terraform-and-kubeflow

Distributed training using Kubeflow on Amazon EKS
Apache License 2.0

Neuronx distributed Llama2 7B PyTorch Lightning example has a fatal error during checkpointing #93

Open ajayvohra2005 opened 5 months ago

ajayvohra2005 commented 5 months ago

The Neuronx distributed Llama2 7B PyTorch Lightning example fails with a fatal error when trying to save a checkpoint after 100 global steps.
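The failure appears to be an `os.mkdir` call inside `torch_xla.utils.serialization` raising `FileNotFoundError`, because the parent versioned checkpoint directory does not exist when a rank tries to create its per-rank `*.tensors` folder. Below is a minimal sketch of that behavior with hypothetical paths (not the example's or torch_xla's code), just to illustrate the failure mode seen in the traceback:

```python
import os
import tempfile

# Minimal sketch (hypothetical paths): os.mkdir() for a rank's *.tensors
# folder raises FileNotFoundError when the parent versioned .ckpt
# directory is missing, which matches the traceback below.
root = tempfile.mkdtemp()
ckpt_dir = os.path.join(root, "epoch=0-step=100-v2.ckpt")  # parent never created
tensors_dir = os.path.join(ckpt_dir, "tp_rank_01_pp_rank_00_dp_rank_00.tensors")

try:
    os.mkdir(tensors_dir)
except FileNotFoundError as err:
    print(err)  # [Errno 2] No such file or directory: '.../tp_rank_01_...tensors'
```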

Logs are included below:

...

Epoch 0:  41%|████▏     | 6144/14876 [1:06:58<1:35:10,  1.53it/s, v_num=0, loss=6.000, lr=0.000285, input_ids=5.21e+7, throughput=24.60, global_step_step=95.00]step 96 loss is 6.002302169799805, lr is 0.00028799999999999995, throughput 24.631530041674928 seq/s,  input_ids 35873595, norm tensor([3.2969], device='xla:0'), global rank 0
Epoch 0:  42%|████▏     | 6208/14876 [1:07:39<1:34:28,  1.53it/s, v_num=0, loss=6.000, lr=0.000288, input_ids=3.59e+7, throughput=24.60, global_step_step=96.00]step 97 loss is 5.956272125244141, lr is 0.00029099999999999997, throughput 24.63089836600094 seq/s,  input_ids 37864961, norm tensor([3.1562], device='xla:0'), global rank 0
Epoch 0:  42%|████▏     | 6272/14876 [1:08:21<1:33:46,  1.53it/s, v_num=0, loss=5.960, lr=0.000291, input_ids=3.79e+7, throughput=24.60, global_step_step=97.00]step 98 loss is 5.935997486114502, lr is 0.000294, throughput 24.63518937768065 seq/s,  input_ids 54373382, norm tensor([2.7344], device='xla:0'), global rank 0
Epoch 0:  43%|████▎     | 6336/14876 [1:09:02<1:33:03,  1.53it/s, v_num=0, loss=5.940, lr=0.000294, input_ids=5.44e+7, throughput=24.60, global_step_step=98.00]step 99 loss is 5.935565948486328, lr is 0.00029699999999999996, throughput 24.63476866876432 seq/s,  input_ids 39750648, norm tensor([3.1250], device='xla:0'), global rank 0
Epoch 0:  43%|████▎     | 6400/14876 [1:09:44<1:32:21,  1.53it/s, v_num=0, loss=5.940, lr=0.000297, input_ids=3.98e+7, throughput=24.60, global_step_step=99.00][2024-04-10 16:02:47.461: I neuronx_distributed/parallel_layers/checkpointing.py:75] saving checkpoint to /efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v1.ckpt
Traceback (most recent call last):
  File "/tmp/tmp/pytorchjob-nxd-llama2-7b-ptl-master-0/examples/training/llama2/lightning/run_llama_nxd_ptl.py", line 373, in <module>
    _mp_fn(0, args)
  File "/tmp/tmp/pytorchjob-nxd-llama2-7b-ptl-master-0/examples/training/llama2/lightning/run_llama_nxd_ptl.py", line 224, in _mp_fn
    train_llama(args)
  File "/tmp/tmp/pytorchjob-nxd-llama2-7b-ptl-master-0/examples/training/llama2/lightning/run_llama_nxd_ptl.py", line 218, in train_llama
    trainer.fit(model=model, datamodule=dm)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/lightning/launcher.py", line 71, in launch
    results = function(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1036, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 259, in advance
    call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 303, in on_train_batch_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 368, in _save_topk_checkpoint
    self._save_monitor_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 681, in _save_monitor_checkpoint
    self._update_best_and_save(current, trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 733, in _update_best_and_save
    self._save_checkpoint(trainer, filepath)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 373, in _save_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1384, in save_checkpoint
    self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/lightning/strategy.py", line 195, in save_checkpoint
    self.checkpoint_io.save_checkpoint(
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/lightning/checkpoint_io.py", line 67, in save_checkpoint
    save(
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/checkpointing.py", line 100, in save
    xser.save(checkpoint, chkpt_path, (not master_only), global_master=True)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/utils/serialization.py", line 74, in save
    ref_data = _rewrite_data(_get_tensors_folder(path), data, should_write_data)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/utils/serialization.py", line 42, in _rewrite_data
    os.mkdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_04_pp_rank_00_dp_rank_00.tensors'
FileNotFoundError: [Errno 2] No such file or directory: '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_03_pp_rank_00_dp_rank_00.tensors'
FileNotFoundError: [Errno 2] No such file or directory: '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_01_pp_rank_00_dp_rank_00.tensors'
[2024-04-10 16:02:56,192] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 252 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 254 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 257 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 258 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 259 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 260 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 261 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 262 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 263 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 264 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 265 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 266 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 267 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 268 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 269 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 270 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 271 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 272 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 273 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 274 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 275 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 276 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 277 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 278 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 279 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 280 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 281 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 282 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 283 closing signal SIGTERM
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 8] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 30] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 25] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 31] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 28] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 14] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 27] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 9] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 26] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 19] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 7] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 22] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 13] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 16] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 29] Received SIGTERM: 15
[2024-04-10 16:03:26,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 252 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 10] Received SIGTERM: 15
[2024-04-10 16:03:40,240] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 254 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 24] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 23] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 17] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 20] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 6] Received SIGTERM: 15
[2024-04-10 16:03:50,705] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 257 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:03:51,563] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 258 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:03:53,691] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 259 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 11] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 18] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 12] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 15] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 21] Received SIGTERM: 15
[2024-04-10 16:04:01,468] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 260 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:02,309] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 261 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:10,826] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 262 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:21,412] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 263 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:34,970] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 264 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:36,068] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 265 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:00,717] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 266 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:01,508] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 267 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:03,732] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 268 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:06,853] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 269 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:23,834] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 270 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:44,821] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 271 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:04,373] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 272 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:05,174] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 273 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:05,954] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 274 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:07,961] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 275 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:08,895] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 276 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:09,811] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 277 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:12,247] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 278 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:20,465] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 279 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:21,228] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 280 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:22,496] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 281 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:23,269] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 282 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:24,205] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 283 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:27,085] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 253) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_llama_nxd_ptl.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-10_16:02:56
  host      : pytorchjob-nxd-llama2-7b-ptl-master-0
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 255)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-04-10_16:02:56
  host      : pytorchjob-nxd-llama2-7b-ptl-master-0
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 256)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-10_16:02:56
  host      : pytorchjob-nxd-llama2-7b-ptl-master-0
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 253)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@attach-pvc:/efs/home/nxd-llama2-7b-ptl/logs/0# ls -al /efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_01_pp_rank_00_dp_rank_00.tensors
ls: cannot access '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_01_pp_rank_00_dp_rank_00.tensors': No such file or directory
root@attach-pvc:/efs/home/nxd-llama2-7b-ptl/logs/0# ls -al /efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt
ls: cannot access '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt': No such file or directory
root@attach-pvc:/efs/home/nxd-llama2-7b-ptl/logs/0# ls -al /efs/home/nxd-llama2-7b-ptl/checkpoints/
total 24
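The checks above confirm that the versioned checkpoint directory `epoch=0-step=100-v2.ckpt` was never created on the shared EFS volume. One observation from the logs: neuronx_distributed reports saving to `epoch=0-step=100-v1.ckpt`, while the failing paths reference `epoch=0-step=100-v2.ckpt`, so the ranks may not all be targeting the same versioned checkpoint directory. Below is a hedged sketch of a defensive guard, with a hypothetical `ensure_parent_dir` helper that is not part of this repository or torch_xla, showing how a rank could tolerate a missing or concurrently created parent directory before writing its per-rank tensors folder:

```python
import os

# Hypothetical defensive guard (not code from this example or torch_xla):
# ensure the versioned .ckpt directory exists before a rank creates its
# per-rank tensors folder, tolerating concurrent creation by other ranks.
def ensure_parent_dir(tensors_path: str) -> None:
    parent = os.path.dirname(tensors_path)      # .../epoch=0-step=100-v2.ckpt
    os.makedirs(parent, exist_ok=True)          # no-op if another rank created it first

if __name__ == "__main__":
    demo = "/tmp/nxd-demo/epoch=0-step=100-v2.ckpt/tp_rank_01_pp_rank_00_dp_rank_00.tensors"
    ensure_parent_dir(demo)
    if not os.path.isdir(demo):
        os.mkdir(demo)                          # the call that failed above now succeeds
    print(os.path.isdir(demo))                  # True
```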