The Neuronx Distributed Llama2 7B PyTorch Lightning example crashes with a fatal FileNotFoundError when it tries to save a checkpoint after 100 global steps.
Logs are included below:
...
Epoch 0: 41%|████▏ | 6144/14876 [1:06:58<1:35:10, 1.53it/s, v_num=0, loss=6.000, lr=0.000285, input_ids=5.21e+7, throughput=24.60, global_step_step=95.00]step 96 loss is 6.002302169799805, lr is 0.00028799999999999995, throughput 24.631530041674928 seq/s, input_ids 35873595, norm tensor([3.2969], device='xla:0'), global rank 0
Epoch 0: 42%|████▏ | 6208/14876 [1:07:39<1:34:28, 1.53it/s, v_num=0, loss=6.000, lr=0.000288, input_ids=3.59e+7, throughput=24.60, global_step_step=96.00]step 97 loss is 5.956272125244141, lr is 0.00029099999999999997, throughput 24.63089836600094 seq/s, input_ids 37864961, norm tensor([3.1562], device='xla:0'), global rank 0
Epoch 0: 42%|████▏ | 6272/14876 [1:08:21<1:33:46, 1.53it/s, v_num=0, loss=5.960, lr=0.000291, input_ids=3.79e+7, throughput=24.60, global_step_step=97.00]step 98 loss is 5.935997486114502, lr is 0.000294, throughput 24.63518937768065 seq/s, input_ids 54373382, norm tensor([2.7344], device='xla:0'), global rank 0
Epoch 0: 43%|████▎ | 6336/14876 [1:09:02<1:33:03, 1.53it/s, v_num=0, loss=5.940, lr=0.000294, input_ids=5.44e+7, throughput=24.60, global_step_step=98.00]step 99 loss is 5.935565948486328, lr is 0.00029699999999999996, throughput 24.63476866876432 seq/s, input_ids 39750648, norm tensor([3.1250], device='xla:0'), global rank 0
Epoch 0: 43%|████▎ | 6400/14876 [1:09:44<1:32:21, 1.53it/s, v_num=0, loss=5.940, lr=0.000297, input_ids=3.98e+7, throughput=24.60, global_step_step=99.00][2024-04-10 16:02:47.461: I neuronx_distributed/parallel_layers/checkpointing.py:75] saving checkpoint to /efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v1.ckpt
Traceback (most recent call last):
  File "/tmp/tmp/pytorchjob-nxd-llama2-7b-ptl-master-0/examples/training/llama2/lightning/run_llama_nxd_ptl.py", line 373, in <module>
    _mp_fn(0, args)
  File "/tmp/tmp/pytorchjob-nxd-llama2-7b-ptl-master-0/examples/training/llama2/lightning/run_llama_nxd_ptl.py", line 224, in _mp_fn
    train_llama(args)
  File "/tmp/tmp/pytorchjob-nxd-llama2-7b-ptl-master-0/examples/training/llama2/lightning/run_llama_nxd_ptl.py", line 218, in train_llama
    trainer.fit(model=model, datamodule=dm)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/lightning/launcher.py", line 71, in launch
    results = function(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1036, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 259, in advance
    call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 303, in on_train_batch_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 368, in _save_topk_checkpoint
    self._save_monitor_checkpoint(trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 681, in _save_monitor_checkpoint
    self._update_best_and_save(current, trainer, monitor_candidates)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 733, in _update_best_and_save
    self._save_checkpoint(trainer, filepath)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 373, in _save_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only)
  File "/usr/local/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1384, in save_checkpoint
    self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/lightning/strategy.py", line 195, in save_checkpoint
    self.checkpoint_io.save_checkpoint(
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/lightning/checkpoint_io.py", line 67, in save_checkpoint
    save(
  File "/usr/local/lib/python3.10/site-packages/neuronx_distributed/parallel_layers/checkpointing.py", line 100, in save
    xser.save(checkpoint, chkpt_path, (not master_only), global_master=True)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/utils/serialization.py", line 74, in save
    ref_data = _rewrite_data(_get_tensors_folder(path), data, should_write_data)
  File "/usr/local/lib/python3.10/site-packages/torch_xla/utils/serialization.py", line 42, in _rewrite_data
    os.mkdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_01_pp_rank_00_dp_rank_00.tensors'

The other failing ranks raise the same error on their own tensor folders:
FileNotFoundError: [Errno 2] No such file or directory: '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_03_pp_rank_00_dp_rank_00.tensors'
FileNotFoundError: [Errno 2] No such file or directory: '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_04_pp_rank_00_dp_rank_00.tensors'
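The immediate failure is `os.mkdir` inside `torch_xla.utils.serialization._rewrite_data`: it tries to create the per-rank `.tensors` folder while the parent checkpoint directory `epoch=0-step=100-v2.ckpt` does not exist. Note also that rank 0's log line above reports saving to `...-v1.ckpt`, while the failing ranks attempt `...-v2.ckpt`. A minimal sketch of the failing pattern (the paths here are illustrative, not the real checkpoint layout):

```python
import os
import tempfile

# The failing call in the xser.save path is a plain os.mkdir() of
# "<checkpoint>.ckpt/<rank>.tensors". os.mkdir does not create missing
# parent directories, so if the ".ckpt" directory is absent (e.g. never
# created, or removed/renamed by another rank), it raises
# FileNotFoundError, matching the traceback.
root = tempfile.mkdtemp()
ckpt_dir = os.path.join(root, "epoch=0-step=100-v2.ckpt")  # intentionally never created
tensors_dir = os.path.join(ckpt_dir, "tp_rank_01_pp_rank_00_dp_rank_00.tensors")

try:
    os.mkdir(tensors_dir)
    outcome = "created"
except FileNotFoundError:
    outcome = "FileNotFoundError"

# os.makedirs(..., exist_ok=True) creates intermediate directories and
# tolerates concurrent creation by other ranks; this is one way a patched
# save path could sidestep the missing-parent race.
os.makedirs(tensors_dir, exist_ok=True)
```

This only demonstrates the filesystem behavior; whether the parent directory is missing because of a rank-level race or because another rank removed it is what needs diagnosing.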
[2024-04-10 16:02:56,192] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 252 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 254 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 257 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 258 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 259 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 260 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 261 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 262 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 263 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 264 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 265 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 266 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 267 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 268 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 269 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 270 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 271 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 272 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 273 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 274 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 275 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 276 closing signal SIGTERM
[2024-04-10 16:02:56,193] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 277 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 278 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 279 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 280 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 281 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 282 closing signal SIGTERM
[2024-04-10 16:02:56,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 283 closing signal SIGTERM
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 8] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 30] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 25] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 31] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 28] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 14] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 27] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 9] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 26] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 19] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 7] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 22] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 13] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 16] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 29] Received SIGTERM: 15
[2024-04-10 16:03:26,194] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 252 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 10] Received SIGTERM: 15
[2024-04-10 16:03:40,240] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 254 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 24] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 23] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 17] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 20] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 6] Received SIGTERM: 15
[2024-04-10 16:03:50,705] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 257 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:03:51,563] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 258 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:03:53,691] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 259 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 11] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 18] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 12] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 15] Received SIGTERM: 15
INFO:pytorch_lightning.trainer.connectors.signal_connector:[rank: 21] Received SIGTERM: 15
[2024-04-10 16:04:01,468] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 260 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:02,309] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 261 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:10,826] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 262 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:21,412] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 263 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:34,970] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 264 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:04:36,068] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 265 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:00,717] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 266 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:01,508] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 267 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:03,732] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 268 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:06,853] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 269 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:23,834] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 270 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:05:44,821] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 271 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:04,373] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 272 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:05,174] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 273 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:05,954] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 274 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:07,961] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 275 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:08,895] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 276 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:09,811] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 277 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:12,247] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 278 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:20,465] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 279 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:21,228] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 280 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:22,496] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 281 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:23,269] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 282 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:24,205] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 283 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-10 16:06:27,085] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 253) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_llama_nxd_ptl.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-04-10_16:02:56
host : pytorchjob-nxd-llama2-7b-ptl-master-0
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 255)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-04-10_16:02:56
host : pytorchjob-nxd-llama2-7b-ptl-master-0
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 256)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-10_16:02:56
host : pytorchjob-nxd-llama2-7b-ptl-master-0
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 253)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@attach-pvc:/efs/home/nxd-llama2-7b-ptl/logs/0# ls -al /efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_01_pp_rank_00_dp_rank_00.tensors
ls: cannot access '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt/tp_rank_01_pp_rank_00_dp_rank_00.tensors': No such file or directory
root@attach-pvc:/efs/home/nxd-llama2-7b-ptl/logs/0# ls -al /efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt
ls: cannot access '/efs/home/nxd-llama2-7b-ptl/checkpoints/epoch=0-step=100-v2.ckpt': No such file or directory
root@attach-pvc:/efs/home/nxd-llama2-7b-ptl/logs/0# ls -al /efs/home/nxd-llama2-7b-ptl/checkpoints/
total 24
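One plausible trigger is the ModelCheckpoint filename version counter diverging across ranks: rank 0 logs a save to `epoch=0-step=100-v1.ckpt`, while the failing ranks try to write under `epoch=0-step=100-v2.ckpt`. As an untested mitigation sketch (it assumes the example wires up its own ModelCheckpoint callback, and that your PyTorch Lightning version exposes the `enable_version_counter` flag, available in recent 2.x releases), pinning every rank to the same deterministic filename avoids the `-v1`/`-v2` split:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Hypothetical workaround, not a confirmed fix: keep all ranks writing to
# the same checkpoint path by using a deterministic filename template and
# disabling the "-v1"/"-v2" version-counter suffix.
checkpoint_callback = ModelCheckpoint(
    dirpath="/efs/home/nxd-llama2-7b-ptl/checkpoints",
    filename="epoch={epoch}-step={step}",
    save_top_k=1,
    enable_version_counter=False,  # no per-rank "-vN" suffix divergence
)
```

On Lightning versions without `enable_version_counter`, an alternative would be to make the filename unique per save (e.g. include the step) so the version counter never engages.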