[1,9]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>: File "/opt/conda/lib/python3.6/shutil.py", line 550, in move
[1,13]<stdout>: os.rename(src, real_dst)
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp' -> '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents'
[1,13]<stdout>:
[1,13]<stdout>:During handling of the above exception, another exception occurred:
[1,13]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,13]<stdout>: "__main__", mod_spec)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>: exec(code, run_globals)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,13]<stdout>: main()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,13]<stdout>: run_command_line(args)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,13]<stdout>: run_path(sys.argv[0], run_name='__main__')
[1,13]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,13]<stdout>: pkg_name=pkg_name, script_name=fname)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,13]<stdout>: mod_name, mod_spec, pkg_name, script_name)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>: exec(code, run_globals)
[1,13]<stdout>: File "train_net.py", line 306, in <module>
[1,13]<stdout>: main()
[1,13]<stdout>: File "train_net.py", line 298, in main
[1,13]<stdout>: model = train(cfg, args)
[1,13]<stdout>: File "train_net.py", line 165, in train
[1,13]<stdout>: per_iter_end_callback_fn=per_iter_callback_fn,
[1,13]<stdout>: File "/root/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/pytorch/maskrcnn_benchmark/engine/trainer.py", line 78, in do_train
[1,13]<stdout>: loss_dict = model(images, targets)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 724, in _call_impl
[1,13]<stdout>: result = hook(self, input)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 123, in forward_pre_hook
[1,13]<stdout>: self._close_writers()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/hook.py", line 433, in _close_writers
[1,13]<stdout>: self.writer.close()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/writer.py", line 201, in close
[1,13]<stdout>: self._writer.close()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/event_file_writer.py", line 125, in close
[1,13]<stdout>: self._ev_writer.close()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/events_writer.py", line 63, in close
[1,13]<stdout>: self.tfrecord_writer.close()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfrecord/record_writer.py", line 81, in close
[1,13]<stdout>: self._writer.close()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/access_layer/file.py", line 53, in close
[1,13]<stdout>: shutil.move(self.temp_path, self.path)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/shutil.py", line 564, in move
[1,13]<stdout>: copy_function(src, real_dst)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/shutil.py", line 263, in copy2
[1,13]<stdout>: copyfile(src, dst, follow_symlinks=follow_symlinks)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
[1,13]<stdout>: with open(src, 'rb') as fsrc:
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp'
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 13 in communicator MPI COMMUNICATOR 5 DUP FROM 0
with errorcode 1.
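The two chained `FileNotFoundError`s above match how `shutil.move` behaves when its source file is missing: `os.rename` fails first, then the copy fallback fails on the same missing path. The sketch below reproduces that pattern under one assumed cause that nothing in this report confirms: two writers staging output under the same `.tmp` name, so whichever closes second finds the file already moved. `TempFileWriter` and `reproduce` are hypothetical names for illustration and are not part of smdebug. The sketch assumes a POSIX filesystem.

```python
import os
import shutil
import tempfile


class TempFileWriter:
    """Hypothetical stand-in for a writer that stages output in a .tmp file."""

    def __init__(self, path):
        self.path = path
        self.temp_path = path + ".tmp"
        # Append mode, so a second writer does not truncate the first one's data.
        self._file = open(self.temp_path, "ab")

    def write(self, data):
        self._file.write(data)

    def close(self):
        self._file.close()
        # Same call as the last smdebug frame in the traceback.
        shutil.move(self.temp_path, self.path)


def reproduce(directory):
    """Close two writers that share one path and return the second close's error."""
    # Assumption: both writers resolve to the same worker name and file path.
    path = os.path.join(directory, "000000000000_worker_0.tfevents")
    first = TempFileWriter(path)
    second = TempFileWriter(path)
    first.write(b"a")
    second.write(b"b")
    first.close()  # moves the .tmp file to its final name
    try:
        second.close()  # the .tmp file is already gone
    except FileNotFoundError as err:
        return err
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        err = reproduce(d)
        print(type(err).__name__, "chained from", type(err.__context__).__name__)
```

If this is what happens in the real job, the fix would involve giving each rank a distinct worker name or output path, but that is a guess until the cause is confirmed.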
I am using a custom Docker image to run distributed PyTorch training on SageMaker. The training script is taken from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Segmentation/MaskRCNN. The image uses the DLC image
pytorch-training:1.6.0-gpu-py3
as its base image. The error traceback from the failing run is shown above.
@Vikas-kum