awslabs / sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
Apache License 2.0

FileNotFoundError when using SageMaker Debugger with PyTorch Distributed Training on SageMaker #392

Open piyushghai opened 4 years ago

piyushghai commented 4 years ago

I am using a custom Docker image to run distributed training with PyTorch on SageMaker. The training script is taken from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Segmentation/MaskRCNN. The image uses the pytorch-training:1.6.0-gpu-py3 Deep Learning Container (DLC) as its base image.
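For context, this is roughly how the debugger hook ends up attached to the model. On SageMaker the hook is injected automatically by the smdebug runtime rather than by train_net.py itself; the sketch below shows the equivalent manual registration using the public smdebug API, with a toy module standing in for the Mask R-CNN model:

```python
import torch
import smdebug.pytorch as smd

model = torch.nn.Linear(4, 2)  # stand-in for the Mask R-CNN model

# On SageMaker, the hook parameters are written to a JSON config under
# /opt/ml/input/config; off-platform this call needs an explicit path.
hook = smd.Hook.create_from_json_file()
hook.register_module(model)  # installs the forward_pre_hook seen in the traceback below
```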

Following is the error traceback:

[1,9]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 550, in move
[1,13]<stdout>:    os.rename(src, real_dst)
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp' -> '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents'
[1,13]<stdout>:
[1,13]<stdout>:During handling of the above exception, another exception occurred:
[1,13]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,13]<stdout>:    "__main__", mod_spec)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>:    exec(code, run_globals)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,13]<stdout>:    main()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,13]<stdout>:    run_command_line(args)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,13]<stdout>:    run_path(sys.argv[0], run_name='__main__')
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,13]<stdout>:    pkg_name=pkg_name, script_name=fname)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,13]<stdout>:    mod_name, mod_spec, pkg_name, script_name)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>:    exec(code, run_globals)
[1,13]<stdout>:  File "train_net.py", line 306, in <module>
[1,13]<stdout>:    main()
[1,13]<stdout>:  File "train_net.py", line 298, in main
[1,13]<stdout>:    model = train(cfg, args)
[1,13]<stdout>:  File "train_net.py", line 165, in train
[1,13]<stdout>:    per_iter_end_callback_fn=per_iter_callback_fn,
[1,13]<stdout>:  File "/root/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/pytorch/maskrcnn_benchmark/engine/trainer.py", line 78, in do_train
[1,13]<stdout>:    loss_dict = model(images, targets)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 724, in _call_impl
[1,13]<stdout>:    result = hook(self, input)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 123, in forward_pre_hook
[1,13]<stdout>:    self._close_writers()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/hook.py", line 433, in _close_writers
[1,13]<stdout>:    self.writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/writer.py", line 201, in close
[1,13]<stdout>:    self._writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/event_file_writer.py", line 125, in close
[1,13]<stdout>:    self._ev_writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/events_writer.py", line 63, in close
[1,13]<stdout>:    self.tfrecord_writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfrecord/record_writer.py", line 81, in close
[1,13]<stdout>:    self._writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/access_layer/file.py", line 53, in close
[1,13]<stdout>:    shutil.move(self.temp_path, self.path)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 564, in move
[1,13]<stdout>:    copy_function(src, real_dst)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 263, in copy2
[1,13]<stdout>:    copyfile(src, dst, follow_symlinks=follow_symlinks)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
[1,13]<stdout>:    with open(src, 'rb') as fsrc:
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp'
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 13 in communicator MPI COMMUNICATOR 5 DUP FROM 0
with errorcode 1.

@Vikas-kum
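
For anyone triaging: the failing call is shutil.move(self.temp_path, self.path) in smdebug/core/access_layer/file.py, and although the log lines come from rank 13, the file being moved is named 000000000000_worker_0.tfevents, which suggests multiple ranks are resolving to the same writer path. If so, the first rank to close its writer moves the .tmp file into place, and every later rank hits this error. A minimal standalone reproduction of that suspected race (paths are hypothetical):

```python
import os
import shutil
import tempfile

# Two writers that believe they own the same event file: the first
# close() moves the .tmp file into place; the second close() then
# fails with the same chained FileNotFoundError as in the traceback above.
workdir = tempfile.mkdtemp()
tmp = os.path.join(workdir, "000000000000_worker_0.tfevents.tmp")
final = tmp[: -len(".tmp")]

open(tmp, "wb").close()   # the writer creates its temp file
shutil.move(tmp, final)   # "rank A" closes its writer: succeeds
shutil.move(tmp, final)   # "rank B" closes the same path: FileNotFoundError
```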

leleamol commented 4 years ago

A fix for 1.6 has been submitted that avoids registering the hook for non-training activities. It is currently under review.
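In the meantime, one way to unblock distributed jobs is to disable the debugger hook when launching the job from the SageMaker Python SDK, so smdebug never registers writers. A sketch assuming SDK v2 parameter names, with placeholder role, image URI, and instance settings:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_net.py",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<custom-dlc-image>",
    role="<sagemaker-execution-role>",
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    distribution={"mpi": {"enabled": True}},
    # Turn off SageMaker Debugger for this job: no smdebug hook is
    # registered and no event files are written under /opt/ml/output/tensors.
    debugger_hook_config=False,
)
estimator.fit()
```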

Vikas-kum commented 3 years ago

@leleamol Can you point to the fix PR?